Train a large language model using Hugging Face and AWS Trainium

13 minute read
Content level: Intermediate

A step-by-step guide to deploying existing Hugging Face training scripts on Amazon EC2 Trn1 instances, featuring AWS Trainium

Authored by Bruno Pistone and Matt McClean

AWS recently announced the general availability of Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances. Amazon EC2 Trn1 instances are powered by AWS Trainium chips, the second-generation machine learning (ML) accelerator purpose-built by AWS for high-performance deep learning (DL) training. Trn1 instances deliver the highest performance on training of popular natural language processing (NLP) models on AWS, while offering up to 50% cost savings over comparable GPU-based EC2 instances. Data scientists can unlock these benefits by using the AWS Neuron SDK, which is integrated with leading ML frameworks and libraries, such as PyTorch, TensorFlow, Megatron-LM, and Hugging Face. Using Neuron, developers can train natural language processing (NLP), computer vision, and recommender models on Trainium with only a few lines of code changes.

Let's dive into the NLP domain. State-of-the-art models for this domain have changed dramatically in recent years and are now dominated by Transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers) and GPT-3 (Generative Pre-trained Transformer). The Transformer architecture is characterized by the usage of an “Attention Mechanism”, which allows processing the entire input all at once, unlike other common neural network (NN) architectures, such as recurrent neural networks (RNNs). Although this architecture is widely used in the NLP domain, it is gaining popularity in other ML areas such as Robotics, Health Care & Life Sciences (HCLS), and Computer Vision. For example, the Vision Transformer (ViT) is a transformer model built for computer vision applications.

Trainium is ideal for Transformers

The usage of popular state-of-the-art Transformer architectures, such as the previously mentioned BERT and GPT, has been simplified by the open-source AI community Hugging Face with their Python-based library called Transformers. Transformer models are compute intensive and ideally suited to be trained on AWS Trainium devices, with their dedicated matrix multiplication engine and highly parallel computational design. In this post, you will learn how to train a Hugging Face BERT model on AWS Trainium for a sentiment analysis use case. We will use PyTorch and show you how to set up a trn1.32xlarge instance and configure the Neuron components with an Amazon Linux AMI. We will cover how you can adapt your PyTorch code to train an ML model by using a single NeuronCore, and how to run distributed training across the 32 NeuronCores in a trn1.32xlarge instance.

Use case overview

Let us look at a sentiment analysis use case using a dataset of Tweets. We will start from a pre-trained bert-base-cased model from Hugging Face and train a multi-class classification model to detect the sentiment values “positive”, “negative”, or “neutral”. The full code and the scripts used in this post are available in the AWS Neuron Samples repo.


Infrastructure Setup for AWS Trainium

To set up an EC2 Trn1 instance with the necessary Neuron drivers and packages, you can use either an Ubuntu-based Deep Learning Neuron AMI or an Amazon Linux 2-based Deep Learning Neuron AMI.

Launch an EC2 Trn1 instance based on the Ubuntu Deep Learning Neuron AMI by entering the text “Deep Learning AMI Neuron”, then selecting the QuickStart AMI and the trn1.32xlarge instance type in the Oregon or N. Virginia region* as shown below. The selected AMI provides a fully working environment for AWS Neuron, with all the modules necessary for using the AWS Neuron SDK, such as torch-neuronx, neuronx-cc, and numpy, already installed and configured.

(*At the time of publication of this blog post, Trn1 instances are available in the Oregon and N. Virginia regions)


To activate the pre-built PyTorch environment, run

source /opt/aws_neuron_venv_pytorch/bin/activate
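
To confirm the environment is working, you can optionally run a quick sanity check from the Python interpreter (a minimal sketch, separate from the training script, assuming the venv above is active):

import torch_xla.core.xla_model as xm

# Listing the XLA device confirms that torch-neuronx and the Neuron runtime
# can see the Trainium NeuronCores
print(xm.xla_device())  # e.g. xla:0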

Traditional Hugging Face Transformer Model Training

In order to train an NLP BERT model, we start from the Hugging Face guide for training a native PyTorch model, which is designed to be run on a CPU or GPU based instance. We are going to load the dataset from a CSV file, fetch a pre-trained model using the Hugging Face Transformers library, fine-tune the model, and evaluate the training progress.

The DLAMI has many of the necessary Python modules installed, including the Hugging Face Transformers library (transformers==4.24.0). The AWS Neuron SDK requires a numpy version <=1.20.0. The datasets package will be used to encapsulate the training data in a DatasetDict object, tokenize it, and iterate over the data samples during the training process.

Let’s start from a Python script designed to be run on a CPU or GPU based instance.

import logging
import torch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

num_epochs = 6
batch_size = 8

device = "cuda" if torch.cuda.is_available() else "cpu"
logger.info("Device: {}".format(device))

The next code snippet reads the dataset from a .csv file and creates a DatasetDict used for tokenizing the input for the model.

import csv
import pandas as pd
from datasets import Dataset, DatasetDict

train = pd.read_csv(
        "./../../data/train.csv",
        sep=',',
        quotechar='"',
        quoting=csv.QUOTE_ALL,
        escapechar='\\',
        encoding='utf-8',
        error_bad_lines=False
    )

train_dataset = Dataset.from_dict(train)

hg_dataset = DatasetDict({"train": train_dataset})

Next, we perform tokenization and encoding on the initial dataset. We load a pre-trained tokenizer for the model name bert-base-cased by using the generic AutoTokenizer class provided by the Hugging Face Transformers library.

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

ds_encoded = hg_dataset.map(tokenize_and_encode, batched=True, remove_columns=["text"])

ds_encoded.set_format("torch")

train_dl = DataLoader(ds_encoded["train"], shuffle=True, batch_size=batch_size)
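
The tokenize_and_encode function passed to map above is defined in the full sample script; a minimal sketch, assuming the dataset has a text column containing the Tweet (the maximum sequence length of 128 is an illustrative choice), could look like this:

def tokenize_and_encode(batch):
    # Tokenize a batch of texts, padding/truncating to a fixed length so
    # every sample has the same shape
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)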

We load the pre-trained model for the sequence classification task by using the AutoModelForSequenceClassification class.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3).to(device)

And now we are ready to begin the training loop on a CPU or GPU based instance.

import os
from time import gmtime, strftime
from tqdm.auto import tqdm
from torch.optim import AdamW
from transformers import get_scheduler

current_timestamp = strftime("%Y-%m-%d-%H-%M", gmtime())

optimizer = AdamW(model.parameters(), lr=1.45e-4)

num_training_steps = num_epochs * len(train_dl)
progress_bar = tqdm(range(num_training_steps))
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

logger.info("Start training: {}".format(strftime("%Y-%m-%d %H:%M:%S", gmtime())))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

logger.info("End training: {}".format(strftime("%Y-%m-%d %H:%M:%S", gmtime())))

os.makedirs("./../../models/checkpoints/{}".format(current_timestamp), exist_ok=True)
torch.save(model, "./../../models/checkpoints/{}/checkpoint.pt".format(current_timestamp))

A complete version of the script can be found here.

Adapt the code to train on a single NeuronCore in Trainium

Now we will adapt the training script so that it can run on an EC2 Trn1 instance. We will start by training on a single NeuronCore and then show how to utilize all 32 NeuronCores on the trn1.32xlarge.

In order to train the ML model on a single NeuronCore, we need to make some small changes to our original code. The AWS Neuron SDK is plugged into PyTorch through the PyTorch/XLA module, which is used to train the PyTorch model on an XLA compatible device, such as AWS Trainium. PyTorch/XLA is a Python package built on top of the XLA deep learning compiler, a domain-specific compiler for linear algebra that can accelerate TensorFlow and PyTorch models. The PyTorch/XLA package is used for connecting the PyTorch framework with AWS Trainium, with minimal code changes.
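
As a quick illustration of the XLA execution model (a standalone sketch, separate from the training script): operations on XLA tensors are recorded into a graph and only compiled and executed when the graph is materialized, for example by calling xm.mark_step().

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()   # the NeuronCore exposed as an XLA device

a = torch.ones(2, 2, device=device)
b = a * 2 + 1              # recorded in the XLA graph, not executed yet
xm.mark_step()             # compile and run the accumulated graph
print(b.cpu())             # copy the result back to the host and print it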

The first change is to import the PyTorch/XLA module torch_xla.core.xla_model and set our device to xla.

import logging
import torch_xla.core.xla_model as xm

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

num_epochs = 6
batch_size = 8

device = "xla"
logger.info("Device: {}".format(device))

The rest of the code remains the same as in the previous section for CPU/GPU. The last change is to add a single line of code, xm.mark_step(), just after the lr_scheduler.step() call. Calling this method at the end of each batch iteration makes XLA run its current graph, update the model's parameters, and mark the end of a training step on the NeuronCore.

To make sure the model weights are properly saved from the Trainium device to the instance storage, we use the XLA function xm.save() to write them to disk.

import os
from time import gmtime, strftime
from tqdm.auto import tqdm
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3).to(device)

current_timestamp = strftime("%Y-%m-%d-%H-%M", gmtime())

optimizer = AdamW(model.parameters(), lr=1.45e-4)

num_training_steps = num_epochs * len(train_dl)
progress_bar = tqdm(range(num_training_steps))
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

logger.info("Start training: {}".format(strftime("%Y-%m-%d %H:%M:%S", gmtime())))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        xm.mark_step()
        optimizer.zero_grad()
        progress_bar.update(1)

logger.info("End training: {}".format(strftime("%Y-%m-%d %H:%M:%S", gmtime())))

os.makedirs("./../../models/checkpoints/{}".format(current_timestamp), exist_ok=True)
checkpoint = {"state_dict": model.state_dict()}
xm.save(checkpoint, "./../../models/checkpoints/{}/checkpoint.pt".format(current_timestamp))

The two scripts detailed in this section are available on GitHub.

To run the training script on a single NeuronCore, we run:

python3 train.py

To monitor the NeuronCores used during the training, we run the command neuron-top and check the core utilization.

In this image, we can see that once the model has been compiled (an operation performed on the CPU), CPU activity drops close to 0% and the training of the model runs on the first NeuronCore (NC) of the first Neuron Device (ND), whose utilization ranges between 80% and 100%.

Adapt the code to distribute training across all the NeuronCores in Trainium

In this section, we cover the small Python code changes needed to distribute training across all 32 NeuronCores available on the trn1.32xlarge instance. We will use data parallelism: a worker process with its own copy of the model runs on each NeuronCore, the data is sharded across all of the workers, and the gradients are aggregated in the back-propagation step to ensure the model updates stay the same across workers during training.

The first step is to import the PyTorch distributed training modules, such as torch_xla.distributed.xla_backend, initialize the process group through torch.distributed.init_process_group(), and read the number of cores collaborating on the job, passed by the AWS Neuron SDK, through xm.xrt_world_size().

import logging
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

num_epochs = 6
batch_size = 8

device = "xla"

torch.distributed.init_process_group(device)

world_size = xm.xrt_world_size()

logger.info("Workers: {}".format(world_size))
logger.info("Device: {}".format(device))

The second step is to modify the data loading. Most of the code below works whether we use a single NeuronCore or multiple NeuronCores. To use multiple cores and distribute the dataset across them, we need additional code that builds a sampler and restricts each worker's data loading to only a portion of the dataset. This is done by checking the number of cores reported by the AWS Neuron SDK in the world_size parameter and creating a DistributedSampler, which is passed as an attribute to the DataLoader object instantiated one step below.

Each portion of the dataset, created by the DistributedSampler on the basis of the number of NeuronCores provided, is loaded onto its Neuron Device by torch_xla.distributed.parallel_loader.MpDeviceLoader, which wraps the iterable DataLoader object.

import csv
import pandas as pd
from datasets import Dataset, DatasetDict
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torch_xla.distributed.parallel_loader as pl
from transformers import AutoTokenizer

train = pd.read_csv(
    "./../../data/train.csv",
    sep=',',
    quotechar='"',
    quoting=csv.QUOTE_ALL,
    escapechar='\\',
    encoding='utf-8',
    error_bad_lines=False
)

train_dataset = Dataset.from_dict(train)

hg_dataset = DatasetDict({"train": train_dataset})

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

ds_encoded = hg_dataset.map(tokenize_and_encode, batched=True, remove_columns=["text"])

ds_encoded.set_format("torch")

## Create a distributed sampler to shard the training data across the NeuronCores
train_sampler = None
if world_size > 1:
    train_sampler = DistributedSampler(
        ds_encoded["train"],
        num_replicas=world_size,
        rank=xm.get_ordinal(),
        shuffle=True,
    )

train_dl = DataLoader(
    ds_encoded["train"],
    batch_size=batch_size,
    sampler=train_sampler,
    shuffle=False if train_sampler else True)
    
train_device_loader = pl.MpDeviceLoader(train_dl, device)

The ML model, created by AutoModelForSequenceClassification.from_pretrained, is replicated across all the provided NeuronCores, and each model replica handles the portion of the input dataset defined by the DistributedSampler.

During the training loop, with distributed training we have to gather the gradient updates from all the NeuronCores. This is done by replacing the usual optimizer.step() call with xm.optimizer_step(optimizer), which consolidates the gradients across the NeuronCores and issues the XLA device step computation.

import os
from time import gmtime, strftime
from tqdm.auto import tqdm
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3).to(device)

current_timestamp = strftime("%Y-%m-%d-%H-%M", gmtime())

optimizer = AdamW(model.parameters(), lr=1.45e-4)

num_training_steps = num_epochs * len(train_dl)
progress_bar = tqdm(range(num_training_steps))
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

logger.info("Start training: {}".format(strftime("%Y-%m-%d %H:%M:%S", gmtime())))

model.train()
for epoch in range(num_epochs):
    for batch in train_device_loader:  # MpDeviceLoader already moves each batch to the XLA device
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        xm.optimizer_step(optimizer)  # consolidate gradients across NeuronCores and run the XLA step
        lr_scheduler.step()
        progress_bar.update(1)

logger.info("End training: {}".format(strftime("%Y-%m-%d %H:%M:%S", gmtime())))

os.makedirs("./../../models/checkpoints/{}".format(current_timestamp), exist_ok=True)
checkpoint = {"state_dict": model.state_dict()}
xm.save(checkpoint, "./../../models/checkpoints/{}/checkpoint.pt".format(current_timestamp))

To run the training script across multiple NeuronCores in parallel, we run the following commands:

export TOKENIZERS_PARALLELISM=false
torchrun --nproc_per_node=32 train.py

The first line disables parallelism in the Hugging Face tokenizers library, because we don't want the tokenization step forked into every one of the worker processes. The --nproc_per_node argument sets the number of worker processes (one per NeuronCore), which can be 1, 2, 8, or 32. Since we are using the trn1.32xlarge, we want to use all 32 NeuronCores for this training job.
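
Note that with data parallelism each worker keeps the per-device batch size of 8, so the effective global batch size grows with the number of NeuronCores. A quick illustration, using the values from this post:

batch_size = 8            # per-NeuronCore batch size used in this post
world_size = 32           # worker processes, one per NeuronCore on trn1.32xlarge

global_batch_size = batch_size * world_size
print(global_batch_size)  # 256 samples processed per optimizer step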

For monitoring the NeuronCores used during the training, we run the command neuron-top and check the cores' utilization. In this case, we can see that the training job is spread across all 32 NeuronCores: the two NeuronCores (NC0 and NC1) present in each of the Neuron Devices (ND0...ND15). We can also see from the memory utilization that the model has been loaded into every NeuronCore on the instance.

Train on Trainium with only a few lines of code changes

In this post, we saw that we could keep using PyTorch and needed only minimal code changes to train a Hugging Face BERT model on AWS Trainium. If you want to learn more about AWS Trainium and the other ML chips provided by AWS, check the official documentation pages for AWS Trainium and AWS Inferentia. To get started with these purpose-built accelerators, visit the AWS Neuron documentation and look at the official examples provided in the AWS Neuron Samples repo.


References

  1. https://aws.amazon.com/ec2/instance-types/trn1/
  2. https://github.com/aws-neuron/aws-neuron-samples

About the Authors

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size to help them deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His fields of expertise are end-to-end machine learning, machine learning industrialization, and MLOps. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

Matt McClean is the Annapurna ML Solution Architecture lead at AWS. His team helps customers with Machine Learning solutions based on AWS Inferentia and Trainium offerings. In his spare time, he is a passionate skier and cyclist.