Accelerating SageMaker Training Jobs running on AWS Trainium

5 minute read
Content level: Intermediate

A step-by-step guide on using ahead-of-time compilation with the neuron_parallel_compile utility to speed up SageMaker training jobs running on Amazon EC2 Trn1 (AWS Trainium) instances by up to 10x.

Authored by Vijay Niles and Scott Perry

ML frameworks, such as PyTorch and TensorFlow, leverage compilers that take high-level descriptions of machine learning models, often in the form of computational graphs, and translate them into lower-level representations that can be efficiently executed on specific hardware architectures (e.g., GPUs, TPUs). These optimizations may include parallelization, vectorization, and other techniques to make better use of the available hardware resources. Pre-compiling models in advance can result in training jobs running up to 10x faster.

Trn1 (AWS Trainium) EC2 instances leverage the Neuron SDK, the software stack that includes the Neuron hardware driver, user tools, framework integration, and compiler. Before you can train your model on Trn1 (AWS Trainium) EC2 instances, the Neuron compiler must complete a compilation step that converts your model from its standard ML framework-level representation to a Neuron Executable File Format (NEFF) binary. The Neuron compiler accepts machine learning models in various formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices.

The Neuron compiler supports two compilation methods:

  • just-in-time (JIT) compilation (default)
  • ahead-of-time compilation with neuron_parallel_compile

PyTorch Neuron defaults to just-in-time (JIT) compilation of graphs during execution: at every step, a graph is traced, and if the traced graph differs from previous executions, it is compiled by the Neuron compiler. JIT compilation can help speed up the developer workflow; however, graphs are compiled sequentially, which can lead to much longer compilation times compared to neuron_parallel_compile.

To reduce this compilation time during execution, the neuron_parallel_compile utility is provided as part of the PyTorch Neuron installation. The neuron_parallel_compile utility extracts graphs from a trial run of your script, performs parallel pre-compilation of the graphs, and populates the Neuron Cache on disk or at an Amazon S3 location with the compiled graphs. This pre-compilation run should be limited to a few training steps (e.g., fewer than 100), enough for the utility to explore all code branches present in your training script and extract the different graphs needed for full execution. Once neuron_parallel_compile finishes compiling all graphs, it copies the compilation results into the Neuron Cache directory (which can be a specified S3 location). You can then point subsequent training runs at that Neuron Cache directory so the precompiled graphs are used, avoiding recompilation. As a benchmark, compiling Llama2 7B with sequences of 4k tokens takes approximately 3 minutes with parallel compilation across 16 trn1.32xlarge nodes, and around 5 minutes across 4 trn1.32xlarge nodes.

With recent versions of the Neuron Deep Learning Containers (DLCs), you can leverage the neuron_parallel_compile utility with SageMaker training jobs by setting the RUN_NEURON_PARALLEL_COMPILE="1" environment variable within the SageMaker Estimator class.

The specific steps to enable this workflow are as follows:

1. Ahead-of-time compilation SageMaker training run:

  1. The first training run will use ahead-of-time compilation by setting the RUN_NEURON_PARALLEL_COMPILE="1" environment variable in the SageMaker Estimator class. Please note that the values output by the training script when using neuron_parallel_compile are placeholder values and should be disregarded (e.g., loss_value=0).
  2. You will also specify an S3 URL at which to store your Neuron Persistent Cache files by using the NEURON_COMPILE_CACHE_URL environment variable in the SageMaker Estimator class. The Neuron SDK will check the specified S3 location for available Neuron Persistent Cache files and will upload Neuron Persistent Cache files once the training job is complete.
  3. To minimize the number of training steps, set the max_steps hyperparameter to fewer than 100 steps. Ensure that max_steps is still large enough for the neuron_parallel_compile utility to explore and extract the different graphs needed for full execution; in most cases, 100 steps will suffice.

Please see the screenshot below, where the above environment variables and hyperparameters have been set in the SageMaker Estimator:

[Screenshot: SageMaker Estimator with the RUN_NEURON_PARALLEL_COMPILE and NEURON_COMPILE_CACHE_URL environment variables and the max_steps hyperparameter set]
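As a minimal sketch of that configuration (assuming a hypothetical train.py entry point, source directory, S3 bucket, and Neuron DLC image URI; substitute your own values), the pre-compilation Estimator might look like the following:

```python
import sagemaker
from sagemaker.pytorch import PyTorch

# Illustrative values -- replace with your own role, Neuron DLC image, script, and S3 bucket.
role = sagemaker.get_execution_role()
cache_url = "s3://<your-bucket>/neuron-cache/"   # Neuron Persistent Cache location in S3
image_uri = "<neuron-dlc-training-image-uri>"    # a recent Neuron Deep Learning Container

# Pre-compilation run: neuron_parallel_compile traces the training script,
# compiles the extracted graphs in parallel, and populates the cache in S3.
estimator = PyTorch(
    entry_point="train.py",            # your Neuron training script (hypothetical name)
    source_dir="scripts",
    role=role,
    image_uri=image_uri,
    instance_type="ml.trn1.32xlarge",
    instance_count=1,
    environment={
        "RUN_NEURON_PARALLEL_COMPILE": "1",      # enable ahead-of-time compilation
        "NEURON_COMPILE_CACHE_URL": cache_url,   # where compiled graphs are stored
    },
    hyperparameters={
        "max_steps": 100,              # short trial run; reported metrics are placeholders
    },
)

estimator.fit()
```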

2. Subsequent SageMaker training runs leveraging the precompiled Neuron Persistent Cache files located in S3:

  1. You run subsequent training jobs without setting the RUN_NEURON_PARALLEL_COMPILE="1" environment variable, but still pass in NEURON_COMPILE_CACHE_URL. The Neuron SDK will check the specified S3 URL and download the Neuron Persistent Cache, which contains all of the precompiled graphs to be used for subsequent training runs.

Please see the screenshot below showing the changes to the SageMaker Estimator class for subsequent training runs:

Note: You will need to rerun the above compilation step if you change the training script or the model.

[Screenshot: SageMaker Estimator for subsequent training runs, with NEURON_COMPILE_CACHE_URL set and RUN_NEURON_PARALLEL_COMPILE removed]
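A corresponding minimal sketch for the subsequent run (same hypothetical script, bucket, and image URI as above) simply drops RUN_NEURON_PARALLEL_COMPILE and keeps the cache URL:

```python
from sagemaker.pytorch import PyTorch

# Same illustrative values as in the pre-compilation sketch above.
role = "<your-sagemaker-execution-role-arn>"
cache_url = "s3://<your-bucket>/neuron-cache/"   # same cache location as the pre-compilation run
image_uri = "<neuron-dlc-training-image-uri>"

# Full training run: RUN_NEURON_PARALLEL_COMPILE is omitted, so the job trains normally
# and reuses the precompiled graphs downloaded from the Neuron Persistent Cache in S3.
estimator = PyTorch(
    entry_point="train.py",
    source_dir="scripts",
    role=role,
    image_uri=image_uri,
    instance_type="ml.trn1.32xlarge",
    instance_count=1,
    environment={
        "NEURON_COMPILE_CACHE_URL": cache_url,   # cache populated by the pre-compilation run
    },
    hyperparameters={
        "max_steps": 10000,                      # full training schedule (illustrative value)
    },
)

estimator.fit()
```

Because the cache already contains the compiled graphs, the Neuron SDK skips recompilation and the job proceeds directly to training.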

You can download an example notebook here.


About the Authors

Vijay Niles is a Solutions Architect with AWS, where he helps Independent Software Vendors build industry-leading software leveraging cloud native technologies. Vijay also has a keen interest in supporting organizations in integrating Artificial Intelligence and Machine Learning services to solve challenging business problems. Outside of work, he enjoys exploring the great outdoors and cooking cuisines from around the world.

Scott Perry is a Solutions Architect on the Annapurna ML accelerator team at AWS. Based in Canada, he helps customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium. His interests include large language models, deep reinforcement learning, IoT, and genomics.
