Compiling a model for Inferentia or Trainium - fixing the "No cached version found" message
Walk through the options for compiling a model for inference using Inferentia or Trainium. You would need to do this if the model or the configuration you want isn't available in the Hugging Face cache.
Why compile?
If you are trying to load a Hugging Face model using one of the Hugging Face containers, you may get a message that says
"No cached version found for {model_id} with {neuron_config}. You can start a discussion to request it on https://huggingface.co/aws-neuron/optimum-neuron-cache . Alternatively, you can export your own neuron model as explained in https://huggingface.co/docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-neuronx-tgi"
You may also be hitting this problem if SageMaker reports that your endpoint cannot be deployed; in that case, you will find the error above in the CloudWatch log for that endpoint. (Separately, SageMaker may show a message that says "your model has not been compiled for Inferentia" even though the endpoint deploys anyway. You can safely ignore that message as long as the endpoint comes up.)
You are getting this message because either the model you are trying to use or one of the configuration options you have specified has not been precompiled. It is also possible that the container you are deploying with doesn't match the version that was cached. Any change to the model, tensor parallelism (TP)/number of cores, batch size, sequence length, input estimate, compiler version, or other setting may require a recompile.
If you are deploying using the exact instructions from the Inferentia/Trainium tab on the "Deploy" dropdown on the Hugging Face model card, it is unlikely that you will see this error. However, if you change one of the configuration options or are trying to use the same instructions with a different model or container, you may need to compile.
Options for compiling
You can compile the model using the optimum-cli command from the Hugging Face Optimum Neuron library on an Inferentia or Trainium EC2 instance. To make sure that you are using all the same versions, it makes sense to do the compiling with the container itself. You can see examples of how to do that here and here.
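As a rough sketch, on a trn1/inf2 instance with Docker installed, a compilation run through the container might look like the following. The image URI and export options are borrowed from the SageMaker example later in this article; the model ID and output directory are placeholders, and the --device flag is one way to expose a Neuron device to the container (you may need to expose more devices for a higher num_cores):
docker run --entrypoint optimum-cli \
--device=/dev/neuron0 \
-v $(pwd):/data \
763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.22-neuronx-py310-ubuntu22.04-v1.0 \
export neuron --model dacorvo/tiny-random-llama \
--sequence_length 1024 --batch_size 1 --num_cores 2 \
/data/JimsModel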
Once you compile it, you can upload the compiled model back to Hugging Face and make the Model_ID the compiled model path on Hugging Face, or you can reference the local model directory you compiled into. Make sure the local model directory is passed into the container, and reference it using the path it appears at INSIDE the container. For instance, if you use the option -v $(pwd):/data and your output path is JimsModel, you would make your Model_ID "/data/JimsModel".
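For a local test on the same instance, a sketch of running the container against that compiled directory might look like the following. This assumes the container passes trailing arguments through to the standard TGI --model-id option and that one Neuron device is enough for your num_cores setting; adjust the image URI, port, and paths to match how you actually deploy:
docker run -p 8080:80 \
--device=/dev/neuron0 \
-v $(pwd):/data \
763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.22-neuronx-py310-ubuntu22.04-v1.0 \
--model-id /data/JimsModel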
If you are using SageMaker to deploy the compiled model, you can now reference the compiled model path on Hugging Face instead of the original model path. If you want to load it from an S3 path, you will need to compress it into a model.tar.gz file and upload it.
tar -cf model.tar.gz --use-compress-program=pigz JimsModel
Upload it to an S3 bucket that SageMaker has permissions to and then make your Model_ID the S3 URI.
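For example (the bucket name here reuses the placeholder account ID from the training-job example below, and the prefix is arbitrary; replace both with your own):
aws s3 cp model.tar.gz s3://sagemaker-us-west-2-123456789012/JimsModel/model.tar.gz
Your Model_ID then becomes s3://sagemaker-us-west-2-123456789012/JimsModel/model.tar.gz.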
Your default SageMaker bucket is usually a good location, and you can find it with this Python snippet in a SageMaker notebook:
import sagemaker
print(sagemaker.Session().default_bucket())
Compiling as a SageMaker training job
If you do NOT have access to an Inferentia or Trainium EC2 instance, you can compile your model using a SageMaker training job (or if you really want to impress your friends; you probably don't want to compile this way otherwise). This ends up doing the same thing as the process above, running the optimum-cli command inside the container with the same options, but it does not require ssh access to an EC2 instance.
You can run this from any command prompt with the aws client installed and configured. The easiest way to do this is to use the CloudShell in the AWS console.
aws --region us-west-2 sagemaker create-training-job \
--training-job-name TGIcompilationOutputNewFeb \
--role-arn arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-20240116T12345 \
--algorithm-specification '{"TrainingInputMode": "File",
"TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.22-neuronx-py310-ubuntu22.04-v1.0",
"ContainerEntrypoint": ["optimum-cli"],
"ContainerArguments": ["export", "neuron", "--model", "dacorvo/tiny-random-llama", "--sequence_length", "1024", "--batch_size", "1", "--num_cores", "2", "/opt/ml/output/data/tinyllamatestdocker/"]}' \
--output-data-config '{"S3OutputPath": "s3://sagemaker-us-west-2-123456789012/CompileTest/"}' \
--resource-config '{"VolumeSizeInGB":10,"InstanceCount":1,"InstanceType":"ml.trn1.32xlarge"}' \
--stopping-condition '{"MaxRuntimeInSeconds": 1800}'
A few comments:
Make sure that you requested a quota increase for a SageMaker training job with the correct instance type. Your request should match the InstanceType in the resource-config section above. There are no Inferentia instance types available for training jobs, so your choices are ml.trn1.2xlarge for num_cores=2, or ml.trn1.32xlarge for any num_cores>2.
Make sure the training-job-name is unique for each run. It will also end up as part of the output path.
Replace the role-arn and S3OutputPath with your own values. You can find your role listed in your SageMaker Domain details in the Amazon SageMaker AI section of the AWS console. The default SageMaker bucket mentioned earlier works well for the S3OutputPath (a sketch of retrieving the compiled output appears after these comments).
Update the ContainerArguments according to the EC2 examples linked above. You will need to change the model to the original model path on Hugging Face, and your sequence_length, batch_size, num_cores, and other options all need to match what you will be deploying.
For the TrainingImage, use the same image path that you will be using when you deploy. That will ensure that everything is the same version.
You may need to increase your VolumeSizeInGB for larger models.
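Once the training job completes, the compiled model written under /opt/ml/output/data/ inside the container is packaged and uploaded under your S3OutputPath. As a sketch, using the job name and bucket from the example above and assuming SageMaker's usual {S3OutputPath}/{training-job-name}/output/output.tar.gz layout for the output data, you can check the job and pull the artifacts down like this:
aws --region us-west-2 sagemaker describe-training-job \
--training-job-name TGIcompilationOutputNewFeb \
--query TrainingJobStatus
aws s3 cp s3://sagemaker-us-west-2-123456789012/CompileTest/TGIcompilationOutputNewFeb/output/output.tar.gz .
tar -xzf output.tar.gz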