
How do I prevent or troubleshoot out of memory issues in an Amazon SageMaker endpoint?


I want to prevent or troubleshoot out of memory issues in an Amazon SageMaker endpoint.

Resolution

To prevent or troubleshoot out of memory issues in a SageMaker endpoint, complete the following steps:

For CPU-based models

If you deploy a model that runs on CPU and has endpoint memory issues, then use the following best practices:

If you use SageMaker built-in algorithm containers, then use the model_server_workers parameter to limit the number of workers. For more information, see model_server_workers on the SageMaker website. Start with a value of 1, and then gradually increase the value to find the maximum number of workers that your endpoint can have.
Note: When you increase the model_server_workers value, you also increase the number of model copies that are created. As a result, your memory requirement increases.
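
For example, the following is a minimal sketch that sets model_server_workers when you deploy with a SageMaker Python SDK framework model class (a PyTorch model is assumed here; the model artifact path, IAM role, and framework versions are placeholders):

# Minimal sketch: limit the number of model server workers with the SageMaker Python SDK.
# The S3 path, IAM role, and framework versions are placeholders.
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://amzn-s3-demo-bucket/model/model.tar.gz",               # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",           # placeholder
    entry_point="inference.py",
    framework_version="2.1",
    py_version="py310",
    model_server_workers=1,   # start with 1 worker, then increase gradually
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)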

To monitor memory usage on SageMaker, use Amazon CloudWatch.
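
For example, the following sketch pulls the endpoint's MemoryUtilization metric from the /aws/sagemaker/Endpoints namespace with boto3. The endpoint and variant names are placeholders:

# Minimal sketch: check endpoint memory utilization over the last hour.
# The endpoint and variant names are placeholders.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/Endpoints",
    MetricName="MemoryUtilization",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},   # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])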

If your endpoint instance type can only accommodate a single model copy, then increase the instance type to a type that has more memory. For more information, see Amazon SageMaker pricing.

To test the endpoint and monitor memory usage while invocations run locally, use SageMaker local mode. For more information, see Local Mode on the SageMaker website. For consistent results, run the local test on the same instance type that your endpoint uses.
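
The following is a minimal local-mode sketch, assuming a PyTorch model and Docker on the machine that runs the test; the model artifact and role are placeholders:

# Minimal sketch: deploy the same model in SageMaker local mode for testing.
# Requires Docker on the machine that runs the test; model details are placeholders.
from sagemaker.pytorch import PyTorchModel

local_model = PyTorchModel(
    model_data="file://./model/model.tar.gz",                               # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",           # placeholder
    entry_point="inference.py",
    framework_version="2.1",
    py_version="py310",
)

# instance_type="local" runs the container on the current machine, so you can
# watch container memory (for example, with docker stats) while you invoke it.
local_predictor = local_model.deploy(
    initial_instance_count=1,
    instance_type="local",
)
# Call local_predictor.predict(...) with your test payload and monitor memory usage.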

If you can't increase the memory of a single instance for your endpoint, then use auto scaling. Auto scaling allows you to automatically adjust the number of instances based on your workload demands for optimal performance and resource utilization. For more information, see Optimize your machine learning deployments with auto scaling on Amazon SageMaker.
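
For example, the following sketch registers the endpoint variant with Application Auto Scaling and adds a target-tracking policy on invocations per instance. The endpoint name, capacity limits, and target value are placeholders:

# Minimal sketch: target-tracking auto scaling for a SageMaker endpoint variant.
# The endpoint name, variant name, capacity limits, and target value are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"   # placeholder

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="endpoint-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,   # placeholder: invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)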

To identify the instance type and configuration required for your endpoint, use the Inference Recommender.
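
The following is a minimal sketch that starts a default Inference Recommender job for a model that's registered in the model registry. The job name, role, and model package ARN are placeholders:

# Minimal sketch: start a default Inference Recommender job for a registered model.
# The job name, role, and model package ARN are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_inference_recommendations_job(
    JobName="endpoint-sizing-recommendation",                               # placeholder
    JobType="Default",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",        # placeholder
    InputConfig={
        "ModelPackageVersionArn": (
            "arn:aws:sagemaker:us-east-1:111122223333:model-package/my-model/1"   # placeholder
        )
    },
)

# Review the recommended instance types and configurations when the job completes.
print(sm.describe_inference_recommendations_job(JobName="endpoint-sizing-recommendation"))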

For GPU-based models

If you deploy a model that runs on GPU and has endpoint memory issues, then use the following best practices:

To calculate the GPU memory required to load the model weights, use the following formula.
Example for running Llama 2 13B:

Model Size Calculation:
Parameters (13B) × 4 bytes (FP32) = 52 GB

Total Memory Required = Initial weights (52 GB) + Attention cache and token generation memory (4-10 GB)* + Additional overhead (2-3 GB)

* Depends on sequence length, batching strategy, and model architecture.

Memory Precision Comparisons:
• FP32 (Full Precision):           Base reference (4 bytes per parameter)
• FP16 (Half Precision):           1/2 of FP32
• BF16 (Brain Float 16):           1/2 of FP32
• INT8 (8-bit Integer):            1/4 of FP32
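
As a quick sketch, you can compute the same estimate in Python. The cache and overhead values mirror the approximate ranges above:

# Minimal sketch: rough GPU memory estimate for loading model weights.
# Cache and overhead values are approximations taken from the ranges above.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def estimate_gpu_memory_gb(params_billions, precision="fp32",
                           cache_gb=10.0, overhead_gb=3.0):
    """Return an approximate total GPU memory requirement in GB."""
    # 1 billion parameters at 1 byte each is roughly 1 GB
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb + cache_gb + overhead_gb

# Llama 2 13B: about 52 GB of FP32 weights plus cache and overhead
print(estimate_gpu_memory_gb(13, "fp32"))   # ~65 GB
print(estimate_gpu_memory_gb(13, "fp16"))   # ~39 GB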

If your model requires more memory than the GPU has, then use quantization, tensor parallelism, and continuous batching to optimize performance. For more information, see Deployment of LLMs using Amazon SageMaker and Select an endpoint deployment configuration. If you deploy an LLM that's available on Hugging Face, then use the model memory estimator to identify estimated model memory requirements. For more information, see Model memory estimator on the Hugging Face website.
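
The following is a minimal sketch of an LLM deployment with a Large Model Inference (LMI) container that turns on quantization, tensor parallelism, and continuous batching. The image URI, model ID, and environment variable names are assumptions to verify against the LMI container version that you use:

# Minimal sketch: deploy an LLM with an LMI (DJL Serving) container and
# quantization, tensor parallelism, and continuous batching turned on.
# The image URI, model ID, and environment variable names are assumptions.
from sagemaker import Model

model = Model(
    image_uri="<lmi-container-image-uri>",                                  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",           # placeholder
    env={
        "HF_MODEL_ID": "meta-llama/Llama-2-13b-hf",   # placeholder model
        "TENSOR_PARALLEL_DEGREE": "4",                # shard weights across 4 GPUs
        "OPTION_QUANTIZE": "awq",                     # quantize weights
        "OPTION_ROLLING_BATCH": "lmi-dist",           # continuous (rolling) batching
        "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
    },
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",   # 4 GPUs with 24 GB each
)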

To identify the optimal batch size for your model and available GPU memory, see Improve throughput performance of Llama 2 models using Amazon SageMaker.

If your model handles long-range dependencies, then adjust the sequence length. For more information, see Boost inference performance for LLMs with new Amazon SageMaker containers.

Make sure that the model's GPU memory allocation is configured correctly. To track GPU memory consumption, use monitoring tools such as nvidia-smi. For more information, see System Management Interface SMI on the NVIDIA website. Also, to help identify and resolve GPU memory issues, enhance your inference script with extra logging statements.
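
For example, you can add lightweight GPU memory logging to your inference script with PyTorch's CUDA memory counters. The following is a sketch to adapt to your serving framework:

# Minimal sketch: log GPU memory from an inference script with PyTorch counters.
import logging
import torch

logger = logging.getLogger(__name__)

def log_gpu_memory(tag):
    """Log allocated and reserved CUDA memory for each visible GPU."""
    if not torch.cuda.is_available():
        return
    for device in range(torch.cuda.device_count()):
        allocated_gb = torch.cuda.memory_allocated(device) / 1024**3
        reserved_gb = torch.cuda.memory_reserved(device) / 1024**3
        logger.info("%s - GPU %d: allocated %.2f GB, reserved %.2f GB",
                    tag, device, allocated_gb, reserved_gb)

# Example: call around model loading and each prediction
# log_gpu_memory("after_model_load")
# log_gpu_memory("after_inference")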

Troubleshoot common memory-related errors

"botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (503) from primary with message "{"code": 503, "type": "ServiceUnavailableException", "message": "No worker is available to serve request: model"}"

If you receive the preceding error, then complete the following steps:

  • To identify memory-related issues, review your endpoint's CloudWatch logs.
  • To check that the endpoint instance can manage simultaneous requests, inspect your container configuration. Make sure that multiple workers are available to process incoming requests efficiently.
  • To support multiple workers, adjust the model_server_workers parameter. For more information, see model_server_workers on the SageMaker website. If you use frameworks such as TorchServe to deploy a model, then configure the minimum and maximum workers based on your use case.
  • To identify the optimal configuration for the endpoint, load test the endpoint. If your container doesn't have enough resources to handle multiple workers, then configure auto scaling to distribute the load to multiple instances.

"torch.cuda.OutOfMemoryError: CUDA out of memory."

If the preceding error occurs during the endpoint deployment phase, then complete the following steps:

  • Check the memory requirements of your model and review your configuration.
  • Use instance types that have more memory per GPU, such as the ml.p4d and ml.p5 families. Or, use instances with multiple GPUs, such as ml.g5.12xlarge and ml.g5.48xlarge.
  • If your model can't fit on a single GPU, then shard the model weights across multiple GPUs.

If the preceding error occurs during inference, then your GPU doesn't have enough memory to handle the input request. To troubleshoot this issue, reduce the batch size to 1 and decrease the generation length to a single token. Then, monitor your GPU memory usage and gradually increase the batch size and generation length to determine your GPU's maximum capacity.
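
For example, you can start with a single-request payload that generates one token and then scale both values up. The payload schema (inputs, parameters, and max_new_tokens) follows common LLM serving containers and is an assumption to adapt to your container:

# Minimal sketch: invoke the endpoint with batch size 1 and a single generated token,
# then gradually raise both values while you watch GPU memory.
# The endpoint name and payload schema (inputs/parameters/max_new_tokens) are assumptions.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-llm-endpoint",   # placeholder
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "Hello",                      # single request (batch size 1)
        "parameters": {"max_new_tokens": 1},    # generate one token
    }),
)

print(response["Body"].read().decode("utf-8"))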

Note: If you use Hugging Face's Accelerate library, then turn on DeepSpeed to decrease your GPU memory utilization. This method doesn't affect downstream performance. For more information, see Accelerate on the Hugging Face website.
