CUDA out of memory - Starcoder


Running into issues in getting Starcoder to deploy on Sagemaker.

I'm getting the following errors in CloudWatch, even with the instance type ml.g5.8xlarge:

Error 1:

Error: ShardCannotStart
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/", line 155, in serve, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/", line 134, in get_model
    return santacoder_cls(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/", line 62, in __init__
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/", line 96, in load_weights
    value = if quantize is None else "cpu").to(dtype)

Error 2:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 22.20 GiB total capacity; 19.72 GiB already allocated; 143.12 MiB free; 21.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
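The message itself points at one mitigation: setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF, which on SageMaker can be passed to the container through the same environment dict used for the hub config. A minimal sketch — the value 512 is an assumption to illustrate the syntax, not a verified fix for this model:

```python
import json

# Sketch: pass the allocator hint suggested by the OOM message into the
# container environment. 'max_split_size_mb:512' limits the size of cached
# allocator blocks, which can reduce fragmentation when reserved memory is
# much larger than allocated memory. The value 512 is an assumed starting
# point; tune it for your workload.
hub = {
    'SM_NUM_GPUS': json.dumps(1),
    'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:512',
}

print(hub['PYTORCH_CUDA_ALLOC_CONF'])
```

Note that fragmentation tuning only helps at the margin; if the weights simply don't fit on one GPU, sharding across more GPUs (as in the accepted answer) is the real fix.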

Error 3:

You are using a model of type gpt_bigcode to instantiate a model of type gpt2. This is not supported for all configurations of models and can yield errors.

Using the following for the deployment in SageMaker Studio (parts of the snippet were lost when posting; reconstructed here from the standard Hugging Face LLM deployment pattern, with the model ID and instance type taken from this question):

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration.
hub = {
    'HF_MODEL_ID': 'bigcode/starcoder',
    'SM_NUM_GPUS': json.dumps(1),
    'HF_API_TOKEN': "<YOUR HF TOKEN>",
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    container_startup_health_check_timeout=300,
)

# send request
predictor.predict({
    "inputs": "def print_hello_world():",
})
Asked a year ago · 429 views

2 Answers

Accepted Answer

It worked after switching to the AWS instance type ml.g4dn.12xlarge and setting SM_NUM_GPUS: "4".
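In terms of the question's deployment snippet, the accepted fix amounts to two changed values; everything else in the configuration is assumed to stay the same:

```python
import json

# g4dn.12xlarge provides 4x NVIDIA T4 GPUs (16 GiB each). Sharding the
# model across all four GPUs keeps each GPU's share of the Starcoder
# weights within its capacity, instead of loading everything onto one GPU.
hub = {
    'SM_NUM_GPUS': json.dumps(4),   # was 1
}
instance_type = 'ml.g4dn.12xlarge'  # was ml.g5.8xlarge

print(hub['SM_NUM_GPUS'], instance_type)
```

SM_NUM_GPUS must match the GPU count of the chosen instance type, since the text-generation-inference server starts one shard per GPU.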

Answered a year ago

Hi, it seems that you also need to add 'HUGGING_FACE_HUB_TOKEN': "<YOUR HF TOKEN>" to your hub config to get it to work.



AWS
Answered a year ago
  • I have the token in the deploy code as HF_API_TOKEN; that's not the issue. The issue is the CUDA out-of-memory errors and the shard failing to start, which prevent the deployment from completing.
