CUDA out of memory - Starcoder


I'm running into issues deploying Starcoder on SageMaker.

I'm getting the following errors in CloudWatch, even with the instance type ml.g5.8xlarge:

Error 1:

Error: ShardCannotStart
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/", line 155, in serve, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/", line 134, in get_model
    return santacoder_cls(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/", line 62, in __init__
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/", line 96, in load_weights
    value = if quantize is None else "cpu").to(dtype)

Error 2:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 22.20 GiB total capacity; 19.72 GiB already allocated; 143.12 MiB free; 21.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Error 3:

You are using a model of type gpt_bigcode to instantiate a model of type gpt2. This is not supported for all configurations of models and can yield errors.

I'm using the following code for the deployment in SageMaker Studio:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look up the execution role by name.
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration.
hub = {
    'SM_NUM_GPUS': json.dumps(1),
    # ...
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    # ...
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    # ...
)

# send request
predictor.predict({
    "inputs": "def print_hello_world():",
})

2 Answers
Accepted Answer

It worked after I switched to the AWS instance type ml.g4dn.12xlarge and set SM_NUM_GPUS to "4".
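
A minimal sketch of what that change could look like in the deployment script, in case it helps. The model ID `bigcode/starcoder`, the helper names `build_hub_env` and `deploy_starcoder`, and the health-check timeout are assumptions for illustration and do not appear in the original post; only the instance type and the `SM_NUM_GPUS` value come from this answer.

```python
import json


def build_hub_env(num_gpus, model_id="bigcode/starcoder", hf_token=None):
    """Build the container environment for the Hugging Face LLM image.

    model_id is an assumed Hub ID; the original post does not show it.
    """
    env = {
        "HF_MODEL_ID": model_id,
        # Environment values are strings; json.dumps(4) yields "4".
        "SM_NUM_GPUS": json.dumps(num_gpus),
    }
    if hf_token:
        # Variable name as suggested in the other answer in this thread.
        env["HUGGING_FACE_HUB_TOKEN"] = hf_token
    return env


def deploy_starcoder(role, hf_token=None):
    """Deploy on a 4-GPU instance with SM_NUM_GPUS set to "4"."""
    # Imported here so build_hub_env can be used without the SageMaker SDK.
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    model = HuggingFaceModel(
        image_uri=get_huggingface_llm_image_uri("huggingface"),
        env=build_hub_env(num_gpus=4, hf_token=hf_token),
        role=role,
    )
    return model.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.12xlarge",
        # Assumed value: large models can take several minutes to load.
        container_startup_health_check_timeout=600,
    )
```

Calling `deploy_starcoder(role)` should return a predictor, and `predictor.predict({"inputs": "def print_hello_world():"})` would then send the same request as in the question.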

answered a year ago

Hi, it seems that you also need to add 'HUGGING_FACE_HUB_TOKEN': "<YOUR HF TOKEN>" to your hub config to get it to work.



AWS
answered a year ago
  • I have the token in the deploy code as HF_API_TOKEN, so that's not the issue. The issue is the CUDA out-of-memory errors and the shard not being able to start, which are preventing the deployment from completing.
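
For what it's worth, the out-of-memory error is consistent with a rough estimate of the weight footprint. The sketch below assumes about 15.5 billion parameters for StarCoder stored in 16-bit precision, and roughly 15 GiB usable on each of the four GPUs of an ml.g4dn.12xlarge. Those figures and the function names are assumptions, while the 22.20 GiB capacity comes from the error message in the question. It counts weights only, so real usage will be higher.

```python
GIB = 1024 ** 3


def weights_gib(num_params, bytes_per_param=2):
    """Approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / GIB


def weights_fit(num_params, gpu_gib, num_gpus, bytes_per_param=2):
    """True if an even split of the weights fits on each GPU (weights only)."""
    per_gpu = weights_gib(num_params, bytes_per_param) / num_gpus
    return per_gpu < gpu_gib


STARCODER_PARAMS = 15.5e9  # assumed parameter count

print(f"weights: {weights_gib(STARCODER_PARAMS):.1f} GiB")
print("1 GPU, 22.20 GiB:", weights_fit(STARCODER_PARAMS, 22.20, 1))
print("4 GPUs, ~15 GiB each:", weights_fit(STARCODER_PARAMS, 15.0, 4))
```

Under these assumptions the weights alone come to roughly 29 GiB, more than a single 22.20 GiB GPU can hold, whereas split four ways they need about 7 GiB per GPU.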
