CUDA out of memory - Starcoder

0

Running into issues in getting Starcoder to deploy on Sagemaker.

I'm getting the following errors in CloudWatch and even with the instance type: ml.g5.8xlarge

Error 1:

Error: ShardCannotStart
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 134, in get_model
    return santacoder_cls(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_santacoder.py", line 62, in __init__
    self.load_weights(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_santacoder.py", line 96, in load_weights
    value = value.to(device if quantize is None else "cpu").to(dtype)

Error 2:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 22.20 GiB total capacity; 19.72 GiB already allocated; 143.12 MiB free; 21.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Error 3:

You are using a model of type gpt_bigcode to instantiate a model of type gpt2. This is not supported for all configurations of models and can yield errors.

Using the following for the deployment in Sagemaker studio:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'bigcode/starcoder',
    'SM_NUM_GPUS': json.dumps(1),
    'HF_API_TOKEN': '<TOKEN>'
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface",version="0.8.2"),
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    container_startup_health_check_timeout=400,
    endpoint_name="Starcoder"
   )
  
# send request
predictor.predict({
    "inputs": "def print_hello_world():",
})
texnoob
asked 9 months ago335 views
2 Answers
0
Accepted Answer

It worked by putting it on the AWS instance type: ml.g4dn.12xlarge and setting SM_NUM_GPUS: "4"

texnoob
answered 9 months ago
0

Hi, it seems that you also need to add HUGGING_FACE_HUB_TOKEN': "<YOUR HF TOKEN>" in your hub config to get it to work see for example https://huggingface.co/bigcode/starcoder/discussions/48

Best,

Didier

profile pictureAWS
EXPERT
answered 9 months ago
  • I have the token in the deploy code as HF_API_TOKEN that's not the issue. The issue is the CUDA out of Memory errors and the Shard not able to start that are preventing the deployment from completing.

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions