CUDA out of memory - Starcoder


Running into issues getting Starcoder to deploy on SageMaker.

I'm getting the following errors in CloudWatch, even with the instance type ml.g5.8xlarge.

Error 1:

Error: ShardCannotStart
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 134, in get_model
    return santacoder_cls(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_santacoder.py", line 62, in __init__
    self.load_weights(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_santacoder.py", line 96, in load_weights
    value = value.to(device if quantize is None else "cpu").to(dtype)

Error 2:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 22.20 GiB total capacity; 19.72 GiB already allocated; 143.12 MiB free; 21.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Error 3:

You are using a model of type gpt_bigcode to instantiate a model of type gpt2. This is not supported for all configurations of models and can yield errors.

Using the following for the deployment in SageMaker Studio:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'bigcode/starcoder',
    'SM_NUM_GPUS': json.dumps(1),
    'HF_API_TOKEN': '<TOKEN>'
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface",version="0.8.2"),
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.8xlarge",
    container_startup_health_check_timeout=400,
    endpoint_name="Starcoder",
)

# send request
predictor.predict({
    "inputs": "def print_hello_world():",
})
texnoob
Asked 10 months ago, 360 views
2 Answers
Accepted Answer

It worked after switching to the AWS instance type ml.g4dn.12xlarge and setting SM_NUM_GPUS to "4".
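
For reference, a minimal sketch of the changed parts, assuming everything else stays as in the question's deployment script (only the GPU count and instance type differ):

# shard the model across all 4 GPUs on the g4dn.12xlarge
hub = {
    'HF_MODEL_ID': 'bigcode/starcoder',
    'SM_NUM_GPUS': json.dumps(4),
    'HF_API_TOKEN': '<TOKEN>'
}

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.12xlarge",  # 4 x 16 GiB T4 GPUs instead of 1 x 24 GiB A10G
    container_startup_health_check_timeout=400,
    endpoint_name="Starcoder",
)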

texnoob
Answered 9 months ago

Hi, it seems that you also need to add 'HUGGING_FACE_HUB_TOKEN': "<YOUR HF TOKEN>" to your hub config to get it to work; see for example https://huggingface.co/bigcode/starcoder/discussions/48
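
For example, a sketch of the hub config with that variable added (the token value is a placeholder):

hub = {
    'HF_MODEL_ID': 'bigcode/starcoder',
    'SM_NUM_GPUS': json.dumps(1),
    'HUGGING_FACE_HUB_TOKEN': '<YOUR HF TOKEN>'  # read by the TGI container when it pulls the gated model
}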

Best,

Didier

AWS
Expert
Answered 10 months ago
  • I have the token in the deploy code as HF_API_TOKEN, so that's not the issue. The issue is the CUDA out-of-memory errors and the shard failing to start, which are preventing the deployment from completing.
