Hi Team, I am trying to create an endpoint on SageMaker for a model using the Hugging Face API. I am trying to load the Falcon 40B Instruct model, and I have tried loading it on both ml.g5.12xlarge and ml.g5.24xlarge.
Both times, the deployment fails with the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 22.20 GiB total capacity; 19.98 GiB already allocated; 267.12 MiB free; 20.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I am not sure why I am running into this error, as both of these instance types are designated for running this model and should have enough GPU memory to load it.
I am getting the same error while loading other models such as WizardCoder from Hugging Face. Not sure why this is happening; can you help me debug this issue?
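For reference, here is a rough back-of-envelope memory estimate I put together (assuming fp16 weights at 2 bytes per parameter, and ignoring activations and the KV cache), which suggests the weights alone exceed a single 24 GiB A10G:

```python
# Rough GPU memory estimate for model weights.
# Assumption: fp16 precision, i.e. 2 bytes per parameter;
# activations and KV cache would add to this.
def fp16_weight_gib(num_params_billion: float) -> float:
    return num_params_billion * 1e9 * 2 / 2**30

falcon_40b = fp16_weight_gib(40)  # weights alone, in GiB
a10g_gib = 24                     # memory of one A10G GPU on a g5 instance

print(f"Falcon-40B fp16 weights: {falcon_40b:.1f} GiB")
print(f"A10G GPUs needed just for the weights: {falcon_40b / a10g_gib:.1f}")
```

If this arithmetic is right, the model cannot fit on GPU 0 alone, which would match the error message.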
The code I am using to launch the endpoint:
import json
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

volume_size_gb = 500  # currently unused

# Hub model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'tiiuae/falcon-40b-instruct',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.8.2"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.24xlarge",
    container_startup_health_check_timeout=3000,
)

# send request
predictor.predict({
    "inputs": "My name is Julien and I like to",
})
Hi, thank you for answering, I really appreciate it. I have a follow-up question. I have deployed this model on Hugging Face using an AWS endpoint on an Nvidia Tesla T4 and it worked perfectly. Also, there is a blog from AWS on how to deploy this on ml.g5.24xlarge:
https://aws.amazon.com/blogs/machine-learning/deploy-falcon-40b-with-large-model-inference-dlcs-on-amazon-sagemaker/
Can you check this out and let me know? The main question I am trying to answer is this: Hugging Face also offers an AWS endpoint to deploy this model, and does so with an Nvidia Tesla T4, which is more cost effective than the p4d.24xlarge. If you can help me figure this out, I would be grateful. Thank you.