SageMaker endpoint - torch.cuda.OutOfMemoryError

Hi Team, I am trying to create an endpoint on SageMaker for a model using the Hugging Face API. Specifically, I am trying to load the Falcon 40B Instruct model. I have tried loading it on ml.g5.12xlarge and ml.g5.24xlarge; both times, while running the model, I get the following error -

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 22.20 GiB total capacity; 19.98 GiB already allocated; 267.12 MiB free; 20.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am not sure why I am running into this error, as both of these instances are designated for running this model and should have enough GPU memory to load it. I get the same error when loading other models, such as WizardCoder from Hugging Face. Can you help me debug this issue?
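
For context, a rough back-of-the-envelope on the memory involved (a sketch; it assumes fp16/bf16 weights at 2 bytes per parameter and the 24 GiB A10G GPUs that g5 instances carry):

# rough estimate of Falcon-40B weight memory, assuming fp16/bf16 (2 bytes per parameter)
params = 40e9
weights_gib = params * 2 / 1024**3    # ~74.5 GiB for the weights alone
per_gpu_gib = 24                      # a single A10G GPU on a g5 instance
print(f"need ~{weights_gib:.1f} GiB for weights, have {per_gpu_gib} GiB per GPU")

The traceback above reports the capacity of a single GPU (~22 GiB usable), so unless the serving container shards the model across all four GPUs on the instance, everything has to fit on that one card.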

The code I am using to launch the endpoint -

import json

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
	role = sagemaker.get_execution_role()
except ValueError:
	iam = boto3.client('iam')
	role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

volume_size_gb = 500  # note: defined here but never passed to deploy() below
# Hub Model configuration. https://huggingface.co/models
hub = {
	'HF_MODEL_ID':'tiiuae/falcon-40b-instruct',
	'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	image_uri=get_huggingface_llm_image_uri("huggingface",version="0.8.2"),
	env=hub,
	role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1,
	instance_type="ml.g5.24xlarge",
	container_startup_health_check_timeout=3000,
)
  
# send request
predictor.predict({
	"inputs": "My name is Julien and I like to",
})
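
Note that the traceback only ever mentions GPU 0, and the hub config above pins 'SM_NUM_GPUS' to 1 even though both g5 instance types carry 4 GPUs. For reference, a variant of that config (a sketch; it assumes this container version uses SM_NUM_GPUS as the tensor-parallel degree and accepts bitsandbytes quantization via the HF_MODEL_QUANTIZE variable):

# hypothetical variant: shard across all 4 A10G GPUs and quantize the weights
hub = {
	'HF_MODEL_ID': 'tiiuae/falcon-40b-instruct',
	'SM_NUM_GPUS': json.dumps(4),                     # one shard per GPU on g5.12xlarge/24xlarge
	'HF_MODEL_QUANTIZE': json.dumps('bitsandbytes'),  # assumption: supported by image version 0.8.2
}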
asked 10 months ago · 1,122 views
1 Answer

For a model with 40B parameters, the GPU memory of ml.g5.12xlarge and ml.g5.24xlarge is not enough (40B parameters!).

You need an NVIDIA A100 GPU, i.e. a p4d.24xlarge or p4de.24xlarge instance.
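
A minimal sketch of what the deploy call from the question would look like on that hardware (assuming ml.p4d.24xlarge, with 8 A100 40 GB GPUs, is enabled for SageMaker hosting in your account and region):

# hypothetical: same HuggingFaceModel setup as in the question, but sharded across the 8 A100s
hub = {
	'HF_MODEL_ID': 'tiiuae/falcon-40b-instruct',
	'SM_NUM_GPUS': json.dumps(8),   # one shard per A100
}
predictor = huggingface_model.deploy(
	initial_instance_count=1,
	instance_type="ml.p4d.24xlarge",
	container_startup_health_check_timeout=3000,
)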

answered 10 months ago
  • Hi, thank you for answering, I really appreciate it. I have a follow-up question: I have deployed this model on an NVIDIA Tesla T4 using Hugging Face's own AWS endpoint offering, and it worked perfectly. There is also a blog from AWS on how to deploy it on ml.g5.24xlarge:

    https://aws.amazon.com/blogs/machine-learning/deploy-falcon-40b-with-large-model-inference-dlcs-on-amazon-sagemaker/

    Can you check this out and let me know? The main question I am trying to answer: Hugging Face also offers an AWS endpoint for deploying this model, and does so on an NVIDIA Tesla T4, which is far more cost-effective than a p4d.24xlarge. If you can help me figure this out, I would be grateful. Thank you.
