Deploying a 15 GB model.tar.gz: "No space left on device"

Hi, I am trying to deploy a PyTorchModel to an endpoint. The model artifact (a Vigogne LLM), compressed as model.tar.gz, is 15 GB. I have already tried instance_type=ml.r5.4xlarge with volume_size=200, but I still get a "No space left on device" error when the file is untarred. The endpoint never even appears in the AWS console. Could you please help me resolve this issue? Thanks in advance for any ideas to try out. Best regards, Alizée

2 Answers

I assume you are deploying the SageMaker endpoint from code, but I am not sure what you mean when you say the endpoint never appears in the AWS console. Do you have access to the CloudWatch logs?

kraft
answered 7 months ago
  • Sorry for the imprecision. I mean the Endpoints section of Amazon SageMaker in the AWS console, where an endpoint usually shows as "Creating" while it is being deployed. Because the endpoint never shows up there, I could not reach the CloudWatch logs, since that page is how I usually find them (there are many different logs in CloudWatch and I do not know how to locate mine).
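For locating the right logs, SageMaker endpoint containers normally write to a CloudWatch log group named /aws/sagemaker/Endpoints/<endpoint-name>. The following is a minimal sketch, assuming boto3 is available and that the caller passes in a CloudWatch Logs client; the helper names and the endpoint name in the usage comment are hypothetical.

```python
def endpoint_log_group(endpoint_name):
    """Build the CloudWatch log group name used by a SageMaker endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def find_disk_errors(logs_client, endpoint_name,
                     pattern='"No space left on device"'):
    """Return log messages from the endpoint's log group that match `pattern`.

    `logs_client` is expected to behave like boto3.client("logs").
    Only the first page of results is read, to keep the sketch short.
    """
    response = logs_client.filter_log_events(
        logGroupName=endpoint_log_group(endpoint_name),
        filterPattern=pattern,  # a quoted phrase matches that exact phrase
    )
    return [event["message"] for event in response.get("events", [])]


# Usage (assumes AWS credentials and a default region are configured;
# "my-endpoint" is a placeholder name):
#   import boto3
#   for line in find_disk_errors(boto3.client("logs"), "my-endpoint"):
#       print(line)
```

If the log group does not exist at all, the call raises an error, which would suggest that the container never started and that the failure happened earlier in the deployment.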

So you mean that the endpoint does not show up in the SageMaker Endpoints tab after you create it, right?
You can search the CloudTrail events for EventName: CreateEndpoint to check whether an error occurred when the endpoint was deployed.
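The CloudTrail search can also be done from code. This is a minimal sketch, assuming boto3 is available and that the caller passes in a CloudTrail client; the function name and the 24-hour window are illustrative choices, not something from the original post.

```python
import json
from datetime import datetime, timedelta, timezone


def find_create_endpoint_events(cloudtrail_client, hours_back=24):
    """List recent CreateEndpoint calls with any error they recorded.

    `cloudtrail_client` is expected to behave like boto3.client("cloudtrail").
    Returns (event_time, error_code, error_message) tuples; the error fields
    are None when the call succeeded. Only the first page of results is read.
    """
    start = datetime.now(timezone.utc) - timedelta(hours=hours_back)
    response = cloudtrail_client.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "CreateEndpoint"}
        ],
        StartTime=start,
    )
    results = []
    for event in response.get("Events", []):
        # The full event record is returned as a JSON string.
        detail = json.loads(event["CloudTrailEvent"])
        results.append(
            (event.get("EventTime"),
             detail.get("errorCode"),
             detail.get("errorMessage"))
        )
    return results


# Usage (assumes AWS credentials and a default region are configured):
#   import boto3
#   for row in find_create_endpoint_events(boto3.client("cloudtrail")):
#       print(row)
```

If no CreateEndpoint event is returned at all, the deployment probably failed before the endpoint was requested, for example while the model artifact was being packaged or uploaded on the client side.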

kraft
answered 7 months ago