Deploying a 15GB model.tar.gz: "no space left on device"

0

Hi, I am trying to deploy a PyTorchModel to an endpoint. The model artifact (a vigogne LLM), compressed, is 15GB. I have tried several things, including setting instance_type=ml.r5.4xlarge and volume_size=200, but I still get a "No space left on device" error when the file is untarred. The endpoint never even appears in the AWS console. Could you please help me resolve this issue? Thanks in advance for any ideas to try out. Best regards, Alizée

2 Answers
0

I assume you are deploying the SageMaker endpoint from code, but I am not sure what you mean when you say the endpoint never appears in the AWS console. Do you have access to the CloudWatch logs?

kraft
Answered 7 months ago
  • Sorry for the imprecision. I mean the Endpoints section of Amazon SageMaker in the AWS console (an endpoint usually shows as "Creating" while it is being deployed). Because it never showed up there, I could not reach the CloudWatch logs, since that is how I usually find them (there are heaps of logs in CloudWatch and I do not know how to find mine).
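
On finding the right logs: SageMaker endpoint containers write to a CloudWatch log group named `/aws/sagemaker/Endpoints/<endpoint-name>`. The following is a minimal sketch of reading the most recent lines from that group with boto3, not a definitive tool. The helper names and the endpoint name `my-vigogne-endpoint` are hypothetical, and the function takes an already-created CloudWatch Logs client.

```python
def endpoint_log_group(endpoint_name):
    """Build the CloudWatch log group name used by a SageMaker endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def recent_endpoint_log_lines(logs_client, endpoint_name, max_lines=50):
    """Return up to max_lines messages from the most recently active log stream."""
    group = endpoint_log_group(endpoint_name)
    # Ask for the single stream that received events most recently.
    streams = logs_client.describe_log_streams(
        logGroupName=group,
        orderBy="LastEventTime",
        descending=True,
        limit=1,
    ).get("logStreams", [])
    if not streams:
        return []
    # startFromHead=False returns the newest events in the stream.
    events = logs_client.get_log_events(
        logGroupName=group,
        logStreamName=streams[0]["logStreamName"],
        limit=max_lines,
        startFromHead=False,
    ).get("events", [])
    return [event["message"] for event in events]


def main():
    # boto3 is imported here so the helpers above can be reused without it.
    import boto3

    logs_client = boto3.client("logs")
    # "my-vigogne-endpoint" is a placeholder; use your own endpoint name.
    for line in recent_endpoint_log_lines(logs_client, "my-vigogne-endpoint"):
        print(line)
```

Call `main()` with AWS credentials configured to print the lines. If the log group does not exist, `describe_log_streams` raises an error, which probably means the container never started and the failure happened earlier in the deployment.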

0

So you mean that no endpoint shows up in the SageMaker Endpoints tab after you create the endpoint, right?
You can search CloudTrail events for EventName: CreateEndpoint to check whether an error occurred when the endpoint was deployed.
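
The CloudTrail search can also be done from code. The sketch below uses the boto3 CloudTrail `lookup_events` call filtered on `EventName = CreateEndpoint` and pulls out any error fields recorded on the event. The function name and the 24-hour window are assumptions for illustration, and the function takes an already-created CloudTrail client.

```python
import json
from datetime import datetime, timedelta, timezone


def find_create_endpoint_events(cloudtrail_client, hours_back=24):
    """List recent CreateEndpoint events with any recorded error details."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours_back)
    response = cloudtrail_client.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "CreateEndpoint"}
        ],
        StartTime=start,
        EndTime=end,
        MaxResults=50,
    )
    results = []
    for event in response.get("Events", []):
        # The full event record is delivered as a JSON string.
        detail = json.loads(event.get("CloudTrailEvent", "{}"))
        results.append(
            {
                "time": event.get("EventTime"),
                "user": event.get("Username"),
                # errorCode / errorMessage are only present when the call failed.
                "error_code": detail.get("errorCode"),
                "error_message": detail.get("errorMessage"),
                "request": detail.get("requestParameters"),
            }
        )
    return results


def main():
    # boto3 is imported here so the helper above can be reused without it.
    import boto3

    client = boto3.client("cloudtrail")
    for item in find_create_endpoint_events(client):
        print(item)
```

Call `main()` with AWS credentials configured for the region where the deployment ran. If no CreateEndpoint event is returned at all, the deployment likely failed before the endpoint was requested, for example while the model artifact was being repacked or uploaded.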

kraft
Answered 7 months ago
