Deploying a 15 GB model.tar.gz: "No space left on device"


Hi, I am trying to deploy a PyTorchModel to an endpoint. The model artifact (a Vigogne LLM), compressed as model.tar.gz, is 15 GB. I have tried several things, including setting instance_type=ml.r5.4xlarge and volume_size=200, but I still get a "No space left on device" error when the file is untarred. The endpoint never even appears in the AWS console. Could you please help me resolve this issue? Thanks in advance for any ideas to try out. Best regards, Alizée

2 answers

I assume you are deploying the SageMaker endpoint from code, but I am not sure what you mean when you say the endpoint never appears in the AWS console. Do you have access to the CloudWatch logs?

kraft
answered 7 months ago
  • Sorry for the imprecision. I mean the Endpoints section of Amazon SageMaker in the AWS console, where an endpoint usually shows as "Creating" while it is being deployed. Because the endpoint never appears there, I could not reach its CloudWatch logs: I usually get to the logs from the endpoint page, and CloudWatch contains so many log groups that I do not know how to find mine.
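For anyone with the same difficulty locating logs: SageMaker writes endpoint container logs to the CloudWatch log group `/aws/sagemaker/Endpoints/<endpoint-name>`. Below is a minimal sketch, assuming boto3 is installed and credentials are configured, that searches that group for the error text. The helper names and the endpoint name `my-endpoint` are hypothetical. If the endpoint was never created, the log group will not exist and the call will raise an error.

```python
def endpoint_log_group(endpoint_name):
    """Return the CloudWatch log group SageMaker uses for an endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def find_log_messages(logs_client, endpoint_name, phrase="No space left on device"):
    """Return log messages in the endpoint's log group that contain `phrase`."""
    paginator = logs_client.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=endpoint_log_group(endpoint_name),
        # Double quotes make CloudWatch match the whole phrase, not single words.
        filterPattern=f'"{phrase}"',
    )
    return [event["message"] for page in pages for event in page.get("events", [])]


def main():
    # Assumption: boto3 is installed and a default region/credentials are set.
    import boto3

    logs = boto3.client("logs")
    # "my-endpoint" is a placeholder; use the name passed to deploy().
    for message in find_log_messages(logs, "my-endpoint"):
        print(message)
```

Calling `main()` prints every matching log line. Passing a different `phrase` searches for other errors.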


So you mean that no endpoint shows up in the SageMaker Endpoints tab after you create the endpoint, right?
You can search CloudTrail events for EventName: CreateEndpoint to check for any error raised when the endpoint was deployed.
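The CloudTrail search can also be done from code. Here is a minimal sketch, assuming boto3 is installed and credentials are configured, that uses the CloudTrail `lookup_events` API to list recent CreateEndpoint calls together with any error recorded for them. The function name and the seven-day window are illustrative choices, not part of the original suggestion.

```python
import json
from datetime import datetime, timedelta, timezone


def find_create_endpoint_events(cloudtrail_client, days_back=7):
    """Return recent CreateEndpoint events with any recorded error details."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days_back)
    paginator = cloudtrail_client.get_paginator("lookup_events")
    pages = paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "CreateEndpoint"}
        ],
        StartTime=start,
        EndTime=end,
    )
    results = []
    for page in pages:
        for event in page.get("Events", []):
            # The full record is a JSON string; errors appear as errorCode/errorMessage.
            detail = json.loads(event["CloudTrailEvent"])
            results.append(
                {
                    "time": event.get("EventTime"),
                    "user": event.get("Username"),
                    "error_code": detail.get("errorCode"),
                    "error_message": detail.get("errorMessage"),
                    "request": detail.get("requestParameters"),
                }
            )
    return results


def main():
    # Assumption: boto3 is installed and a default region/credentials are set.
    import boto3

    client = boto3.client("cloudtrail")
    for item in find_create_endpoint_events(client):
        print(item)
```

If the list is empty, the CreateEndpoint call probably never reached SageMaker, which would suggest the failure happens earlier, for example while the model archive is being repacked or uploaded on the machine running the deployment code.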

kraft
answered 7 months ago
