Deploying a 15 GB model.tar.gz: "No space left on device"

Hi, I am trying to deploy a PyTorchModel to an endpoint. The model artifact (an LLM, vigogne), compressed, is 15 GB. I have tried several things, including setting instance_type=ml.r5.4xlarge and volume_size=200, but I still get a "No space left on device" error when the file is untarred. The endpoint never even appears in the AWS console. Could you please help me resolve this issue? Thanks in advance for any ideas to try out. Best regards, Alizée

Alizee
Asked 7 months ago, 237 views
2 Answers

I assume you are deploying the SageMaker endpoint from code, but I am not sure what you mean by the endpoint never appearing in the AWS console. Do you have access to the CloudWatch logs?

kraft
Answered 7 months ago
  • Sorry for the imprecision. I mean the Endpoints section of Amazon SageMaker in the AWS console, where an endpoint usually shows as "Creating" while it is being deployed. Because it never showed up there, I could not get to the CloudWatch logs: I normally reach them from the endpoint page, and CloudWatch holds so many different logs that I do not know how to find mine.
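
For anyone else unsure where endpoint logs live in CloudWatch: SageMaker writes them to a log group named /aws/sagemaker/Endpoints/<endpoint-name>. Below is a minimal sketch, assuming boto3 is installed and credentials are configured, that reads the most recent lines from that group. The helper names and the endpoint name "my-endpoint" are hypothetical. If the endpoint was never created, the log group will not exist and the call will raise an error, which is itself a useful signal.

```python
def endpoint_log_group(endpoint_name):
    """Build the CloudWatch log group name used for a SageMaker endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def latest_log_lines(logs, endpoint_name, limit=50):
    """Return up to `limit` messages from the most recently active log stream.

    `logs` is a CloudWatch Logs client, e.g. boto3.client("logs").
    """
    group = endpoint_log_group(endpoint_name)
    # Ask for the single stream that received events most recently.
    streams = logs.describe_log_streams(
        logGroupName=group,
        orderBy="LastEventTime",
        descending=True,
        limit=1,
    )["logStreams"]
    if not streams:
        return []
    events = logs.get_log_events(
        logGroupName=group,
        logStreamName=streams[0]["logStreamName"],
        limit=limit,
        startFromHead=False,  # read from the end of the stream
    )["events"]
    return [event["message"] for event in events]


def main():
    # Imported here so the helpers above can be used without boto3 installed.
    import boto3

    # "my-endpoint" is a placeholder; replace it with your endpoint name.
    for line in latest_log_lines(boto3.client("logs"), "my-endpoint"):
        print(line)
```

Call main() once the placeholder name has been replaced.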

So you mean the endpoint does not show up in the SageMaker Endpoints tab after you create it, right?
You can search CloudTrail events for EventName: CreateEndpoint to check whether an error occurred when the endpoint was deployed.
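
The CloudTrail search can also be done from code. This is a minimal sketch, assuming boto3 is installed and credentials are configured for the same account and region as the deployment. The function name and the 24-hour window are illustrative choices, and only the first page of results is read. The errorCode and errorMessage fields appear in a CloudTrail record only when the API call failed.

```python
import json
from datetime import datetime, timedelta, timezone


def find_create_endpoint_events(cloudtrail, hours=24):
    """Return (event_time, error_code, error_message) for recent CreateEndpoint calls.

    `cloudtrail` is a CloudTrail client, e.g. boto3.client("cloudtrail").
    error_code and error_message are None when the call succeeded.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "CreateEndpoint"}
        ],
        StartTime=start,
        EndTime=end,
    )
    results = []
    for event in response.get("Events", []):
        # The full record is delivered as a JSON string.
        detail = json.loads(event.get("CloudTrailEvent", "{}"))
        results.append(
            (
                event.get("EventTime"),
                detail.get("errorCode"),
                detail.get("errorMessage"),
            )
        )
    return results


def main():
    # Imported here so the function above can be used without boto3 installed.
    import boto3

    client = boto3.client("cloudtrail")
    for when, code, message in find_create_endpoint_events(client):
        print(when, code or "OK", message or "")
```

If main() prints nothing, no CreateEndpoint call was recorded in that window, which would suggest the failure happened before the endpoint request was ever sent.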

kraft
Answered 7 months ago
