InternalServerError with SageMaker Batch transform job


We have a few thousand models on SageMaker, backed by our own containers on ECR. With one of these models, a Batch transform job stays pending for ~20 minutes and is then marked as failed with "InternalServerError: We encountered an internal error. Please try again.". Retrying the job didn't help. Is there any way to debug this?

The model's execution role, container, security group, and subnets are set correctly. All the containers are in the same repo, which the execution role has permission to access. The container image is ~3GB, but it is definitely not the largest one we have run, and all our other models run fine with the same job parameters.

  • I suggest reaching out to Support and creating a Support Case.
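Before (or alongside) opening a case, it may help to collect what the API and CloudWatch already expose for the failed job. The following is only a minimal sketch, assuming boto3 is installed and credentials are configured; the helper names `job_summary`, `job_log_lines`, and `diagnose` are hypothetical, and the job name is whatever you passed when creating the job. It reads `FailureReason` from `describe_transform_job` and pulls any log streams under `/aws/sagemaker/TransformJobs` whose names start with the job name.

```python
from typing import Any, Dict, List

# Log group that SageMaker writes batch transform container logs to.
LOG_GROUP = "/aws/sagemaker/TransformJobs"


def job_summary(sm_client: Any, job_name: str) -> Dict[str, Any]:
    """Return the fields of DescribeTransformJob most useful for a failed job."""
    desc = sm_client.describe_transform_job(TransformJobName=job_name)
    return {
        "status": desc.get("TransformJobStatus"),
        "failure_reason": desc.get("FailureReason"),
        "model_name": desc.get("ModelName"),
        "instance_type": desc.get("TransformResources", {}).get("InstanceType"),
    }


def job_log_lines(logs_client: Any, job_name: str, limit: int = 200) -> List[str]:
    """Collect log messages from every stream whose name starts with the job name.

    Note: describe_log_streams raises ResourceNotFoundException if the log
    group does not exist in the account/region; that case is not handled here.
    """
    lines: List[str] = []
    streams = logs_client.describe_log_streams(
        logGroupName=LOG_GROUP, logStreamNamePrefix=job_name
    ).get("logStreams", [])
    for stream in streams:
        events = logs_client.get_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=stream["logStreamName"],
            startFromHead=True,
            limit=limit,
        ).get("events", [])
        lines.extend(event["message"] for event in events)
    return lines


def diagnose(job_name: str) -> None:
    """Print the job summary and any container logs (requires AWS access)."""
    import boto3  # imported lazily so the helpers above work with stub clients

    print(job_summary(boto3.client("sagemaker"), job_name))
    lines = job_log_lines(boto3.client("logs"), job_name)
    if not lines:
        print("No log events found for this job.")
    for line in lines:
        print(line)
```

Calling `diagnose("<your-transform-job-name>")` prints the summary followed by the log lines. If no log events come back at all, one plausible reading (an assumption, not something the API states) is that the container never started, which would point toward image pull, networking, or capacity rather than the inference code. Either way, the job name, region, and this output are useful details to include in a Support case.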

jmsmkn
asked 2 years ago, 194 views
No answers
