SageMaker Batch Transform Job Failure: Timeout Issue and Job Restarted Unexpectedly


I am using SageMaker batch transform for inference with my PyTorch model. My container follows the same structure as https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own/container. The problem is that the job starts multiple times on different workers when I configure multiple workers, and when I configure a single worker it runs again after it finishes.

I suspect the timeout configuration is wrong. I have tried increasing keepalive_timeout and proxy_read_timeout in the serve file, and I have tried setting SAGEMAKER_MODEL_SERVER_TIMEOUT as an environment variable, but nothing worked. Could someone help? Thanks!
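
For illustration only, here is a minimal sketch of where timeouts and retries can be set at the job level, in addition to the container-side settings mentioned above. It is not the original setup. The helper name `build_transform_job_request`, the instance type, and all argument values are hypothetical. The sketch assumes the `ModelClientConfig` and `Environment` parameters of the `CreateTransformJob` API; the 1 to 3600 second and 0 to 3 retry limits are the documented ranges as I understand them. Whether the container actually reads `SAGEMAKER_MODEL_SERVER_TIMEOUT` depends on the serving stack and should be verified for a bring-your-own container.

```python
def build_transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri,
                                timeout_seconds=3600, max_retries=0):
    """Build keyword arguments for a CreateTransformJob call (hypothetical helper).

    Setting max_retries to 0 asks SageMaker not to re-send a request that
    timed out, which is one possible cause of work appearing to repeat.
    """
    # Assumed API limits: timeout 1..3600 seconds, retries 0..3.
    if not 1 <= timeout_seconds <= 3600:
        raise ValueError("timeout_seconds must be between 1 and 3600")
    if not 0 <= max_retries <= 3:
        raise ValueError("max_retries must be between 0 and 3")

    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # Client-side timeout and retry policy for each invocation.
        "ModelClientConfig": {
            "InvocationsTimeoutInSeconds": timeout_seconds,
            "InvocationsMaxRetries": max_retries,
        },
        # Environment variables passed to the container; values must be strings.
        "Environment": {"SAGEMAKER_MODEL_SERVER_TIMEOUT": str(timeout_seconds)},
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            }
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        # Placeholder resources; adjust to the real workload.
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    }
```

The resulting dictionary could then be passed as `boto3.client("sagemaker").create_transform_job(**request)`, assuming valid credentials, an existing model, and real S3 locations.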

  • To understand the scenario better, can you share the error message and the code used for the setup?

No answers
