
SageMaker Batch Transform Job Failure: Timeout Issue and Job Restarted Unexpectedly


I am using SageMaker batch transform to run inference with my PyTorch model. My container follows the same structure as https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own/container. The problem is that the job starts multiple times on different workers when I choose multiple workers, and it repeats after it finishes when I choose 1 worker.

I suspect the cause is a timeout setting. I have tried increasing keepalive_timeout and proxy_read_timeout in the serve file, and I have tried setting SAGEMAKER_MODEL_SERVER_TIMEOUT as an environment variable, but nothing worked. Could someone help? Thanks!
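For reference, here is a minimal sketch of where timeout-related settings can be passed when the job is created through boto3's `create_transform_job`. It is not the code from this setup. The helper name `build_transform_request`, the job and model names, the S3 URIs, and the instance type are all hypothetical placeholders. The `ModelClientConfig` fields control how long batch transform waits for each invocation and how many times it retries one, so a retried invocation may look like the job "repeating". Whether a custom container actually reads `SAGEMAKER_MODEL_SERVER_TIMEOUT` depends on its own serve script, which is an assumption here.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            timeout_seconds=3600, max_retries=0):
    """Build keyword arguments for boto3's SageMaker create_transform_job.

    Hypothetical helper for illustration; it only assembles a dict and
    does not call AWS.
    """
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # Client-side settings batch transform uses when calling the container.
        "ModelClientConfig": {
            "InvocationsTimeoutInSeconds": timeout_seconds,
            "InvocationsMaxRetries": max_retries,
        },
        # Environment variables passed to the container. Assumption: the
        # container's serve script reads this variable.
        "Environment": {"SAGEMAKER_MODEL_SERVER_TIMEOUT": str(timeout_seconds)},
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        # Placeholder instance settings; adjust to the real job.
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    }


request = build_transform_request(
    "my-batch-job",           # hypothetical job name
    "my-pytorch-model",       # hypothetical model name
    "s3://my-bucket/input/",  # hypothetical input prefix
    "s3://my-bucket/output/", # hypothetical output prefix
)

# To submit the job (requires AWS credentials and an existing model):
# import boto3
# boto3.client("sagemaker").create_transform_job(**request)
```

Setting `max_retries=0` in this sketch is only a way to check whether the repeated runs are retries of timed-out invocations; it is not necessarily the right setting for production.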

  • To understand the scenario better, can you share the error message and the code used for the setup?