SageMaker Batch Transform Job Failure: Timeout Issue and Job Restarted Unexpectedly


I am using SageMaker batch transform to run inference with my PyTorch model. I am using the same structure as ... The problem is that the job starts multiple times on different workers if I choose multiple workers, or it repeats after finishing if I choose one worker.

I suspect the timeout setup is misconfigured. I have tried increasing keepalive_timeout and proxy_read_timeout in the serve file, and setting SAGEMAKER_MODEL_SERVER_TIMEOUT as an environment variable, but nothing worked. Could someone help? Thanks!
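For reference, here is a minimal sketch of where the job-level timeout settings would go, assuming the job is created through the `CreateTransformJob` API. One possible explanation, not confirmed here, is that batch transform retries an invocation that exceeds its timeout, which would look like the job repeating. The helper name `build_transform_job_request`, the job and model names, the S3 URIs, and the instance type are all hypothetical placeholders; whether the container actually honors `SAGEMAKER_MODEL_SERVER_TIMEOUT` depends on the serving stack it runs.

```python
def build_transform_job_request(
    job_name,
    model_name,
    input_s3_uri,
    output_s3_uri,
    timeout_seconds=3600,
    max_retries=0,
    instance_type="ml.m5.xlarge",  # placeholder instance type
    instance_count=1,
):
    """Build keyword arguments for the CreateTransformJob API (hypothetical helper)."""
    # Documented API limits: timeout 1..3600 seconds, retries 0..3.
    if not 1 <= timeout_seconds <= 3600:
        raise ValueError("timeout_seconds must be between 1 and 3600")
    if not 0 <= max_retries <= 3:
        raise ValueError("max_retries must be between 0 and 3")

    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # Per-invocation timeout and retry count applied by batch transform itself.
        "ModelClientConfig": {
            "InvocationsTimeoutInSeconds": timeout_seconds,
            "InvocationsMaxRetries": max_retries,
        },
        # Environment variables passed to the container; values must be strings.
        # Assumption: the container reads SAGEMAKER_MODEL_SERVER_TIMEOUT.
        "Environment": {
            "SAGEMAKER_MODEL_SERVER_TIMEOUT": str(timeout_seconds),
        },
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": input_s3_uri,
                }
            },
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


# Example with placeholder names and buckets.
request = build_transform_job_request(
    job_name="my-transform-job",
    model_name="my-pytorch-model",
    input_s3_uri="s3://my-bucket/input/",
    output_s3_uri="s3://my-bucket/output/",
)
```

The resulting dictionary could then be passed as `boto3.client("sagemaker").create_transform_job(**request)`. Setting `max_retries=0` is only a way to test whether retries are what makes the job appear to restart.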

  • To understand the scenario better, can you share the error message and the code used for the setup?