SageMaker batch transform job synchronization issue


Hi,

We have a Step Functions state machine that includes launching a SageMaker batch transform job for inference. When I launch the job with multiple instances, it fails with the ambiguous error "ClientError: See job logs for more information", and only ~90% of the input files are processed. When I checked the CloudWatch logs, I saw that one instance ended with

...
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:55142 "POST /invocations HTTP/1.1" 200 121
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706666953
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:118.77|#ModelName:model,Level:Model|#hostname:5174f765eb79,requestID:7f422f8f-e525-4504-a50f-557daeb9e82c,timestamp:1706669798
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:38,502 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1706669798502
2024-01-31T02:56:38,504 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1706669798
2024-01-31T02:56:38,582 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Running prediction
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:78.69|#ModelName:model,Level:Model|#hostname:5174f765eb79,requestID:03820383-be17-4844-acc9-1663321ff4fb,timestamp:1706669798
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 80
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:55142 "POST /invocations HTTP/1.1" 200 81
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706666953
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:91.54833984375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:17.018661499023438|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:15.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:11.112581904824939|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:1679|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:12790.87109375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,197 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2599.078125|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,197 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:18.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:57:58,197 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:100.0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:91.54833602905273|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:17.018665313720703|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:15.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:11.112581904824939|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:1679|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:12793.234375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2596.71484375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:18.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878

while the other instances appear to have been killed abruptly for no apparent reason

...
2024-01-31T02:57:59,738 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:57896 "POST /invocations HTTP/1.1" 200 131
2024-01-31T02:57:59,738 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:a24f8c07d291,timestamp:1706666953
2024-01-31T02:57:59,739 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,739 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:2|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,743 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1706669879743
2024-01-31T02:57:59,744 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1706669879
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Running prediction
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 112
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:57896 "POST /invocations HTTP/1.1" 200 113
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:110.72|#ModelName:model,Level:Model|#hostname:a24f8c07d291,requestID:263606dc-8801-4738-9f1c-082f90bd7211,timestamp:1706669879
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:a24f8c07d291,timestamp:1706666953
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:0|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,860 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1706669879860
2024-01-31T02:57:59,861 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1706669879

I then ran the same job with only 1 instance and it finished without any issue. This gives me the impression that the client somehow failed before all instances had finished their work, but I don't know how to track down the problem because there is no error in the logs.
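
For reference, describing the failed job from the SDK only surfaces the same generic message; below is a minimal sketch of that check with boto3 (the job name is a placeholder, in our setup it equals the Step Functions execution name):

import boto3

# Minimal sketch: pull the job-level status and failure reason of the
# batch transform job. "my-transform-job" is a placeholder.
sm = boto3.client("sagemaker")
resp = sm.describe_transform_job(TransformJobName="my-transform-job")

print(resp["TransformJobStatus"])           # e.g. "Failed"
print(resp.get("FailureReason", "<none>"))  # only the generic "ClientError: See job logs ..." text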

The relevant CDK (Python) code looks like this:

...
batch_inference = sfn_tasks.SageMakerCreateTransformJob(
    self,
    "EquiformerBatchTransform",
    transform_job_name=sfn.JsonPath.string_at("$$.Execution.Name"),
    model_name=sfn.JsonPath.string_at("$.modelname"),
    integration_pattern=sfn.IntegrationPattern.RUN_JOB,
    max_payload=Size.mebibytes(100),
    model_client_options=sfn_tasks.ModelClientOptions(
        invocations_timeout=Duration.minutes(10),
        invocations_max_retries=1,
    ),
    transform_resources=sfn_tasks.TransformResources(
        instance_count=sfn.JsonPath.number_at("$.InstanceAllocationResult.counts"),
        instance_type=ec2.InstanceType.of(
            ec2.InstanceClass.G4DN,
            ec2.InstanceSize.XLARGE,
        ),
    ),
    transform_input=sfn_tasks.TransformInput(
        content_type="chemical/x-pdb",
        transform_data_source=sfn_tasks.TransformDataSource(
            s3_data_source=sfn_tasks.TransformS3DataSource(
                s3_uri=sfn.JsonPath.string_at("$.input"),
                s3_data_type=sfn_tasks.S3DataType.S3_PREFIX,
            ),
        ),
    ),
    transform_output=sfn_tasks.TransformOutput(
        s3_output_path=sfn.JsonPath.format(
            "s3://ot-cache/{}",
            sfn.JsonPath.string_at("$$.Execution.Name"),
        ),
    ),
...
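
For the single-instance comparison mentioned above, an equivalent job can be launched directly with boto3 along these lines (a sketch only; the job name, model name, and input S3 URI are placeholders, not the real values):

import boto3

sm = boto3.client("sagemaker")

# Sketch of a single-instance equivalent of the transform configured above.
# Names and the input S3 URI are placeholders.
sm.create_transform_job(
    TransformJobName="single-instance-test",
    ModelName="my-model",
    MaxPayloadInMB=100,
    ModelClientConfig={
        "InvocationsTimeoutInSeconds": 600,
        "InvocationsMaxRetries": 1,
    },
    TransformInput={
        "ContentType": "chemical/x-pdb",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-input-bucket/inputs/",
            }
        },
    },
    TransformOutput={"S3OutputPath": "s3://ot-cache/single-instance-test"},
    TransformResources={"InstanceType": "ml.g4dn.xlarge", "InstanceCount": 1},
)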

Thanks a lot in advance to anyone who can give me some clues.
