SageMaker batch transform job synchronization issue
Hi,
We have a Step Functions state machine that launches a batch transform job for inference. When I launch the job with multiple instances, it fails with the ambiguous error `ClientError: See job logs for more information`, and only ~90% of the input files are processed. When I checked the logs, I saw that one instance ended with
...
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:55142 "POST /invocations HTTP/1.1" 200 121
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706666953
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:118.77|#ModelName:model,Level:Model|#hostname:5174f765eb79,requestID:7f422f8f-e525-4504-a50f-557daeb9e82c,timestamp:1706669798
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:38,502 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1706669798502
2024-01-31T02:56:38,504 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1706669798
2024-01-31T02:56:38,582 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Running prediction
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:78.69|#ModelName:model,Level:Model|#hostname:5174f765eb79,requestID:03820383-be17-4844-acc9-1663321ff4fb,timestamp:1706669798
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 80
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:55142 "POST /invocations HTTP/1.1" 200 81
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706666953
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:91.54833984375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:17.018661499023438|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:15.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:11.112581904824939|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:1679|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:12790.87109375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,197 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2599.078125|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,197 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:18.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:57:58,197 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:100.0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:91.54833602905273|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:17.018665313720703|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:15.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:11.112581904824939|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:1679|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:12793.234375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2596.71484375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:18.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
while the other instances seem to be killed suddenly for no apparent reason:
...
2024-01-31T02:57:59,738 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:57896 "POST /invocations HTTP/1.1" 200 131
2024-01-31T02:57:59,738 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:a24f8c07d291,timestamp:1706666953
2024-01-31T02:57:59,739 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,739 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:2|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,743 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1706669879743
2024-01-31T02:57:59,744 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1706669879
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Running prediction
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 112
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:57896 "POST /invocations HTTP/1.1" 200 113
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:110.72|#ModelName:model,Level:Model|#hostname:a24f8c07d291,requestID:263606dc-8801-4738-9f1c-082f90bd7211,timestamp:1706669879
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:a24f8c07d291,timestamp:1706666953
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:0|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,860 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1706669879860
2024-01-31T02:57:59,861 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1706669879
I then ran the same job with a single instance, and it finished without issue. This gives me the impression that the client somehow failed before all instances finished their work, but I don't know how to track down the problem because there is no error log.
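To make the "~90% of files" observation concrete: batch transform normally writes one output object per input, named with a `.out` suffix, so the unprocessed inputs can be found by diffing the two key lists. A minimal sketch (the helper and the sample keys are my own illustration, not part of the job):

```python
def missing_outputs(input_keys, output_keys):
    """Return the input keys that have no matching '<key>.out' output.

    Assumes the default SageMaker batch transform naming convention:
    a processed input 'x.pdb' yields an output object 'x.pdb.out'.
    """
    produced = set(output_keys)
    return [k for k in input_keys if k + ".out" not in produced]


# Example with hypothetical key lists (in practice, listed from S3):
inputs = ["a.pdb", "b.pdb", "c.pdb"]
outputs = ["a.pdb.out", "c.pdb.out"]
print(missing_outputs(inputs, outputs))  # b.pdb was never processed
```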
The CDK code looks like this:
...
batch_inference = sfn_tasks.SageMakerCreateTransformJob(
    self,
    "EquiformerBatchTransform",
    transform_job_name=sfn.JsonPath.string_at("$$.Execution.Name"),
    model_name=sfn.JsonPath.string_at("$.modelname"),
    integration_pattern=sfn.IntegrationPattern.RUN_JOB,
    max_payload=Size.mebibytes(100),
    model_client_options=sfn_tasks.ModelClientOptions(
        invocations_timeout=Duration.minutes(10),
        invocations_max_retries=1,
    ),
    transform_resources=sfn_tasks.TransformResources(
        instance_count=sfn.JsonPath.number_at("$.InstanceAllocationResult.counts"),
        instance_type=ec2.InstanceType.of(
            ec2.InstanceClass.G4DN,
            ec2.InstanceSize.XLARGE,
        ),
    ),
    transform_input=sfn_tasks.TransformInput(
        content_type="chemical/x-pdb",
        transform_data_source=sfn_tasks.TransformDataSource(
            s3_data_source=sfn_tasks.TransformS3DataSource(
                s3_uri=sfn.JsonPath.string_at("$.input"),
                s3_data_type=sfn_tasks.S3DataType.S3_PREFIX,
            ),
        ),
    ),
    transform_output=sfn_tasks.TransformOutput(
        s3_output_path=sfn.JsonPath.format(
            "s3://ot-cache/{}",
            sfn.JsonPath.string_at("$$.Execution.Name"),
        ),
    ),
...
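One detail that may help with the missing error message: the transform job's `FailureReason` is returned by the SageMaker `DescribeTransformJob` API even when the container logs look clean, so it is worth fetching it directly after a failure. A minimal sketch (the client is passed in as a parameter so the function is testable; in real use it would be `boto3.client("sagemaker")`):

```python
def get_failure_reason(sagemaker_client, job_name):
    """Fetch status and FailureReason for a batch transform job.

    `sagemaker_client` is any object exposing describe_transform_job()
    with the boto3 signature, e.g. boto3.client("sagemaker").
    """
    desc = sagemaker_client.describe_transform_job(TransformJobName=job_name)
    return desc["TransformJobStatus"], desc.get("FailureReason", "")
```

With the `RUN_JOB` integration pattern, a `Catch` on the Step Functions task should also receive the failure details in the task's error cause, which avoids digging through CloudWatch by hand.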
Thanks a lot if anyone can give me some clues.
asked 3 months ago · 189 views
No Answers