SageMaker batch transform job synchronization issue


Hi,

We have a Step Functions state machine that includes launching a SageMaker batch transform job for inference. When I launch the job with multiple instances, it fails with the ambiguous error ClientError: See job logs for more information, and only ~90% of the input files are processed. When I checked the logs, I saw that one instance ended with

...
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:55142 "POST /invocations HTTP/1.1" 200 121
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706666953
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:118.77|#ModelName:model,Level:Model|#hostname:5174f765eb79,requestID:7f422f8f-e525-4504-a50f-557daeb9e82c,timestamp:1706669798
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:38,497 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:38,502 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1706669798502
2024-01-31T02:56:38,504 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1706669798
2024-01-31T02:56:38,582 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Running prediction
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:78.69|#ModelName:model,Level:Model|#hostname:5174f765eb79,requestID:03820383-be17-4844-acc9-1663321ff4fb,timestamp:1706669798
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 80
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:55142 "POST /invocations HTTP/1.1" 200 81
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706666953
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:38,583 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:1|#Level:Host|#hostname:5174f765eb79,timestamp:1706669798
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:91.54833984375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:17.018661499023438|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:15.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:11.112581904824939|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:1679|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,196 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:12790.87109375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,197 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2599.078125|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:56:58,197 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:18.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669818
2024-01-31T02:57:58,197 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:100.0|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:91.54833602905273|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:17.018665313720703|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:15.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:11.112581904824939|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:1679|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0|#Level:Host,device_id:0|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:12793.234375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2596.71484375|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878
2024-01-31T02:57:58,198 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:18.7|#Level:Host|#hostname:5174f765eb79,timestamp:1706669878

while the other instances appear to have been killed abruptly, with nothing in the logs to explain why

...
2024-01-31T02:57:59,738 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:57896 "POST /invocations HTTP/1.1" 200 131
2024-01-31T02:57:59,738 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:a24f8c07d291,timestamp:1706666953
2024-01-31T02:57:59,739 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,739 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:2|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,743 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1706669879743
2024-01-31T02:57:59,744 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1706669879
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Running prediction
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 112
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.255.130:57896 "POST /invocations HTTP/1.1" 200 113
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0-stdout MODEL_METRICS - PredictionTime.Milliseconds:110.72|#ModelName:model,Level:Model|#hostname:a24f8c07d291,requestID:263606dc-8801-4738-9f1c-082f90bd7211,timestamp:1706669879
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:a24f8c07d291,timestamp:1706666953
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 TS_METRICS - QueueTime.ms:0|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,855 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:0|#Level:Host|#hostname:a24f8c07d291,timestamp:1706669879
2024-01-31T02:57:59,860 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1706669879860
2024-01-31T02:57:59,861 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend received inference at: 1706669879

I then ran the same job with only 1 instance and it finished without any issue. This gives me the impression that the client somehow fails before all of the instances have finished their work, but I don't know how to pin down the problem because there is no error in the logs.
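
In case it helps, this is roughly how I check the job status and count how many outputs actually landed in S3 (an untested sketch; the job name, input bucket, and prefixes are placeholders for the values the state machine generates):

import boto3

sm = boto3.client("sagemaker")
s3 = boto3.client("s3")

# Placeholder job name -- in reality this is the Step Functions execution name.
job = sm.describe_transform_job(TransformJobName="my-execution-name")
print(job["TransformJobStatus"], job.get("FailureReason"))


def count_keys(bucket: str, prefix: str) -> int:
    # Count every object under a prefix with the standard list_objects_v2 paginator.
    paginator = s3.get_paginator("list_objects_v2")
    return sum(
        page.get("KeyCount", 0)
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
    )


# Placeholder input location; ot-cache matches the transform output path in the CDK snippet below.
n_in = count_keys("my-input-bucket", "my-input-prefix/")
n_out = count_keys("ot-cache", "my-execution-name/")
print(f"{n_out}/{n_in} outputs produced")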

The code (AWS CDK, Python) looks like this:

...
batch_inference = sfn_tasks.SageMakerCreateTransformJob(
    self,
    "EquiformerBatchTransform",
    transform_job_name=sfn.JsonPath.string_at("$$.Execution.Name"),
    model_name=sfn.JsonPath.string_at("$.modelname"),
    integration_pattern=sfn.IntegrationPattern.RUN_JOB,
    max_payload=Size.mebibytes(100),
    model_client_options=sfn_tasks.ModelClientOptions(
        invocations_timeout=Duration.minutes(10),
        invocations_max_retries=1,
    ),
    transform_resources=sfn_tasks.TransformResources(
        instance_count=sfn.JsonPath.number_at("$.InstanceAllocationResult.counts"),
        instance_type=ec2.InstanceType.of(
            ec2.InstanceClass.G4DN,
            ec2.InstanceSize.XLARGE,
        ),
    ),
    transform_input=sfn_tasks.TransformInput(
        content_type="chemical/x-pdb",
        transform_data_source=sfn_tasks.TransformDataSource(
            s3_data_source=sfn_tasks.TransformS3DataSource(
                s3_uri=sfn.JsonPath.string_at("$.input"),
                s3_data_type=sfn_tasks.S3DataType.S3_PREFIX,
            ),
        ),
    ),
    transform_output=sfn_tasks.TransformOutput(
        s3_output_path=sfn.JsonPath.format(
            "s3://ot-cache/{}",
            sfn.JsonPath.string_at("$$.Execution.Name"),
        ),
    ),
...
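
One thing I plan to try, in case it surfaces anything more useful than the one-line ClientError, is attaching a catch to the task so the raw error payload is kept in the execution output. A rough sketch (failure_state is just a placeholder Pass state, not something that exists in my stack yet; sfn is the usual aws_cdk.aws_stepfunctions alias from the snippet above):

# Route any failure of the transform task into a Pass state and keep the
# error object at $.error so it shows up in the execution history.
failure_state = sfn.Pass(self, "BatchTransformFailed")
batch_inference.add_catch(
    failure_state,
    errors=[sfn.Errors.ALL],
    result_path="$.error",
)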

Thanks a lot if anyone can give me some clues.
