SageMaker Batch Transform MultiRecord Fail with CSV as Input
I am trying to run inference on a CSV file stored on S3 using Batch Transform with strategy='MultiRecord' and assemble_with='Line'. The same setup works with strategy='SingleRecord', but I need it to be as efficient as possible. The problem appears when I switch to MultiRecord with a small CSV of two columns, both text.
With max_payload=6, the job succeeds with a CSV of 181 samples (609.7KB), but with the same CSV extended to 182 samples (621.6KB) it fails with "message": "Worker died." I assumed this had something to do with the memory limit of the instance I was using, so I switched to an ml.m5.2xlarge with 32GB of memory.
When I switch to max_payload=7, the job suddenly works with the 182-sample CSV (621.6KB), but it fails with anything larger than that. Any ideas what could be causing this?
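For reference, the transform job is configured roughly along these lines with the SageMaker Python SDK (the bucket paths, entry point, and framework versions below are placeholders, not my exact values):

import sagemaker
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",   # placeholder model artifact
    role=role,
    entry_point="inference.py",                 # placeholder inference script
    framework_version="1.12",
    py_version="py38",
)

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    strategy="MultiRecord",
    assemble_with="Line",
    accept="text/csv",
    max_payload=6,                              # in MB; the value I am varying
    output_path="s3://my-bucket/output/",       # placeholder output location
)

transformer.transform(
    "s3://my-bucket/input.csv",                 # placeholder input CSV on S3
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()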
The logs look like this:
2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,218 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 0%| | 0.00/8.68M [00:00<?, ?B/s]
2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,327 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 0%| | 20.0k/8.68M [00:00<00:47, 191kB/s]
2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,434 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 1%| | 100k/8.68M [00:00<00:17, 526kB/s]
2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,543 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 3%|▎ | 228k/8.68M [00:00<00:10, 841kB/s]
2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,651 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 6%|▌ | 509k/8.68M [00:00<00:05, 1.56MB/s]
2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,759 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 12%|█▏ | 1.04M/8.68M [00:00<00:02, 2.91MB/s]
2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,867 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 25%|██▍ | 2.14M/8.68M [00:00<00:01, 5.54MB/s]
2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,975 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 50%|████▉ | 4.31M/8.68M [00:00<00:00, 10.6MB/s]
2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,976 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 100%|█████████▉| 8.65M/8.68M [00:00<00:00, 20.7MB/s]
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,129 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 100%|██████████| 8.68M/8.68M [00:00<00:00, 10.5MB/s]
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,129 [WARN ] W-9007-model_1.0-stderr MODEL_LOG -
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,131 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 0%| | 0.00/615 [00:00<?, ?B/s]
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,894 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13647
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,895 [INFO ] W-9005-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13459
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,895 [INFO ] W-9006-model_1.0 TS_METRICS - W-9006-model_1.0.ms:15108|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,896 [INFO ] W-9006-model_1.0 TS_METRICS - WorkerThreadTime.ms:56|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,897 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1684489793897
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,898 [INFO ] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13705
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,898 [INFO ] W-9007-model_1.0 TS_METRICS - W-9007-model_1.0.ms:15111|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,899 [INFO ] W-9007-model_1.0 TS_METRICS - WorkerThreadTime.ms:42|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,896 [INFO ] W-9005-model_1.0 TS_METRICS - W-9005-model_1.0.ms:15111|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,902 [INFO ] W-9005-model_1.0 TS_METRICS - WorkerThreadTime.ms:108|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,906 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13399
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,906 [INFO ] W-9000-model_1.0 TS_METRICS - W-9000-model_1.0.ms:15148|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,907 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:73|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,907 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Backend received inference at: 1684489793
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:53,989 [INFO ] W-9002-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13784
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:53,991 [INFO ] W-9002-model_1.0 TS_METRICS - W-9002-model_1.0.ms:15231|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:53,992 [INFO ] W-9002-model_1.0 TS_METRICS - WorkerThreadTime.ms:56|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,009 [INFO ] W-9001-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13663
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,009 [INFO ] W-9001-model_1.0 TS_METRICS - W-9001-model_1.0.ms:15249|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,010 [INFO ] W-9001-model_1.0 TS_METRICS - WorkerThreadTime.ms:103|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,038 [INFO ] W-9003-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13293
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,038 [INFO ] W-9003-model_1.0 TS_METRICS - W-9003-model_1.0.ms:15277|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,039 [INFO ] W-9003-model_1.0 TS_METRICS - WorkerThreadTime.ms:52|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,187 [INFO ] W-9004-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13751
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,187 [INFO ] W-9004-model_1.0 TS_METRICS - W-9004-model_1.0.ms:15426|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794
2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,188 [INFO ] W-9004-model_1.0 TS_METRICS - WorkerThreadTime.ms:87|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794
2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,971 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:66.7|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839
2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,972 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:46.74238204956055|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839
2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,972 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:9.122749328613281|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839
2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,973 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:16.3|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839
2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,973 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:17504.14453125|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839
2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,974 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:13734.4453125|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839
2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,974 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:44.8|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,909 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 60000
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,910 [ERROR] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,910 [ERROR] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
2023-05-19T12:50:54.009+03:00 org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
2023-05-19T12:50:54.009+03:00 #011at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:199) [model-server.jar:?]
2023-05-19T12:50:54.009+03:00 #011at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
2023-05-19T12:50:54.009+03:00 #011at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
2023-05-19T12:50:54.009+03:00 #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
2023-05-19T12:50:54.009+03:00 #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
2023-05-19T12:50:54.009+03:00 #011at java.lang.Thread.run(Thread.java:829) [?:?]
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,945 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9006 Worker disconnected. WORKER_MODEL_LOADED
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,946 [INFO ] W-9006-model_1.0 ACCESS_LOG - /169.254.255.130:54478 "POST /invocations HTTP/1.1" 500 72597
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,947 [INFO ] W-9006-model_1.0 TS_METRICS - Requests5XX.Count:1|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489779
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,947 [WARN ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9006-model_1.0-stderr
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,948 [WARN ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9006-model_1.0-stdout
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,949 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9006 in 1 seconds.
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,995 [INFO ] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1684489853995
2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,997 [INFO ] W-9007-model_1.0-stdout MODEL_LOG - Backend received inference at: 1684489853
2023-05-19T12:50:55.010+03:00 2023-05-19T09:50:54,024 [INFO ] W-9006-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9006-model_1.0-stdout
2023-05-19T12:50:55.010+03:00 2023-05-19T09:50:54,024 [INFO ] W-9006-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9006-model_1.0-stderr
2023-05-19T12:50:57.010+03:00 2023-05-19T09:50:56,227 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9006
2023-05-19T12:50:57.010+03:00 2023-05-19T09:50:56,235 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-05-19T12:50:57.010+03:00 2023-05-19T09:50:56,235 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - [PID]428
2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,236 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9006
2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,235 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Torch worker started.
2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,237 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.13
2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,238 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9006.
2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,238 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1684489856238
2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,269 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2023-05-19T12:51:07.014+03:00 2023-05-19T09:51:06,351 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 10083
2023-05-19T12:51:07.014+03:00 2023-05-19T09:51:06,351 [INFO ] W-9006-model_1.0 TS_METRICS - W-9006-model_1.0.ms:87564|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489866
2023-05-19T12:51:07.014+03:00 2023-05-19T09:51:06,352 [INFO ] W-9006-model_1.0 TS_METRICS - WorkerThreadTime.ms:30|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489866
2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,971 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:100.0|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899
2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,972 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:46.74235534667969|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899
2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,972 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:9.12277603149414|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899
2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,972 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:16.3|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899
2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,973 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:17432.91796875|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899
2023-05-19T12:51:40.024+03:00 2023-05-19T09:51:39,974 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:13805.67578125|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899
2023-05-19T12:51:40.024+03:00 2023-05-19T09:51:39,974 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:45.0|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899
2023-05-19T12:51:54.028+03:00 2023-05-19T09:51:53,996 [INFO ] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 60000
2023-05-19T12:51:54.028+03:00 2023-05-19T09:51:53,997 [ERROR] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 2
2023-05-19T12:51:54.028+03:00 2023-05-19T09:51:53,997 [ERROR] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
2023-05-19T12:51:54.028+03:00 org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
2023-05-19T12:51:54.028+03:00 #011at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:199) [model-server.jar:?]
2023-05-19T12:51:54.028+03:00 #011at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
There are two likely reasons for the "message": "Worker died." error:
- The instance exhausted its memory. (You can verify this by checking the CloudWatch metrics for the job run.)
- The model server timed out during the job.
You can either use a larger instance for the job or raise the worker and timeout environment variables that your script reads:
model_server_workers = int(os.environ.get(_params.MODEL_SERVER_WORKERS_ENV, num_cpus()))          # SAGEMAKER_MODEL_SERVER_WORKERS
model_server_timeout = int(os.environ.get(_params.MODEL_SERVER_TIMEOUT_ENV, <Enter value here>))  # SAGEMAKER_MODEL_SERVER_TIMEOUT, default 60 seconds
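These variables can also be set from the job configuration rather than inside the script, so the serving container picks them up directly. A rough sketch with the SageMaker Python SDK, where the model artifact path, role, entry point, and framework versions are placeholders:

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",       # placeholder model artifact
    role=role,                                      # your SageMaker execution role
    entry_point="inference.py",                     # placeholder inference script
    framework_version="1.12",
    py_version="py38",
    env={
        "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",   # seconds; the default is 60
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",      # fewer workers reduce memory pressure
    },
)

The same env dictionary can alternatively be passed to model.transformer(env=...) so that it only applies to the transform job.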
The issue ended up being that the model was timing out during the job. I added the same solution explained here and it worked: https://github.com/aws/sagemaker-mxnet-inference-toolkit/issues/77. Thanks for the help! :)