SageMaker Batch Transform MultiRecord Fail with CSV as Input


I am trying to run inference on a CSV file stored in S3 using Batch Transform with strategy='MultiRecord' and assemble_with='Line'. The same setup works with strategy='SingleRecord', but I need it to be as efficient as possible. The problem appears when I switch to MultiRecord with a small CSV of two columns, both text. With max_payload=6, the job succeeds with a CSV of 181 samples (609.7 KB), but with the same CSV extended to 182 samples (621.6 KB) it fails with "message": "Worker died.". I assumed this had something to do with the memory limit of the instance I was using, so I switched to an ml.m5.2xlarge with 32 GB of memory. With max_payload=7 the job suddenly succeeds with the 182-sample CSV (621.6 KB), but it fails with anything bigger than that. Any ideas what could be causing this?
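
For context, the transform job is configured roughly like this (model name, bucket, and paths are placeholders):

# Roughly how the transform job is configured (names and paths are placeholders)
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="my-model",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    strategy="MultiRecord",        # works fine with "SingleRecord"
    assemble_with="Line",
    max_payload=6,                 # in MB; also tried 7
    accept="text/csv",
    output_path="s3://my-bucket/output/",
)

transformer.transform(
    data="s3://my-bucket/input/data.csv",
    content_type="text/csv",
    split_type="Line",             # split the CSV into records line by line
)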

The logs look like this:

2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,218 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 0%| | 0.00/8.68M [00:00<?, ?B/s]

2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,327 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 0%| | 20.0k/8.68M [00:00<00:47, 191kB/s]

2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,434 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 1%| | 100k/8.68M [00:00<00:17, 526kB/s]

2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,543 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 3%|▎ | 228k/8.68M [00:00<00:10, 841kB/s]

2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,651 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 6%|▌ | 509k/8.68M [00:00<00:05, 1.56MB/s]

2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,759 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 12%|█▏ | 1.04M/8.68M [00:00<00:02, 2.91MB/s]

2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,867 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 25%|██▍ | 2.14M/8.68M [00:00<00:01, 5.54MB/s]

2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,975 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 50%|████▉ | 4.31M/8.68M [00:00<00:00, 10.6MB/s]

2023-05-19T12:49:51.989+03:00 2023-05-19T09:49:51,976 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 100%|█████████▉| 8.65M/8.68M [00:00<00:00, 20.7MB/s]

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,129 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 100%|██████████| 8.68M/8.68M [00:00<00:00, 10.5MB/s]

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,129 [WARN ] W-9007-model_1.0-stderr MODEL_LOG -

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,131 [WARN ] W-9007-model_1.0-stderr MODEL_LOG - Downloading: 0%| | 0.00/615 [00:00<?, ?B/s]

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,894 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13647

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,895 [INFO ] W-9005-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13459

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,895 [INFO ] W-9006-model_1.0 TS_METRICS - W-9006-model_1.0.ms:15108|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,896 [INFO ] W-9006-model_1.0 TS_METRICS - WorkerThreadTime.ms:56|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,897 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1684489793897

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,898 [INFO ] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13705

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,898 [INFO ] W-9007-model_1.0 TS_METRICS - W-9007-model_1.0.ms:15111|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,899 [INFO ] W-9007-model_1.0 TS_METRICS - WorkerThreadTime.ms:42|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,896 [INFO ] W-9005-model_1.0 TS_METRICS - W-9005-model_1.0.ms:15111|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,902 [INFO ] W-9005-model_1.0 TS_METRICS - WorkerThreadTime.ms:108|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,906 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13399

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,906 [INFO ] W-9000-model_1.0 TS_METRICS - W-9000-model_1.0.ms:15148|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,907 [INFO ] W-9000-model_1.0 TS_METRICS - WorkerThreadTime.ms:73|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:53.990+03:00 2023-05-19T09:49:53,907 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Backend received inference at: 1684489793

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:53,989 [INFO ] W-9002-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13784

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:53,991 [INFO ] W-9002-model_1.0 TS_METRICS - W-9002-model_1.0.ms:15231|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:53,992 [INFO ] W-9002-model_1.0 TS_METRICS - WorkerThreadTime.ms:56|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489793

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,009 [INFO ] W-9001-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13663

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,009 [INFO ] W-9001-model_1.0 TS_METRICS - W-9001-model_1.0.ms:15249|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,010 [INFO ] W-9001-model_1.0 TS_METRICS - WorkerThreadTime.ms:103|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,038 [INFO ] W-9003-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13293

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,038 [INFO ] W-9003-model_1.0 TS_METRICS - W-9003-model_1.0.ms:15277|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,039 [INFO ] W-9003-model_1.0 TS_METRICS - WorkerThreadTime.ms:52|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,187 [INFO ] W-9004-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 13751

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,187 [INFO ] W-9004-model_1.0 TS_METRICS - W-9004-model_1.0.ms:15426|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794

2023-05-19T12:49:54.991+03:00 2023-05-19T09:49:54,188 [INFO ] W-9004-model_1.0 TS_METRICS - WorkerThreadTime.ms:87|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489794

2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,971 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:66.7|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839

2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,972 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:46.74238204956055|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839

2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,972 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:9.122749328613281|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839

2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,973 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:16.3|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839

2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,973 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:17504.14453125|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839

2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,974 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:13734.4453125|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839

2023-05-19T12:50:40.004+03:00 2023-05-19T09:50:39,974 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:44.8|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489839

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,909 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 60000

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,910 [ERROR] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,910 [ERROR] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error

2023-05-19T12:50:54.009+03:00 org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time

2023-05-19T12:50:54.009+03:00 #011at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:199) [model-server.jar:?]

2023-05-19T12:50:54.009+03:00 #011at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]

2023-05-19T12:50:54.009+03:00 #011at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]

2023-05-19T12:50:54.009+03:00 #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]

2023-05-19T12:50:54.009+03:00 #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]

2023-05-19T12:50:54.009+03:00 #011at java.lang.Thread.run(Thread.java:829) [?:?]

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,945 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9006 Worker disconnected. WORKER_MODEL_LOADED

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,946 [INFO ] W-9006-model_1.0 ACCESS_LOG - /169.254.255.130:54478 "POST /invocations HTTP/1.1" 500 72597

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,947 [INFO ] W-9006-model_1.0 TS_METRICS - Requests5XX.Count:1|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489779

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,947 [WARN ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9006-model_1.0-stderr

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,948 [WARN ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9006-model_1.0-stdout

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,949 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9006 in 1 seconds.

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,995 [INFO ] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1684489853995

2023-05-19T12:50:54.009+03:00 2023-05-19T09:50:53,997 [INFO ] W-9007-model_1.0-stdout MODEL_LOG - Backend received inference at: 1684489853

2023-05-19T12:50:55.010+03:00 2023-05-19T09:50:54,024 [INFO ] W-9006-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9006-model_1.0-stdout

2023-05-19T12:50:55.010+03:00 2023-05-19T09:50:54,024 [INFO ] W-9006-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9006-model_1.0-stderr

2023-05-19T12:50:57.010+03:00 2023-05-19T09:50:56,227 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9006

2023-05-19T12:50:57.010+03:00 2023-05-19T09:50:56,235 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Successfully loaded /opt/conda/lib/python3.8/site-packages/ts/configs/metrics.yaml.

2023-05-19T12:50:57.010+03:00 2023-05-19T09:50:56,235 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - [PID]428

2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,236 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9006

2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,235 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Torch worker started.

2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,237 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.13

2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,238 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9006.

2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,238 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1684489856238

2023-05-19T12:50:57.011+03:00 2023-05-19T09:50:56,269 [INFO ] W-9006-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1

2023-05-19T12:51:07.014+03:00 2023-05-19T09:51:06,351 [INFO ] W-9006-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 10083

2023-05-19T12:51:07.014+03:00 2023-05-19T09:51:06,351 [INFO ] W-9006-model_1.0 TS_METRICS - W-9006-model_1.0.ms:87564|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489866

2023-05-19T12:51:07.014+03:00 2023-05-19T09:51:06,352 [INFO ] W-9006-model_1.0 TS_METRICS - WorkerThreadTime.ms:30|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489866

2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,971 [INFO ] pool-3-thread-2 TS_METRICS - CPUUtilization.Percent:100.0|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899

2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,972 [INFO ] pool-3-thread-2 TS_METRICS - DiskAvailable.Gigabytes:46.74235534667969|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899

2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,972 [INFO ] pool-3-thread-2 TS_METRICS - DiskUsage.Gigabytes:9.12277603149414|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899

2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,972 [INFO ] pool-3-thread-2 TS_METRICS - DiskUtilization.Percent:16.3|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899

2023-05-19T12:51:40.023+03:00 2023-05-19T09:51:39,973 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:17432.91796875|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899

2023-05-19T12:51:40.024+03:00 2023-05-19T09:51:39,974 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:13805.67578125|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899

2023-05-19T12:51:40.024+03:00 2023-05-19T09:51:39,974 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:45.0|#Level:Host|#hostname:d9e41fe4df61,timestamp:1684489899

2023-05-19T12:51:54.028+03:00 2023-05-19T09:51:53,996 [INFO ] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 60000

2023-05-19T12:51:54.028+03:00 2023-05-19T09:51:53,997 [ERROR] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 2

2023-05-19T12:51:54.028+03:00 2023-05-19T09:51:53,997 [ERROR] W-9007-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error

2023-05-19T12:51:54.028+03:00 org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time

2023-05-19T12:51:54.028+03:00 #011at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:199) [model-server.jar:?]

2023-05-19T12:51:54.028+03:00 #011at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)

Alex
asked a year ago · 537 views
1 Answer
Accepted Answer

There are two likely reasons for the error "message": "Worker died.":

  1. The instance is running out of memory. (You can verify this by checking the CloudWatch metrics for the job run.)
  2. The model server is timing out during the job. (The repeated "Backend response time: 60000" entries followed by "Backend worker did not respond in given time" in your logs point to this.)

You can either use a larger instance for the job, or set higher values for the model server worker and timeout environment variables in your inference script:

# _params here refers to the parameters module of the sagemaker-inference toolkit,
# where MODEL_SERVER_WORKERS_ENV = "SAGEMAKER_MODEL_SERVER_WORKERS" and
# MODEL_SERVER_TIMEOUT_ENV = "SAGEMAKER_MODEL_SERVER_TIMEOUT".
model_server_workers = int(os.environ.get(_params.MODEL_SERVER_WORKERS_ENV, num_cpus()))
model_server_timeout = int(os.environ.get(_params.MODEL_SERVER_TIMEOUT_ENV, <Enter value here>))
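
If you launch the job with the SageMaker Python SDK, these variables can also be passed through the model's env parameter. A minimal sketch, assuming a PyTorch serving container and placeholder model data and values:

# Sketch: passing the model server tuning variables when the job is launched
# with the SageMaker Python SDK (model data path and values are placeholders).
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",  # placeholder
    role=role,                                 # your execution role
    entry_point="inference.py",
    framework_version="1.12",
    py_version="py38",
    env={
        "SAGEMAKER_MODEL_SERVER_TIMEOUT": "3600",  # default is 60 seconds
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",     # fewer workers leaves more memory per worker
    },
)

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    strategy="MultiRecord",
    assemble_with="Line",
    max_payload=6,
)

Raising SAGEMAKER_MODEL_SERVER_TIMEOUT above the 60-second default gives each worker more time to process a large MultiRecord payload, and reducing the worker count leaves more memory per worker.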
AWS
answered a year ago
