SageMaker Batch Transform Troubleshooting - 504 Gateway Timeout

I am having trouble running a batch transform job with our scikit-learn model and a Parquet file as input. The job completes for a batch of 10 records but not for 1000. The failure I’m seeing is “Bad HTTP status received from algorithm: 504” with “504 Gateway Time-out”. This error seems somewhat misleading: I’ve added writes of the prediction files to S3 within the output_fn specified in the entry_script, and those writes complete successfully, multiple times per transform job.

Specifically, for the 1000-record Parquet file I see the following sequence (the 10-record file completes after a single pass):

  1. A valid request, a successful prediction of the entire dataset, and a write to S3.
  2. Another valid request, a successful prediction of the same dataset, and a write to S3 (I’m assuming this is a retry).
  3. Job failure with a 504.

In summary, the response from the model to the transform job seems to be what is bad, resulting in a retry and an eventual timeout. However, I’m having real trouble getting any helpful diagnostics. I have tried tweaking the parameters available when creating the Transformer object and when calling the transform method, starting with the instructions in this documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-errors.html

Questions:

  1. Are there any recommendations on how to better troubleshoot why these transform jobs are failing? Specifically, what happens between the response returned in the entry_script and the handling of that response by the transform job?
  2. The data resulting from our prediction consists of multiple levels, which the transform job can’t handle because it requires a response for each row in the initial dataset. We work around this by having the output_fn in the entry_script write each of these levels as files directly to S3, which works as expected for our 1000-row use case. Is there any way to have the job simply return a success message after these files are written, rather than having to send a response “prediction” for each row?
  3. Are there any specific configurations I should try when setting up the transformer and executing the transform job?
Dean
asked 6 months ago
234 views
1 Answer

Greetings,

Please note that Batch Transform does not currently support Parquet files. Support is on the internal team's roadmap, but I cannot say with certainty when the feature will be implemented. If you need further details or support, please open a Support case and include the following details:

  • batch transform job ARN
  • logs showing the error and how it starts
  • inference script or entry_point script for the batch transform
  • Dockerfile, if you are using your own inference container
  • sample data, if possible

The third-party link [1] suggests that the input_fn function in the inference script can handle Parquet input. However, from what I understand, Parquet is not supported for Batch Transform in SageMaker (you can use CSV or JSON with no issue). I quote the line below from link [2].

The input to batch transforms must be of a format that can be split into smaller files to process in parallel. These formats include CSV, JSON, JSON Lines, TFRecord and RecordIO.
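
As a possible workaround, the input could be converted to one of the line-based formats listed above before the job is started. The following is a minimal sketch, not a tested recipe for your data. It assumes the Parquet rows have already been loaded into a list of dictionaries (reading Parquet needs a third-party library such as pandas or pyarrow, which is not shown), and `rows_to_csv` and `rows_to_json_lines` are hypothetical helper names.

```python
import csv
import io
import json


def rows_to_csv(rows, columns):
    """Serialize rows (a list of dicts) as header-less CSV, one record per line.

    No header row is written, on the assumption that the job splits the
    input by line and would otherwise treat the header as a record.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[col] for col in columns])
    return buf.getvalue()


def rows_to_json_lines(rows):
    """Serialize rows (a list of dicts) as JSON Lines, one object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


if __name__ == "__main__":
    sample = [{"id": 1, "label": "a"}, {"id": 2, "label": "b,c"}]
    print(rows_to_csv(sample, ["id", "label"]), end="")
    print(rows_to_json_lines(sample), end="")
```

Either output can be uploaded to S3 and used as the transform input. Because each record sits on its own line, the job can split the file into smaller requests instead of sending all 1000 rows at once.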

References:
[1] https://stackoverflow.com/questions/62415237/aws-sagemaker-using-parquet-file-for-batch-transform-job
[2] https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html#your-algorithms-batch-code-run-image
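
On the question about configuration, the following is a minimal sketch rather than a verified fix. It builds a request dictionary for the CreateTransformJob API. The job, model, and bucket names and the instance type are placeholders, and `build_transform_request` is a hypothetical helper. Setting `InvocationsMaxRetries` to 0 should surface the first failed invocation instead of the repeated requests described in the question, and `InvocationsTimeoutInSeconds` raises the per-request timeout on the Batch Transform side only (it does not change any timeout inside the container).

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            max_payload_mb=6, max_concurrent=1,
                            timeout_s=3600, max_retries=0):
    """Build keyword arguments for the SageMaker CreateTransformJob API."""
    # The API requires MaxPayloadInMB * MaxConcurrentTransforms <= 100.
    if max_payload_mb * max_concurrent > 100:
        raise ValueError("max_payload_mb * max_concurrent must not exceed 100")
    if not 1 <= timeout_s <= 3600:
        raise ValueError("timeout_s must be between 1 and 3600")
    if not 0 <= max_retries <= 3:
        raise ValueError("max_retries must be between 0 and 3")
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "MaxConcurrentTransforms": max_concurrent,
        "MaxPayloadInMB": max_payload_mb,
        # Send several line-delimited records per request, up to the payload limit.
        "BatchStrategy": "MultiRecord",
        "ModelClientConfig": {
            "InvocationsTimeoutInSeconds": timeout_s,
            "InvocationsMaxRetries": max_retries,
        },
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",  # assumes the input was converted to CSV
            "SplitType": "Line",
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri, "AssembleWith": "Line"},
        # Placeholder instance settings; choose what fits the model.
        "TransformResources": {"InstanceType": "ml.m5.large", "InstanceCount": 1},
    }


if __name__ == "__main__":
    request = build_transform_request(
        "example-job", "example-model",
        "s3://example-bucket/input/", "s3://example-bucket/output/",
    )
    print(request["ModelClientConfig"])
```

The resulting dictionary can be passed to `boto3.client("sagemaker").create_transform_job(**request)`. With retries disabled, the container logs for the single failing invocation are usually easier to match against the 504.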

AWS
SUPPORT ENGINEER
answered 6 months ago
EXPERT
reviewed 6 months ago
