CreateTransformJob batching


I'm using CreateTransformJob to submit a SageMaker inference task. I have a single input file to process consisting of 25k JSONL records; the total file size is 2.5 MB, and the model is a simple PyTorch text classifier.

I just want to confirm my understanding of how transform jobs work. I've configured the job to use Line splitting. It seems, though, that because MaxPayloadInMB is an integer, setting it to 1 will result in 3 (not-so-mini) batches of roughly 10k records each?

So assuming I use an instance with 4 vCPUs and MaxConcurrentTransforms=4, it should run all 3 batches in parallel? But then there doesn't seem to be any reason to use a larger instance / more instances to increase throughput, since if my understanding is correct there's no way to explicitly set the mini-batch size any smaller than roughly total_records * MaxPayloadInMB // size_of_file_in_mb records?

Am I correct? If I want smaller mini-batches, do I need to split the file myself and then manually reassemble the output?

"MaxConcurrentTransforms": max_concurrent_transforms, 
"MaxPayloadInMB": max_payload,
"BatchStrategy": "MultiRecord",
"TransformOutput": {
    "S3OutputPath": batch_output,
    "AssembleWith": "Line",
    "Accept": "application/jsonlines",
},
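For context, here is a sketch of how those parameters fit into a complete CreateTransformJob request with SplitType set to Line, so the JSONL input is split into per-line records. The job name, model name, bucket paths, and instance type below are hypothetical placeholders, and the actual boto3 call is left commented out:

```python
# Hypothetical request for boto3's sagemaker.create_transform_job.
# Replace the job name, model name, and S3 URIs with your own.
request = {
    "TransformJobName": "text-classifier-batch-001",
    "ModelName": "my-pytorch-text-classifier",
    "MaxConcurrentTransforms": 4,
    "MaxPayloadInMB": 1,
    "BatchStrategy": "MultiRecord",
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/input/records.jsonl",
            }
        },
        "ContentType": "application/jsonlines",
        "SplitType": "Line",  # split the JSONL file into per-line records
    },
    "TransformOutput": {
        "S3OutputPath": "s3://my-bucket/output/",
        "AssembleWith": "Line",  # join output records back with newlines
        "Accept": "application/jsonlines",
    },
    "TransformResources": {
        "InstanceType": "ml.m5.xlarge",  # 4 vCPUs
        "InstanceCount": 1,
    },
}

# import boto3
# boto3.client("sagemaker").create_transform_job(**request)
```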
1 Answer
Accepted Answer

Hello Dave,

The AWS documentation mentions that you can control the size of the mini-batches by using the BatchStrategy and MaxPayloadInMB parameters [1]. With BatchStrategy you can specify the number of records to include in a mini-batch for an inference request [2], where each record is a single unit of input data that inference can be made on; for example, a single line in a CSV file is a record. If the input data is very large, you can set the value of MaxPayloadInMB to 0 to stream the data to the algorithm. However, this feature is not supported for Amazon SageMaker built-in algorithms.

Moreover, to split input files into mini-batches when you create a batch transform job, set the SplitType parameter value to Line [1]. If SplitType is set to None or if an input file can't be split into mini-batches, SageMaker uses the entire input file in a single request. Please note that Batch Transform doesn't support CSV-formatted input that contains embedded newline characters. If you set SplitType to Line, you can then set the AssembleWith parameter to Line to concatenate the output records with a line delimiter. Thus, you do not need to manually reassemble the output. If you don't specify the AssembleWith parameter, by default the output records are concatenated in a binary format.
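To make the sizing concrete, here is a rough back-of-the-envelope model (plain arithmetic, not an exact description of SageMaker internals) using the numbers from the question, a 2.5 MB file of 25k lines: with MultiRecord batching, each request packs as many whole records as fit within MaxPayloadInMB, so the mini-batch size is effectively bounded below by that payload limit.

```python
import math

file_size_mb = 2.5
total_records = 25_000
max_payload_mb = 1  # MaxPayloadInMB is an integer, so 1 MB is the smallest non-zero value

# Approximation: with MultiRecord, each payload holds as many whole
# records as fit under the payload cap.
avg_record_mb = file_size_mb / total_records
records_per_batch = math.floor(max_payload_mb / avg_record_mb)
num_batches = math.ceil(total_records / records_per_batch)

print(records_per_batch)  # 10000 records per full payload
print(num_batches)        # 3 requests to cover the whole file
```

Under this model, splitting the file manually would not shrink the mini-batches either; only a smaller payload cap (or SingleRecord batching) would.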

I hope that this information will be helpful.

References:

  1. https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html#batch-transform-large-datasets
  2. https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#API_CreateTransformJob_RequestParameters
Cebi
answered a year ago
