CreateTransformJob batching

I'm using CreateTransformJob to submit a SageMaker batch inference task. I have a single input file to process, consisting of 25k JSONL records (2.5 MB total), and the model is a simple PyTorch text classifier.

I just want to confirm my understanding of how transform jobs work. I've configured the job to split the input by line. It seems, though, that because MaxPayloadInMB is an integer, setting it to 1 will result in just 3 (not so mini) batches (i.e. roughly 8.3k records per batch)?

So assuming I use an instance with 4 vCPUs and MaxConcurrentTransforms=4, it should run all 3 batches in parallel? But there doesn't seem to be any reason to use a larger instance / more instances to increase throughput, since, if my understanding is correct, there's no way to explicitly make a mini-batch any smaller than a MaxPayloadInMB-sized chunk of the file, i.e. roughly total_records * MaxPayloadInMB / file_size_in_mb records.
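
To make my mental math concrete (a sketch, assuming SageMaker packs each mini-batch up to the payload cap):

    import math

    total_records = 25_000
    file_size_mb = 2.5
    max_payload_mb = 1  # MaxPayloadInMB is an integer, so 1 is the smallest non-zero cap

    n_batches = math.ceil(file_size_mb / max_payload_mb)      # 3
    records_per_batch = math.ceil(total_records / n_batches)  # ~8,334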

Am I correct? If I want smaller mini-batches, do I need to manually split the file myself and then manually reassemble the output?

"MaxConcurrentTransforms": max_concurrent_transforms, 
"MaxPayloadInMB": max_payload,
"BatchStrategy": "MultiRecord",
"TransformOutput": {
    "S3OutputPath": batch_output,
    "AssembleWith": "Line",
    "Accept": "application/jsonlines",
},
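
If manual splitting is indeed required, I assume I'd do something like this before uploading to S3 (a sketch; the paths and shard size are hypothetical):

    import os

    # Split the 25k-record JSONL file into shards of `shard_size` records,
    # so each shard (and therefore each mini-batch) is smaller.
    shard_size = 1_000  # hypothetical
    os.makedirs("shards", exist_ok=True)
    with open("input.jsonl") as f:
        lines = f.readlines()
    for i in range(0, len(lines), shard_size):
        with open(f"shards/input-{i // shard_size:04d}.jsonl", "w") as out:
            out.writelines(lines[i : i + shard_size])
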
1 Answer
Accepted Answer

Hello Dave,

According to the AWS documentation, you can control the size of the mini-batches by using the BatchStrategy and MaxPayloadInMB parameters [1]. With BatchStrategy you specify whether each inference request contains a single record or multiple records [2], where a record is a single unit of input data that inference can be made on; for example, a single line in a CSV file is a record. If the input data is very large, you can set MaxPayloadInMB to 0 to stream the data to the algorithm. However, this feature is not supported for Amazon SageMaker built-in algorithms.
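
For illustration, the options look roughly like this as request parameters (a sketch; the values are examples only):

    # Option 1: one record per inference request.
    strategy_single = {"BatchStrategy": "SingleRecord"}

    # Option 2: pack as many records as fit under the payload cap into
    # each request (MaxPayloadInMB defaults to 6).
    strategy_multi = {"BatchStrategy": "MultiRecord", "MaxPayloadInMB": 6}

    # Option 3: MaxPayloadInMB = 0 streams the data to the algorithm
    # (not supported for SageMaker built-in algorithms).
    strategy_stream = {"BatchStrategy": "MultiRecord", "MaxPayloadInMB": 0}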

Moreover, to split input files into mini-batches when you create a batch transform job, set the SplitType parameter to Line [1]. If SplitType is set to None, or if an input file can't be split into mini-batches, SageMaker uses the entire input file in a single request. Please note that Batch Transform doesn't support CSV-formatted input that contains embedded newline characters. If you set SplitType to Line, you can then set the AssembleWith parameter to Line to concatenate the output records with a line delimiter, so you do not need to manually reassemble the output. If you don't specify AssembleWith, the output records are concatenated in a binary format by default.
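
Putting these parameters together, a minimal boto3 sketch (the job name, model name, instance type, and S3 URIs are placeholders):

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_transform_job(
        TransformJobName="jsonl-classifier-transform",  # placeholder
        ModelName="my-pytorch-classifier",              # placeholder
        MaxConcurrentTransforms=4,
        MaxPayloadInMB=1,
        BatchStrategy="MultiRecord",
        TransformInput={
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/batch-input/",  # placeholder
                }
            },
            "ContentType": "application/jsonlines",
            "SplitType": "Line",     # split the input file into records by line
        },
        TransformOutput={
            "S3OutputPath": "s3://my-bucket/batch-output/",  # placeholder
            "AssembleWith": "Line",  # concatenate output records with a line delimiter
            "Accept": "application/jsonlines",
        },
        TransformResources={
            "InstanceType": "ml.m5.xlarge",  # placeholder
            "InstanceCount": 1,
        },
    )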

I hope that this information will be helpful.

References:

  1. https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html#batch-transform-large-datasets
  2. https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#API_CreateTransformJob_RequestParameters
Cebi
answered a year ago
