SageMaker batch transform not loading CSV correctly


I am running a batch transform job that is loading data from a CSV. The CSV is formatted as follows:

"joe annes rifle accesories discount"
"cute puppies for sale"
"Two dudes talk about sports"
"Smith & Wesson M&P 500 review"
"Glock vs 1911 handgun"

My code for creating the batch transform is below

elec_model = PyTorchModel(model_data='s3://some_path/binary-models/tar_models/14_10_2022__19_54_23_arms_ammunition.tar.gz',
                          role=role,
                          entry_point='torchserve_.py',
                          source_dir='source_dir',
                          framework_version='1.12.0',
                          py_version='py38')
nl_detector = elec_model.transformer(instance_count=1,
                                     instance_type='ml.g4dn.xlarge',
                                     strategy="MultiRecord",
                                     assemble_with="Line",
                                     output_path="s3://some_path/trash_output")
nl_detector.transform("s3://brand-safety-training-data/trash",
                      content_type="text/csv",
                      split_type="Line")

When I run this code, instead of the batch job splitting the CSV into one example per line, which is what

split_type="Line" 

tells it to do, it ingests all of the sentences in the above CSV as a single input and outputs one probability. Also, if I do the same thing with the same code, but switch

strategy="MultiRecord"

to

strategy="SingleRecord"

so that code block looks like this:

nl_detector = elec_model.transformer(instance_count=1,
                                     instance_type='ml.g4dn.xlarge',
                                     strategy="SingleRecord",
                                     assemble_with="Line",
                                     output_path="s3://some_path/trash_output")

then the algorithm works correctly and performs inference on each of the sentences in the CSV individually. Any reason why this is happening?

EDIT 1: When I print the input payload it looks like this

"joe annes rifle accesories discount"

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 

"cute puppies for sale"

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 

"Two dudes talk about sports"

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle

Each sentence is an inference example, separated by this log line:

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle

So it seems like SageMaker is separating the inference examples. But when I pass these sentences to a Hugging Face tokenizer, it tokenizes them as if they were one inference example, when they should be three distinct inference examples.
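To make the mismatch concrete, here is what I believe is happening with the payload (a simplified reproduction outside SageMaker, with no tokenizer involved):

```python
# Simulated payload as the serving container receives it:
# one string containing newlines.
payload = ('"joe annes rifle accesories discount"\n'
           '"cute puppies for sale"\n'
           '"Two dudes talk about sports"')

# Handed to a tokenizer as-is, this is a single sequence (one probability out)...
as_single = [payload]
print(len(as_single))   # 1

# ...whereas splitting on newlines first yields one example per sentence.
as_batch = payload.splitlines()
print(len(as_batch))    # 3
```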

asked a year ago · 357 views
1 Answer

Hi,

This issue is not really related to SageMaker, but to how you pass the data to the transformers tokenizer.

You are right about split_type="Line", which splits your CSV file by lines. However, MultiRecord asks SageMaker to pack as many lines as possible into each request, up to MaxPayloadInMB, as described in this doc. The default value is 6 MB. SingleRecord, on the other hand, passes lines one by one.
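The packing behavior can be sketched as greedily grouping lines until a payload-size budget is reached (a simplified model for illustration, not the actual service code; the budget is a parameter here so small numbers can stand in for the 6 MB default):

```python
def pack_multirecord(lines, max_payload_bytes):
    """Greedily pack newline-joined lines into payloads up to a byte budget.
    A simplified illustration of MultiRecord batching, not the service code."""
    payloads, current, size = [], [], 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if current and size + line_size > max_payload_bytes:
            payloads.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        payloads.append("\n".join(current))
    return payloads

lines = ["cute puppies for sale", "Two dudes talk about sports", "Glock vs 1911 handgun"]

# With a large budget (like the 6 MB default), everything lands in one request...
print(len(pack_multirecord(lines, 6 * 1024 * 1024)))  # 1
# ...while a tiny budget degenerates to one line per payload, like SingleRecord.
print(len(pack_multirecord(lines, 1)))                # 3
```

This is why MultiRecord hands your model one newline-joined string, while SingleRecord hands it one sentence at a time.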

The text arrives from a byte/IO stream, which is essentially a single string such as "joe annes rifle accesories discount\ncute puppies for sale\nTwo dudes talk about sports\nSmith & Wesson M&P 500 review\nGlock vs 1911 handgun". If you pass this directly to the tokenizer, it will be treated as a single string.

You can first parse the string into a list like

["joe annes rifle accesories discount", "cute puppies for sale", "Two dudes talk about sports", "Smith & Wesson M&P 500 review", "Glock vs 1911 handgun"]

before passing the data to the tokenizer, which will result in a (5, xxx) tensor. This ensures the transformer processes each sentence individually.
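Concretely, the parsing step could look like the sketch below. parse_payload is a hypothetical helper name, and the quote-stripping assumes the CSV quoting shown in the question:

```python
def parse_payload(payload: str) -> list:
    """Split a newline-joined MultiRecord payload into individual sentences,
    stripping the surrounding CSV quotes (hypothetical helper)."""
    return [line.strip().strip('"') for line in payload.splitlines() if line.strip()]

payload = ('"joe annes rifle accesories discount"\n'
           '"cute puppies for sale"\n'
           '"Two dudes talk about sports"\n'
           '"Smith & Wesson M&P 500 review"\n'
           '"Glock vs 1911 handgun"')

sentences = parse_payload(payload)
print(len(sentences))  # 5
# The list can now go to the tokenizer, e.g. tokenizer(sentences, padding=True,
# return_tensors="pt"), producing a (5, seq_len) batch instead of one sequence.
```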

AWS
answered a year ago
