SageMaker batch transform not loading CSV correctly


I am running a batch transform job that is loading data from a CSV. The CSV is formatted as follows:

"joe annes rifle accesories discount"
"cute puppies for sale"
"Two dudes talk about sports"
"Smith & Wesson M&P 500 review"
"Glock vs 1911 handgun"

My code for creating the batch transform is below:

elec_model = PyTorchModel(
    model_data='s3://some_path/binary-models/tar_models/14_10_2022__19_54_23_arms_ammunition.tar.gz',
    role=role,
    entry_point='torchserve_.py',
    source_dir='source_dir',
    framework_version='1.12.0',
    py_version='py38',
)
nl_detector = elec_model.transformer(
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    strategy="MultiRecord",
    assemble_with="Line",
    output_path="s3://some_path/trash_output",
)
nl_detector.transform(
    "s3://brand-safety-training-data/trash",
    content_type="text/csv",
    split_type="Line",
)

When I run this code, instead of the batch job breaking the CSV into one example per line, which is what

split_type="Line" 

should tell the algorithm to do, it ingests all of the sentences in the CSV as one input and outputs a single probability. Also, if I run the same code but switch

strategy="MultiRecord"

to

strategy="SingleRecord"

so that code block looks like this:

nl_detector = elec_model.transformer(
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    strategy="SingleRecord",
    assemble_with="Line",
    output_path="s3://some_path/trash_output",
)

then the algorithm works correctly and performs inference on every sentence in the CSV. Any idea why this is happening?

EDIT 1: When I print the input payload, it looks like this:

"joe annes rifle accesories discount"

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 

"cute puppies for sale"

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 

"Two dudes talk about sports"

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle

Each sentence is an inference example, separated by this log line:

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle

So it seems like SageMaker is separating the inference examples. But when I pass these sentences into a Hugging Face tokenizer, it tokenizes them as if they were one inference example, when they should be 3 distinct inference examples.
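To illustrate the symptom (a hypothetical sketch of what the tokenizer receives, not my actual serving code): the sentences arrive joined into one Python string, so the tokenizer sees a single example:

```python
payload = ('"joe annes rifle accesories discount"\n'
           '"cute puppies for sale"\n'
           '"Two dudes talk about sports"')

# One string object, even though it contains newlines:
print(type(payload).__name__)   # str
print(payload.count("\n"))      # 2
# A Hugging Face tokenizer called as tokenizer(payload) would therefore
# produce a single row, whereas tokenizer(payload.split("\n")) would
# produce three rows.
```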

Asked 2 years ago · 367 views
1 Answer

Hi,

This issue is not really about SageMaker, but about how you pass the data to the transformer's tokenizer.

You are right about split_type="Line", which splits your CSV file by lines. However, MultiRecord asks SageMaker to pack as many lines as possible into each request, up to MaxPayloadInMB, as described in this doc. The default value is 6 MB. SingleRecord, on the other hand, passes lines one at a time.

The text arrives from a byte/IO stream, which is essentially a single string like "joe annes rifle accesories discount\ncute puppies for sale\nTwo dudes talk about sports\nSmith & Wesson M&P 500 review\nGlock vs 1911 handgun". If you pass this directly to the tokenizer, it is treated as one string.

You can first parse the string into a list like

["joe annes rifle accesories discount",  "cute puppies for sale", "Two dudes talk about sports", "Smith & Wesson M&P 500 review", "Glock vs 1911 handgun"]

before passing the data to the tokenizer, which will produce a (5, xxx) tensor. This ensures the transformer treats each sentence individually.
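A minimal sketch of that parsing step (the helper name is hypothetical, and the tokenizer call depends on your model, so it appears only in a comment):

```python
def split_payload(request_body: str) -> list[str]:
    """Split a MultiRecord text/csv payload into one sentence per line.

    The surrounding double quotes from the CSV formatting are stripped
    so the tokenizer sees plain sentences.
    """
    return [
        line.strip().strip('"')
        for line in request_body.split("\n")
        if line.strip()
    ]

payload = ('"joe annes rifle accesories discount"\n'
           '"cute puppies for sale"\n'
           '"Two dudes talk about sports"\n'
           '"Smith & Wesson M&P 500 review"\n'
           '"Glock vs 1911 handgun"')

sentences = split_payload(payload)
print(len(sentences))  # 5
# Passing the list (not the raw string) to a Hugging Face tokenizer,
# e.g. tokenizer(sentences, padding=True, return_tensors="pt"),
# yields one row per sentence instead of a single row.
```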

AWS
Answered 2 years ago
