SageMaker batch transform not loading CSV correctly


I am running a batch transform job that is loading data from a CSV. The CSV is formatted as follows:

"joe annes rifle accesories discount"
"cute puppies for sale"
"Two dudes talk about sports"
"Smith & Wesson M&P 500 review"
"Glock vs 1911 handgun"

My code for creating the batch transform is below:

elec_model = PyTorchModel(
    model_data='s3://some_path/binary-models/tar_models/14_10_2022__19_54_23_arms_ammunition.tar.gz',
    role=role,
    entry_point='torchserve_.py',
    source_dir='source_dir',
    framework_version='1.12.0',
    py_version='py38',
)
nl_detector = elec_model.transformer(
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    strategy="MultiRecord",
    assemble_with="Line",
    output_path="s3://some_path/trash_output",
)
nl_detector.transform(
    "s3://brand-safety-training-data/trash",
    content_type="text/csv",
    split_type="Line",
)

When I run this code, instead of the batch job breaking the CSV into one example per line, which is what

split_type="Line" 

should tell the algorithm to do, it ingests all of the sentences in the CSV as one input and outputs a single probability. Also, if I run the same code but switch

strategy="MultiRecord"

to

strategy="SingleRecord"

so that code block looks like this:

nl_detector = elec_model.transformer(
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    strategy="SingleRecord",
    assemble_with="Line",
    output_path="s3://some_path/trash_output",
)

then the algorithm works correctly and performs inference on every sentence in the CSV. Any idea why this is happening?

EDIT 1: When I print the input payload, it looks like this:

"joe annes rifle accesories discount"

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 

"cute puppies for sale"

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 

"Two dudes talk about sports"

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle

Each sentence is an inference example, separated by this log line:

2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle

So it seems like SageMaker is separating the inference examples. But when I pass these sentences into a Hugging Face tokenizer, it tokenizes them as if they were one inference example, when they should be 3 distinct inference examples.
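To illustrate the symptom (a hypothetical sketch of what the tokenizer receives, not my actual serving code): the sentences arrive joined into one Python string, so the tokenizer sees a single example:

```python
payload = ('"joe annes rifle accesories discount"\n'
           '"cute puppies for sale"\n'
           '"Two dudes talk about sports"')

# One string object, even though it contains newlines:
print(type(payload).__name__)   # str
print(payload.count("\n"))      # 2
# A Hugging Face tokenizer called as tokenizer(payload) would therefore
# produce a single row, whereas tokenizer(payload.split("\n")) would
# produce three rows.
```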

Asked 2 years ago · 367 views
1 Answer

Hi,

This issue is not really about SageMaker, but about how you pass the data to the transformer's tokenizer.

You are right about split_type="Line", which splits your CSV file by lines. However, MultiRecord asks SageMaker to pack as many lines as possible into each request, up to MaxPayloadInMB, as described in this doc. The default value is 6 MB. SingleRecord, on the other hand, passes lines one at a time.

The text arrives from a byte/IO stream, which is essentially a single string like "joe annes rifle accesories discount\ncute puppies for sale\nTwo dudes talk about sports\nSmith & Wesson M&P 500 review\nGlock vs 1911 handgun". If you pass this directly to the tokenizer, it is treated as one string.

You can first parse the string into a list like

["joe annes rifle accesories discount",  "cute puppies for sale", "Two dudes talk about sports", "Smith & Wesson M&P 500 review", "Glock vs 1911 handgun"]

before passing the data to the tokenizer, which will produce a (5, xxx) tensor. This ensures the transformer treats each sentence individually.
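A minimal sketch of that parsing step (the helper name is hypothetical, and the tokenizer call depends on your model, so it appears only in a comment):

```python
def split_payload(request_body: str) -> list[str]:
    """Split a MultiRecord text/csv payload into one sentence per line.

    The surrounding double quotes from the CSV formatting are stripped
    so the tokenizer sees plain sentences.
    """
    return [
        line.strip().strip('"')
        for line in request_body.split("\n")
        if line.strip()
    ]

payload = ('"joe annes rifle accesories discount"\n'
           '"cute puppies for sale"\n'
           '"Two dudes talk about sports"\n'
           '"Smith & Wesson M&P 500 review"\n'
           '"Glock vs 1911 handgun"')

sentences = split_payload(payload)
print(len(sentences))  # 5
# Passing the list (not the raw string) to a Hugging Face tokenizer,
# e.g. tokenizer(sentences, padding=True, return_tensors="pt"),
# yields one row per sentence instead of a single row.
```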

AWS
Answered 2 years ago
