I am running a batch transform job that is ingesting data from a CSV. The CSV is formatted as follows:
"joe annes rifle accesories discount"
"cute puppies for sale"
"Two dudes talk about sports"
"Smith & Wesson M&P 500 review"
"Glock vs 1911 handgun"
My code for creating the batch transform is below:
elec_model = PyTorchModel(
    model_data='s3://some_path/binary-models/tar_models/14_10_2022__19_54_23_arms_ammunition.tar.gz',
    role=role,
    entry_point='torchserve_.py',
    source_dir='source_dir',
    framework_version='1.12.0',
    py_version='py38',
)
nl_detector = elec_model.transformer(
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    strategy="MultiRecord",
    assemble_with="Line",
    output_path="s3://some_path/trash_output",
)
nl_detector.transform(
    "s3://brand-safety-training-data/trash",
    content_type="text/csv",
    split_type="Line",
)
When I run this code, instead of the batch job taking the CSV and breaking it into one inference example per line, which is what
split_type="Line"
tells the algorithm to do, it ingests all of the sentences in the above CSV as one example and outputs a single probability. Also, if I do the same thing with the same code but switch
strategy="MultiRecord"
to
strategy="SingleRecord"
so that the code block looks like this:
nl_detector = elec_model.transformer(
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    strategy="SingleRecord",
    assemble_with="Line",
    output_path="s3://some_path/trash_output",
)
The algorithm works correctly, performing inference on each of the above sentences in the CSV individually. Any reason why this is happening?
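For context, my understanding (an assumption on my part, not something I have confirmed in the docs) is that split_type="Line" splits the S3 object into records on newlines, while strategy controls how many of those records are packed into each request sent to the container. If that is right, the handler would see payloads roughly like this (an illustrative sketch, not captured requests):

# strategy="SingleRecord": one record per invocation
single_record_payload = '"joe annes rifle accesories discount"'

# strategy="MultiRecord": several records joined with "\n" in one invocation
multi_record_payload = (
    '"joe annes rifle accesories discount"\n'
    '"cute puppies for sale"\n'
    '"Two dudes talk about sports"'
)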
EDIT 1:
When I print the input payload, it looks like this:
"joe annes rifle accesories discount"
2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
"cute puppies for sale"
2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
"Two dudes talk about sports"
2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle
where each sentence is an inference example, separated by this log statement:
2022-10-26T21:03:04,265 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle
So it seems like SageMaker is separating the inference examples. But when I pass these sentences into a Hugging Face tokenizer, the tokenizer treats them as one inference example, when they should be 3 distinct inference examples.
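If that is what is happening, the fix would belong in the entry point: with MultiRecord the handler receives all of the lines in one request body, so it has to split on newlines itself and feed the result to the tokenizer as a batch. Below is a minimal sketch of what I mean, assuming a torchserve_.py-style input_fn and a placeholder Hugging Face checkpoint (the real tokenizer would be loaded from the model artifact):

from transformers import AutoTokenizer

# Placeholder checkpoint for illustration only.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def input_fn(request_body, content_type="text/csv"):
    """Split a MultiRecord payload into individual sentences and tokenize them as a batch."""
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    # One record per line; strip the surrounding quotes from the CSV values.
    sentences = [
        line.strip().strip('"')
        for line in request_body.splitlines()
        if line.strip()
    ]
    # Tokenizing a list yields a batch of len(sentences) examples,
    # rather than one long concatenated sequence.
    return tokenizer(
        sentences,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )

With SingleRecord each request body is already a single sentence, which would explain why that configuration works without any splitting.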