Hi,
This issue is less about SageMaker itself and more about how you pass the data to the transformer tokenizer.
You are right to use split_type="Line", which splits your CSV files by line. However, the MultiRecord strategy asks SageMaker to pack as many lines as possible into each request, up to MaxPayloadInMB, as described in this doc. The default value is 6 MB. SingleRecord, on the other hand, passes lines one at a time.
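For reference, here is roughly where these settings sit in a batch transform job request. This is only a fragment of a CreateTransformJob request showing the relevant keys; the other required fields (job name, model name, data source, output location, resources) are left out, and the content type is an assumption based on your CSV input.

```python
# Fragment of a CreateTransformJob request, NOT a complete request.
# Only the keys discussed above are shown.
transform_settings = {
    # "MultiRecord" packs many lines per request; "SingleRecord" sends one line each.
    "BatchStrategy": "MultiRecord",
    # Upper bound on the size of one request payload (6 MB is the default).
    "MaxPayloadInMB": 6,
    "TransformInput": {
        # Assumed content type for CSV input.
        "ContentType": "text/csv",
        # Split the input file on newlines.
        "SplitType": "Line",
    },
}

print(transform_settings["BatchStrategy"])
```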
The text arrives as a byte/IO stream, which is essentially a single string like the following:
"joe annes rifle accesories discount\ncute puppies for sale\nTwo dudes talk about sports\nSmith & Wesson M&P 500 review\nGlock vs 1911 handgun"
If we pass this directly to the tokenizer, it will be treated as a single string.
You can first parse the string into a list like
["joe annes rifle accesories discount", "cute puppies for sale", "Two dudes talk about sports", "Smith & Wesson M&P 500 review", "Glock vs 1911 handgun"]
before passing the data to the tokenizer, which will produce a (5, xxx) tensor. This ensures the transformer processes each sentence individually.
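A minimal sketch of that parsing step, assuming the payload reaches your code as bytes or as a string with one record per line. The helper name split_records is made up for illustration, and the tokenizer call in the final comment assumes a Hugging Face tokenizer object that you have already loaded.

```python
from typing import List, Union


def split_records(payload: Union[bytes, str]) -> List[str]:
    """Turn a newline-delimited payload into a list of non-empty lines.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if isinstance(payload, (bytes, bytearray)):
        payload = payload.decode("utf-8")
    # Split on newlines; strip() also removes a trailing "\r" from CRLF input.
    # Blank lines (for example after a trailing newline) are skipped.
    return [line.strip() for line in payload.split("\n") if line.strip()]


payload = (
    b"joe annes rifle accesories discount\n"
    b"cute puppies for sale\n"
    b"Two dudes talk about sports\n"
    b"Smith & Wesson M&P 500 review\n"
    b"Glock vs 1911 handgun"
)

sentences = split_records(payload)
print(len(sentences))  # 5

# Assumed next step with an already-loaded Hugging Face tokenizer:
# encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
```

Passing the list instead of the raw string is what gives the tokenizer one row per sentence rather than one long sequence.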