Personalize fails to read compressed CSV?

0

I have user-item interaction data in CSV files and I'm trying to import them into a Personalize dataset. When the CSV files are gzip (GZ) compressed, I get the error message:

Input csv is missing the following columns: [EVENT_TYPE, ITEM_ID, TIMESTAMP, USER_ID]

When I uncompress a test file, it loads fine. The files have the appropriate column headers; indeed, the uncompressed file is consumed quite happily.

Is it truly possible that Personalize cannot read compressed CSV files? This seems so unbelievable that I feel I must be doing something wrong. Uncompressing all of this plaintext data is incredibly time- and space-consuming. Am I missing some trick for compressed files? There's zero mention of this in the Personalize documentation. (Likewise, there's no mention of S3 prefixes to load multiple/sharded data files in a single job, either ... this portion of the Personalize documentation is quite thin, and trial & error seems to be the only way to discover these nuances.)
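For reference, this is roughly the import job I'm creating (the ARNs and bucket/key names below are placeholders); pointing dataLocation at the uncompressed .csv instead succeeds:

```python
import boto3

personalize = boto3.client("personalize")

# Placeholder ARNs and S3 key -- the real job points at a gzip-compressed CSV.
personalize.create_dataset_import_job(
    jobName="interactions-import-gz",
    datasetArn="arn:aws:personalize:us-east-1:111122223333:dataset/my-dataset-group/INTERACTIONS",
    dataSource={"dataLocation": "s3://my-bucket/interactions/part-00000.csv.gz"},
    roleArn="arn:aws:iam::111122223333:role/PersonalizeS3AccessRole",
)
```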

Murat
Asked 9 months ago · 263 views
2 answers
1

Hi,

As specified in the Data format guidelines page of the Personalize documentation, input data must be in a CSV file.

The first step in the Amazon Personalize workflow is to create a dataset group. Upon creating a dataset group, if you want to import data from multiple data sources into an Amazon Personalize dataset, you can use Amazon SageMaker Data Wrangler. Data Wrangler is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, and analyze data. See bulk data imports in the documentation.

If your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of the folder; it doesn't use any data in sub-folders. Use the following syntax, with a / after the folder name: s3://<name of your S3 bucket>/<folder path>/. See Importing bulk records with a dataset import job.
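As a minimal sketch of such a folder-level import with boto3 (the ARNs are placeholders):

```python
import boto3

personalize = boto3.client("personalize")

# The trailing slash makes Personalize import every CSV in the first level
# of the folder; files in sub-folders are ignored.
personalize.create_dataset_import_job(
    jobName="interactions-bulk-import",
    datasetArn="arn:aws:personalize:us-east-1:111122223333:dataset/my-dataset-group/INTERACTIONS",
    dataSource={"dataLocation": "s3://<name of your S3 bucket>/<folder path>/"},
    roleArn="arn:aws:iam::111122223333:role/PersonalizeS3AccessRole",
)
```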

Lastly, you have three ways to update your datasets in Personalize; see this blog post for a comprehensive explanation.

Hope this helps.

jnavrro (AWS)
Answered 9 months ago
Reviewed 9 months ago (AWS Expert)
  • Yes, I know. As mentioned in the question: the data is already in CSV format, and is being read correctly in that format. The question is about compressed CSV files. (And secondarily a question about multiple files in an S3 prefix, as is typical for a sharded-storage scheme.)

  • Agreed that input data for Personalize must natively be in CSV format.

  • Personalize does not support compressed files today; only uncompressed CSV is supported. Regarding the second point, if your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of the folder; it doesn't use any data in sub-folders. Use the following syntax, with a / after the folder name: s3://<name of your S3 bucket>/<folder path>/. See https://docs.aws.amazon.com/personalize/latest/dg/bulk-data-import-step.html

0

Hi, the simplest way to work around this is to create a Lambda function that is automatically triggered each time a file is written to your bucket. If the file is compressed, the function will automatically decompress it for you.

See https://levelup.gitconnected.com/automating-zip-extraction-with-lambda-and-s3-9a083d4e8bab for an example
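A minimal sketch of such a function, assuming gzip-compressed CSVs and that each decompressed file fits in Lambda memory (bucket layout and the .csv.gz suffix convention are assumptions):

```python
import gzip
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by s3:ObjectCreated; writes an uncompressed copy of each .csv.gz."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".csv.gz"):
            continue
        compressed = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Whole-object decompression; stream instead if the files are very large.
        s3.put_object(
            Bucket=bucket,
            Key=key[: -len(".gz")],
            Body=gzip.decompress(compressed),
        )
```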

Best,

Didier

AWS Expert
Answered 9 months ago
  • Thanks for the idea! I'm first seeking to confirm that in fact Personalize does not handle compressed files. Decompressing our full initial training set is quite cumbersome (even with Lambda). Given that every other AWS 'data' product with which I've dealt handles various forms of compression, this still feels like it should be possible.

    Also, rather than decompressing each file, I'd probably trigger a Lambda to decompress in memory (i.e., streaming) and use Personalize's PutEvents endpoint for incremental training, roughly as in the sketch below. (Though I'd still prefer to just import the GZ CSV files :-)
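A rough sketch of that streaming approach, assuming an existing event tracker (the tracking ID is a placeholder), epoch-second timestamps, and no batching or retry handling:

```python
import csv
import gzip
import io
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
events = boto3.client("personalize-events")

TRACKING_ID = "<event tracker tracking id>"  # placeholder: requires an existing event tracker

def handler(event, context):
    """Stream a gzip-compressed interactions CSV from S3 straight into PutEvents."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        rows = csv.DictReader(io.TextIOWrapper(gzip.GzipFile(fileobj=body), encoding="utf-8"))
        for row in rows:  # expects USER_ID, ITEM_ID, EVENT_TYPE, TIMESTAMP headers
            events.put_events(
                trackingId=TRACKING_ID,
                userId=row["USER_ID"],
                sessionId=row["USER_ID"],  # assumption: no real session ids in the data
                eventList=[{
                    "eventType": row["EVENT_TYPE"],
                    "itemId": row["ITEM_ID"],
                    "sentAt": datetime.fromtimestamp(int(row["TIMESTAMP"]), tz=timezone.utc),
                }],
            )
```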
