Personalize fails to read compressed CSV?


I have user-item interaction data in CSV files and I'm trying to import it into a Personalize dataset. When the CSV files are GZ (gzip) compressed, I get the error message:

Input csv is missing the following columns: [EVENT_TYPE, ITEM_ID, TIMESTAMP, USER_ID]

When I uncompress a test file, it loads fine. The files have the appropriate column headers; indeed, the uncompressed file is consumed quite happily.

Is it truly possible that Personalize cannot read compressed CSV files? This seems so unbelievable that I feel I must be doing something wrong. Uncompressing all of this plaintext data is incredibly time- and space-consuming. Am I missing some trick for compressed files? There's zero mention of this in the Personalize documentation. (Likewise, there's no mention of S3 prefixes for loading multiple/sharded data files in a single job ... this portion of the Personalize documentation is quite poor, and it seems trial & error is the only way to discover the nuances.)

Murat
asked 8 months ago · 243 views
2 Answers

Hi,

As specified in the Data format guidelines page of the Personalize documentation, input data must be in a CSV file.

The first step in the Amazon Personalize workflow is to create a dataset group. After you create a dataset group, if you want to import data from multiple data sources into an Amazon Personalize dataset, you can use Amazon SageMaker Data Wrangler. Data Wrangler is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, and analyze data. See bulk data imports in the documentation.

If your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of the folder; it doesn't use any data in sub-folders. Use the following syntax, with a / after the folder name: s3://<name of your S3 bucket>/<folder path>/. See Importing bulk records with a dataset import job.
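For example, a minimal boto3 sketch of such an import job might look like the following (the bucket name, dataset ARN, and role ARN are placeholders to replace with your own):

```python
import boto3

personalize = boto3.client("personalize")

response = personalize.create_dataset_import_job(
    jobName="interactions-bulk-import",
    datasetArn="arn:aws:personalize:us-east-1:123456789012:dataset/my-dataset-group/INTERACTIONS",
    dataSource={
        # The trailing slash tells Personalize to read every CSV in the
        # first level of this folder; sub-folders are ignored.
        "dataLocation": "s3://my-bucket/interactions/"
    },
    roleArn="arn:aws:iam::123456789012:role/PersonalizeS3AccessRole",
)
print(response["datasetImportJobArn"])
```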

Lastly, you have three ways to update your datasets in Personalize; see this blog post for a comprehensive explanation.

Hope this helps.

jnavrro (AWS)
answered 8 months ago
AWS EXPERT
reviewed 8 months ago
  • Yes, I know. As mentioned in the question: the data is already in CSV format, and is being read correctly in that format. The question is about compressed CSV files. (And secondarily a question about multiple files in an S3 prefix, as is typical for a sharded-storage scheme.)

  • Agreed that data input for Personalize must natively be in CSV format.

  • Personalize does not support compressed file formats today; only plain CSV is supported. Regarding the second point, if your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of your folder; it doesn't use any data in sub-folders. Use the following syntax with a / after the folder name: s3://<name of your S3 bucket>/<folder path>/. See https://docs.aws.amazon.com/personalize/latest/dg/bulk-data-import-step.html


Hi, the simplest way to work around this is to create a Lambda function that is automatically triggered each time a file is written to your bucket. If the new object is a compressed file, the function can decompress it for you.

See https://levelup.gitconnected.com/automating-zip-extraction-with-lambda-and-s3-9a083d4e8bab for an example
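Along those lines, a minimal Lambda sketch (not the code from the linked article) could look like this; it assumes objects named *.csv.gz, writes the decompressed CSV back to the same bucket next to the original, and omits the S3 trigger configuration:

```python
import gzip
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by s3:ObjectCreated notifications. Decompresses *.csv.gz
    objects and writes the plain CSV alongside them so Personalize can
    import it."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".csv.gz"):
            continue

        obj = s3.get_object(Bucket=bucket, Key=key)
        # Stream-decompress in memory (no /tmp usage); very large files
        # may need a different strategy.
        with gzip.GzipFile(fileobj=obj["Body"]) as gz:
            s3.upload_fileobj(gz, bucket, key[: -len(".gz")])
```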

Best,

Didier

AWS EXPERT
answered 8 months ago
  • Thanks for the idea! I'm first seeking to confirm that Personalize does in fact not handle compressed files. Decompressing our full initial training set is quite cumbersome (even with Lambda). Given that every other AWS 'data' product I've dealt with handles various forms of compression, this still feels like it should be possible.

    Also, rather than decompressing each file, I'd probably instead trigger a Lambda to decompress in-memory (i.e. streaming), roughly as sketched below, and use Personalize's PutEvents endpoint for incremental training. (Though I would still prefer to just import the GZ CSV files :-)
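    A rough sketch of that streaming approach (the bucket, key, and tracking ID below are hypothetical; the tracking ID would come from an event tracker created with CreateEventTracker):

    ```python
    import csv
    import gzip
    import io
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    events = boto3.client("personalize-events")

    # Hypothetical names -- replace with your own bucket/key and tracking ID.
    BUCKET = "my-bucket"
    KEY = "interactions/part-0000.csv.gz"
    TRACKING_ID = "my-event-tracker-tracking-id"

    obj = s3.get_object(Bucket=BUCKET, Key=KEY)
    # Decompress the S3 object as a stream; nothing is written to disk.
    with gzip.GzipFile(fileobj=obj["Body"]) as gz:
        reader = csv.DictReader(io.TextIOWrapper(gz, encoding="utf-8"))
        for row in reader:
            events.put_events(
                trackingId=TRACKING_ID,
                userId=row["USER_ID"],
                sessionId=row["USER_ID"],  # no session column, so reuse the user id
                eventList=[{  # PutEvents accepts up to 10 events per call
                    "eventType": row["EVENT_TYPE"],
                    "itemId": row["ITEM_ID"],
                    "sentAt": datetime.fromtimestamp(int(row["TIMESTAMP"]), tz=timezone.utc),
                }],
            )
    ```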
