Personalize fails to read compressed CSV?

0

I have user-item interaction data in CSV files and I'm trying to import them into a Personalize dataset. When the CSV files are gzip (GZ) compressed, I get the error message:

Input csv is missing the following columns: [EVENT_TYPE, ITEM_ID, TIMESTAMP, USER_ID]

When I uncompress a test file, it loads fine. The files have the appropriate column headers; indeed, the uncompressed file is consumed quite happily.

Is it truly possible that Personalize cannot read compressed CSV files? This seems so unbelievable that I feel I must be doing something wrong. Uncompressing all of this plaintext data is incredibly time- and space-consuming. Am I missing some trick for compressed files? There's zero mention of this in the Personalize documentation. (Likewise, there's no mention of S3 prefixes to load multiple/sharded data files in a single job, either ... this portion of the Personalize documentation is quite thin, and trial & error seems to be the only way to discover these nuances.)
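For reference, this is roughly the import job I'm creating (the ARNs and bucket/key names below are placeholders); pointing dataLocation at the uncompressed .csv instead succeeds:

```python
import boto3

personalize = boto3.client("personalize")

# Placeholder ARNs and S3 key -- the real job points at a gzip-compressed CSV.
personalize.create_dataset_import_job(
    jobName="interactions-import-gz",
    datasetArn="arn:aws:personalize:us-east-1:111122223333:dataset/my-dataset-group/INTERACTIONS",
    dataSource={"dataLocation": "s3://my-bucket/interactions/part-00000.csv.gz"},
    roleArn="arn:aws:iam::111122223333:role/PersonalizeS3AccessRole",
)
```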

Murat
Asked 9 months ago · 263 views
2 answers
1

Hi,

As specified in the Data format guidelines page of the Personalize documentation, input data must be in a CSV file.

The first step in the Amazon Personalize workflow is to create a dataset group. Upon creating a dataset group, if you want to import data from multiple data sources into an Amazon Personalize dataset, you can use Amazon SageMaker Data Wrangler. Data Wrangler is a feature of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, and analyze data. See bulk data imports in the documentation.

If your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of the folder; it doesn't use any data in sub-folders. Use the following syntax, with a / after the folder name: s3://<name of your S3 bucket>/<folder path>/. See Importing bulk records with a dataset import job.
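As a minimal sketch of such a folder-level import with boto3 (the ARNs are placeholders):

```python
import boto3

personalize = boto3.client("personalize")

# The trailing slash makes Personalize import every CSV in the first level
# of the folder; files in sub-folders are ignored.
personalize.create_dataset_import_job(
    jobName="interactions-bulk-import",
    datasetArn="arn:aws:personalize:us-east-1:111122223333:dataset/my-dataset-group/INTERACTIONS",
    dataSource={"dataLocation": "s3://<name of your S3 bucket>/<folder path>/"},
    roleArn="arn:aws:iam::111122223333:role/PersonalizeS3AccessRole",
)
```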

Lastly, you have three ways to update your datasets in Personalize; see this blog post for a comprehensive explanation.

Hope this helps.

jnavrro (AWS)
Answered 9 months ago
Reviewed 9 months ago (AWS Expert)
  • Yes, I know. As mentioned in the question: the data is already in CSV format, and is being read correctly in that format. The question is about compressed CSV files. (And secondarily a question about multiple files in an S3 prefix, as is typical for a sharded-storage scheme.)

  • Agreed that input data for Personalize must natively be in CSV format.

  • Personalize does not support compressed files today; only uncompressed CSV is supported. Regarding the second point, if your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of the folder; it doesn't use any data in sub-folders. Use the following syntax, with a / after the folder name: s3://<name of your S3 bucket>/<folder path>/. See https://docs.aws.amazon.com/personalize/latest/dg/bulk-data-import-step.html

0

Hi, the simplest way to work around this is to create a Lambda function that is automatically triggered each time a file is written to your bucket. If the file is compressed, the function will automatically decompress it for you.

See https://levelup.gitconnected.com/automating-zip-extraction-with-lambda-and-s3-9a083d4e8bab for an example
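A minimal sketch of such a function, assuming gzip-compressed CSVs and that each decompressed file fits in Lambda memory (bucket layout and the .csv.gz suffix convention are assumptions):

```python
import gzip
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by s3:ObjectCreated; writes an uncompressed copy of each .csv.gz."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(".csv.gz"):
            continue
        compressed = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Whole-object decompression; stream instead if the files are very large.
        s3.put_object(
            Bucket=bucket,
            Key=key[: -len(".gz")],
            Body=gzip.decompress(compressed),
        )
```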

Best,

Didier

AWS Expert
Answered 9 months ago
  • Thanks for the idea! I'm first seeking to confirm that in fact Personalize does not handle compressed files. Decompressing our full initial training set is quite cumbersome (even with Lambda). Given that every other AWS 'data' product with which I've dealt handles various forms of compression, this still feels like it should be possible.

    Also, rather than decompressing each file, I'd probably trigger a Lambda to decompress in memory (i.e., streaming) and use Personalize's PutEvents endpoint for incremental training, roughly as in the sketch below. (Though I'd still prefer to just import the GZ CSV files :-)
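A rough sketch of that streaming approach, assuming an existing event tracker (the tracking ID is a placeholder), epoch-second timestamps, and no batching or retry handling:

```python
import csv
import gzip
import io
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
events = boto3.client("personalize-events")

TRACKING_ID = "<event tracker tracking id>"  # placeholder: requires an existing event tracker

def handler(event, context):
    """Stream a gzip-compressed interactions CSV from S3 straight into PutEvents."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        rows = csv.DictReader(io.TextIOWrapper(gzip.GzipFile(fileobj=body), encoding="utf-8"))
        for row in rows:  # expects USER_ID, ITEM_ID, EVENT_TYPE, TIMESTAMP headers
            events.put_events(
                trackingId=TRACKING_ID,
                userId=row["USER_ID"],
                sessionId=row["USER_ID"],  # assumption: no real session ids in the data
                eventList=[{
                    "eventType": row["EVENT_TYPE"],
                    "itemId": row["ITEM_ID"],
                    "sentAt": datetime.fromtimestamp(int(row["TIMESTAMP"]), tz=timezone.utc),
                }],
            )
```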
