Clarification on S3 Select with Parquet - indexing, range offsets, and pricing

  1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

  2. For Parquet files, if you specify a range, what does that mean? Is it going to scan only the entire row group in that range? The column? Is the range a byte offset into the Parquet object, and if so, is it based on the compressed size? I.e., if I want to select from a column in the second row group, would my start offset be the byte size of the compressed first row group + 1?

  3. Select charges you based on the data scanned. If you're using Parquet and you index into a row group to scan a column, are you charged only for the column scan? Is that the decompressed or the compressed size?

asked 2 years ago · 1190 views
1 answer

Thank you for reaching out on AWS re:Post. Please find my answers below.

Q1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

Answer - To understand the specific requirements behind this question, please raise an S3 support case with AWS Support from your account so that we can dive deeper and provide the relevant details.

Q2. For Parquet files, if you specify a range, what does that mean? Is it going to scan only the entire row group in that range? The column? Is the range a byte offset into the Parquet object, and if so, is it based on the compressed size? I.e., if I want to select from a column in the second row group, would my start offset be the byte size of the compressed first row group + 1?

Answer - For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the requested scan range are processed. This is explained in the documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
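
As a rough illustration of the Parquet rule above, here is a minimal Python sketch. The helper names (`row_groups_in_scan_range`, `build_select_request`), the bucket, the key, and the row group offsets are all hypothetical; in practice the starting byte offset of each row group would be read from the Parquet footer metadata. `ScanRange` with `Start` and `End` is the byte-range parameter accepted by boto3's `select_object_content`. Treating the range as half-open is an assumption made here for simplicity, so check the documentation for the exact boundary behaviour.

```python
from typing import Dict, List


def row_groups_in_scan_range(row_group_starts: List[int], start: int, end: int) -> List[int]:
    """Return the indexes of row groups whose first byte lies in [start, end).

    Assumption: the range is treated as half-open for this illustration.
    """
    return [i for i, offset in enumerate(row_group_starts) if start <= offset < end]


def build_select_request(bucket: str, key: str, sql: str, start: int, end: int) -> Dict:
    """Build keyword arguments for boto3's s3.select_object_content (no network call)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"JSON": {}},
        # Byte offsets into the stored object.
        "ScanRange": {"Start": start, "End": end},
    }


# Hypothetical layout: starting byte offset of each row group in the stored file.
ROW_GROUP_STARTS = [4, 1_000_004, 2_000_004]

# A one-byte range covering only the start of the second row group selects that group.
selected = row_groups_in_scan_range(ROW_GROUP_STARTS, 1_000_004, 1_000_005)

request = build_select_request(
    bucket="example-bucket",        # placeholder
    key="data/example.parquet",     # placeholder
    sql="SELECT s.some_column FROM S3Object s",  # placeholder column name
    start=1_000_004,
    end=1_000_005,
)
# To send the request (requires AWS credentials and boto3):
#   import boto3
#   response = boto3.client("s3").select_object_content(**request)
```

In this sketch the scan range only has to contain the first byte of a row group for that whole row group to be processed, which is why a very narrow range still selects the second row group.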

Q3. Select charges you based on the data scanned. If you're using Parquet and you index into a row group to scan a column, are you charged only for the column scan? Is that the decompressed or the compressed size?

Answer - S3 Select scans the minimum amount of data required to execute the query. If the data on S3 is already compressed, billing is based on the compressed size.

AWS
SUPPORT ENGINEER
Sathya
answered 2 years ago
