Clarification on S3 Select with Parquet - indexing, range offsets, and pricing

  1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

  2. For Parquet files, if you specify a scan range, what does that mean? Does it scan only the entire row group in that range, or only the column? Is the range a byte offset into the Parquet object, and if so, is it based on the compressed size? I.e., if I want to select from a column in the second row group, would my start offset be the byte size of the preceding compressed row group + 1?

  3. S3 Select charges you based on the data scanned. If you're using Parquet and you index into a row group to scan a column, are you charged only for the column scan? Is that measured on the decompressed or the compressed size?

Asked 2 years ago · Viewed 1190 times
1 Answer

Thank you for reaching out on AWS re:Post. Please find my answers below.

Q1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

Answer - To understand the specific requirements behind this question, please open an S3 support case with AWS Support from your account so that the team can dig deeper and provide you with the relevant details.

Q2. For Parquet files, if you specify a scan range, what does that mean? Does it scan only the entire row group in that range, or only the column? Is the range a byte offset into the Parquet object, and if so, is it based on the compressed size? I.e., if I want to select from a column in the second row group, would my start offset be the byte size of the preceding compressed row group + 1?

Answer - For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the requested scan range are processed. This is explained in the documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
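The "starts within the scan range" rule can be sketched in Python as follows. This is an illustration only, not S3's implementation: `row_group_starts` is a hypothetical list of byte offsets that you would have to read from the Parquet footer metadata yourself, both helper names are made up for this example, and treating `End` as inclusive is an assumption. The second helper only assembles keyword arguments in the shape that boto3's `select_object_content` accepts; it makes no AWS call.

```python
from typing import Dict, List


def row_groups_in_scan_range(row_group_starts: List[int], start: int, end: int) -> List[int]:
    """Return the indexes of row groups whose first byte lies in [start, end].

    Assumption: `end` is treated as inclusive.
    """
    return [i for i, offset in enumerate(row_group_starts) if start <= offset <= end]


def build_select_kwargs(bucket: str, key: str, expression: str, start: int, end: int) -> Dict:
    """Assemble keyword arguments for boto3's select_object_content (no call is made)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Expression": expression,
        "ExpressionType": "SQL",
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"JSON": {}},
        "ScanRange": {"Start": start, "End": end},
    }


# Hypothetical layout: three row groups starting at these byte offsets.
starts = [4, 1_000_000, 2_000_000]

# A range covering only the start of the second row group selects index 1.
selected = row_groups_in_scan_range(starts, 1_000_000, 1_999_999)

# Bucket, key, and column names below are placeholders.
kwargs = build_select_kwargs(
    "example-bucket",
    "data/example.parquet",
    "SELECT s.col_a FROM S3Object s",
    1_000_000,
    1_999_999,
)
```

With boto3 installed and credentials configured, the request could then be sent with `boto3.client("s3").select_object_content(**kwargs)`. Under this reading of the rule, a range that begins partway through a row group does not select that row group, because the row group's start lies outside the range.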

Q3. S3 Select charges you based on the data scanned. If you're using Parquet and you index into a row group to scan a column, are you charged only for the column scan? Is that measured on the decompressed or the compressed size?

Answer - S3 Select scans the minimum amount of data required to execute the query. If the data in S3 is already compressed, billing is based on the compressed size.
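One way to check what a particular query actually scanned is to read the `Stats` event that S3 Select emits in its response stream. The sketch below assumes the event shape returned by boto3's `select_object_content` (a `Stats` event whose `Details` carry `BytesScanned`, `BytesProcessed`, and `BytesReturned`). The helper name `summarize_select_events` and the sample events are made up for illustration, and no pricing is computed.

```python
from typing import Dict, Iterable, Tuple


def summarize_select_events(events: Iterable[Dict]) -> Tuple[bytes, Dict[str, int]]:
    """Collect record bytes and the byte counters from an S3 Select event stream."""
    records = bytearray()
    stats: Dict[str, int] = {}
    for event in events:
        if "Records" in event:
            # Each Records event carries a chunk of the query output.
            records.extend(event["Records"]["Payload"])
        elif "Stats" in event:
            # The Stats event reports the counters for the whole request.
            stats = dict(event["Stats"]["Details"])
    return bytes(records), stats


# Made-up events in the assumed shape; the numbers are not real measurements.
sample_events = [
    {"Records": {"Payload": b'{"col_a":1}\n'}},
    {"Records": {"Payload": b'{"col_a":2}\n'}},
    {"Stats": {"Details": {"BytesScanned": 2048, "BytesProcessed": 8192, "BytesReturned": 24}}},
    {"End": {}},
]

payload, stats = summarize_select_events(sample_events)
```

Against a real response, the same helper would be called as `summarize_select_events(response["Payload"])`. Comparing `BytesScanned` with the object size for a single-column query on your own data is a practical way to see how much of the object was read.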

AWS
Support Engineer
Sathya
Answered 2 years ago
