Clarification on S3 Select with Parquet - indexing, range offsets, and pricing

Question

1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

2. For Parquet files if you specify a range, what does that mean? Is it going to only scan the entire row group in that range? The column? Is the range a byte offset into the parquet object, and if so, is it the compressed size? ie: If I want to select from a column in the second rowgroup, I'd have my index by the byte size of the compressed row group + 1 ?

3. Select charges you based on the data scanned - if you're using parquet and you index into a row group to scan a column, is that only going to charge for the column scan? Decompressed or compressed?

Answer

Thank you for reaching AWS repost, Please find my answers as below,

Q1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

Answer - To understand the specific requirements driving this question, I request you to raise an S3 support case with AWS Support from your account to deep dive further and provide you relevant details.

Q2.For Parquet files if you specify a range, what does that mean? Is it going to only scan the entire row group in that range? The column? Is the range a byte offset into the parquet object, and if so, is it the compressed size? ie: If I want to select from a column in the second rowgroup, I'd have my index by the byte size of the compressed row group + 1 ?

Answer - For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the scan range requested are processed. This is explained in documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html

Q3. Select charges you based on the data scanned - if you're using parquet and you index into a row group to scan a column, is that only going to charge for the column scan? Decompressed or compressed?

Answer - Select will scan minimum data required to execute the query. The billing will be done using compressed size if data on S3 is already compressed.

Clarification on S3 Select with Parquet - indexing, range offsets, and pricing

Contenido relevante