Clarification on S3 Select with Parquet - indexing, range offsets, and pricing

  1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

  2. For Parquet files, if you specify a range, what does that mean? Is it going to scan only the entire row group in that range? The column? Is the range a byte offset into the Parquet object, and if so, is it the compressed size? I.e., if I want to select from a column in the second row group, would my index be the byte size of the compressed first row group + 1?

  3. Select charges you based on the data scanned. If you're using Parquet and you index into a row group to scan a column, will you be charged only for the column scan? Decompressed or compressed?

asked 2 years ago · 1190 views
1 Answer

Thank you for reaching out on AWS re:Post. Please find my answers below.

Q1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

Answer - To understand the specific requirements behind this question and provide the relevant details, please open an S3 support case with AWS Support from your account so that we can dig deeper.

Q2. For Parquet files, if you specify a range, what does that mean? Is it going to scan only the entire row group in that range? The column? Is the range a byte offset into the Parquet object, and if so, is it the compressed size? I.e., if I want to select from a column in the second row group, would my index be the byte size of the compressed first row group + 1?

Answer - For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the requested scan range are processed. This is explained in the documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
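To illustrate the "row groups that start within the scan range" rule, here is a minimal Python sketch. It is not an official AWS sample. The helper names (`row_groups_in_scan_range`, `build_select_request`), the bucket, key, and row group offsets are all hypothetical, and the sketch assumes that the range bounds are byte offsets into the object as stored and that both bounds are inclusive. The request dictionary uses the parameter names of the boto3 `select_object_content` call.

```python
from typing import Dict, List, Sequence


def row_groups_in_scan_range(
    row_group_starts: Sequence[int], start: int, end: int
) -> List[int]:
    """Return the indices of row groups whose first byte lies in [start, end].

    Assumption: both bounds are inclusive byte offsets into the stored object.
    """
    return [
        index
        for index, offset in enumerate(row_group_starts)
        if start <= offset <= end
    ]


def build_select_request(
    bucket: str, key: str, expression: str, start: int, end: int
) -> Dict:
    """Build keyword arguments for boto3's select_object_content (no network call)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"CSV": {}},
        "ScanRange": {"Start": start, "End": end},
    }


# Hypothetical start offsets of three row groups, as they might be read
# from the Parquet footer metadata.
starts = [4, 1_000_004, 2_000_004]

# A range that covers only the first byte of the second row group is enough
# to select that whole row group under the rule described above.
selected = row_groups_in_scan_range(starts, 1_000_004, 1_000_004)

request = build_select_request(
    bucket="example-bucket",          # hypothetical bucket name
    key="data/example.parquet",       # hypothetical object key
    expression="SELECT s.col_a FROM S3Object s",  # hypothetical column
    start=1_000_004,
    end=1_000_004,
)
```

With a boto3 S3 client, the dictionary could then be passed as `client.select_object_content(**request)`. Under these assumptions, the range does not need to span the row group; it only needs to contain the byte at which the row group starts.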

Q3. Select charges you based on the data scanned. If you're using Parquet and you index into a row group to scan a column, will you be charged only for the column scan? Decompressed or compressed?

Answer - S3 Select scans only the minimum data required to execute the query. If the data stored in S3 is already compressed, billing is based on the compressed size.
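As a rough way to reason about the charge, the arithmetic can be sketched as follows. This is only an illustration: the function name `estimate_select_cost` is hypothetical, the per-GB rates are placeholders that must be taken from the current S3 pricing page for your Region, and the sketch assumes decimal gigabytes and that the scanned size is the compressed size of the column data the query reads.

```python
GB = 10**9  # assumption: decimal gigabytes


def estimate_select_cost(
    scanned_bytes: int,
    returned_bytes: int,
    price_per_gb_scanned: float,
    price_per_gb_returned: float,
) -> float:
    """Estimate an S3 Select charge from scanned and returned byte counts.

    The rates are caller-supplied placeholders, not real AWS prices.
    """
    scanned_cost = scanned_bytes / GB * price_per_gb_scanned
    returned_cost = returned_bytes / GB * price_per_gb_returned
    return scanned_cost + returned_cost


# Hypothetical example: one compressed column chunk of 2 GB is scanned and
# 1 GB of results is returned, with made-up rates of 0.5 and 0.25 per GB.
example_cost = estimate_select_cost(
    scanned_bytes=2 * GB,
    returned_bytes=1 * GB,
    price_per_gb_scanned=0.5,
    price_per_gb_returned=0.25,
)
```

The point of the sketch is that, per the answer above, `scanned_bytes` would be the compressed size of what the query needs to read, not the decompressed size of the whole object.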

AWS
SUPPORT ENGINEER
Sathya
answered 2 years ago
