Clarification on S3 Select with Parquet - indexing, range offsets, and pricing

  1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

  2. For Parquet files, if you specify a scan range, what does that mean? Is it going to scan only the entire row group in that range? Only the column? Is the range a byte offset into the Parquet object, and if so, is it based on the compressed size? I.e., if I want to select from a column in the second row group, would my start offset be the byte size of the compressed first row group + 1?

  3. S3 Select charges you based on the data scanned. If you're using Parquet and you index into a row group to scan a column, will you be charged only for the column scan? Based on the decompressed or the compressed size?

Asked 2 years ago, 1190 views
1 Answer

Thank you for reaching out on AWS re:Post. Please find my answers below.

Q1. Is there any indexing done on the data on S3's side? Or do they basically just execute the SQL in memory on the object via a scan?

Answer - To understand the specific requirements behind this question, please open an S3 support case with AWS Support from your account so that the team can look into it in more depth and provide the relevant details.

Q2. For Parquet files, if you specify a scan range, what does that mean? Is it going to scan only the entire row group in that range? Only the column? Is the range a byte offset into the Parquet object, and if so, is it based on the compressed size? I.e., if I want to select from a column in the second row group, would my start offset be the byte size of the compressed first row group + 1?

Answer - For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the requested scan range are processed. This is explained in the documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
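To make the row group rule above concrete, here is a minimal sketch, not an official example. The helper names `row_groups_in_scan_range` and `build_select_request` are hypothetical. It assumes the scan range is a byte range into the object as stored, that `End` is inclusive, and that you already know the byte offset at which each row group starts (for instance from the Parquet footer metadata). What S3 treats as a row group's exact start offset is not spelled out in the answer, so treat the offsets here as illustrative.

```python
from typing import Any, Dict, List, Optional


def row_groups_in_scan_range(row_group_starts: List[int], start: int, end: int) -> List[int]:
    """Return the indexes of row groups whose first byte lies in [start, end].

    Mirrors the documented rule: a row group is processed when it *starts*
    within the scan range, even if it extends past the end of the range.
    Assumption: `end` is treated as inclusive.
    """
    return [i for i, offset in enumerate(row_group_starts) if start <= offset <= end]


def build_select_request(
    bucket: str,
    key: str,
    expression: str,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's S3 client select_object_content call."""
    params: Dict[str, Any] = {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"JSON": {}},
    }
    # ScanRange is optional; include only the bounds that were supplied.
    scan_range: Dict[str, int] = {}
    if start is not None:
        scan_range["Start"] = start
    if end is not None:
        scan_range["End"] = end
    if scan_range:
        params["ScanRange"] = scan_range
    return params
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as `boto3.client("s3").select_object_content(**params)`. The bucket, key, and offsets you supply are your own; nothing here is taken from the original question.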

Q3. Select charges you based on the data scanned - if you're using parquet and you index into a row group to scan a column, is that only going to charge for the column scan? Decompressed or compressed?

Answer - S3 Select scans the minimum amount of data required to execute the query. If the data stored on S3 is already compressed, billing is based on the compressed size.
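One way to check how much data a given query actually touched is to read the `Stats` event that S3 Select includes in its response event stream. The sketch below uses a hypothetical helper, `collect_select_result`, and assumes the events have the shape boto3 returns in `response["Payload"]`. It only shows where the byte counts come from; how they map onto your bill should be confirmed against the S3 pricing page.

```python
from typing import Any, Dict, Iterable, Tuple


def collect_select_result(events: Iterable[Dict[str, Any]]) -> Tuple[bytes, Dict[str, int]]:
    """Gather record bytes and the Stats details from a select event stream.

    Each event is a dict with a single key such as "Records", "Stats",
    "Progress", "Cont", or "End"; only the first two are used here.
    """
    records = bytearray()
    stats: Dict[str, int] = {}
    for event in events:
        if "Records" in event:
            # Record payloads arrive in chunks; concatenate them in order.
            records.extend(event["Records"]["Payload"])
        elif "Stats" in event:
            # Details carries BytesScanned, BytesProcessed, and BytesReturned.
            stats = dict(event["Stats"]["Details"])
    return bytes(records), stats
```

Usage would look like `records, stats = collect_select_result(response["Payload"])`, after which `stats["BytesScanned"]` and `stats["BytesProcessed"]` can be compared across queries with different scan ranges or column selections.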

AWS
Support Engineer
Sathya
Answered 2 years ago
