Athena scans all data when querying for struct field

0

Hi,
Let's say I have an AWS Glue table with the following schema:

CREATE EXTERNAL TABLE `simulated_data`(
  `stream_src` string,
  `raw_data` string, 
  `my_struct` struct<date:string,time:string,value:float>)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://my-bucket/'
TBLPROPERTIES (
  'classification'='parquet', 
  'parquet.compress'='SNAPPY', 
  'transient_lastDdlTime'='1565760085')

The data is stored in S3 in multiple Apache Parquet files. The raw_data field contains JSON string with the same data (and structure) as my_struct struct. Example of data row is below:

{
    "stream_src": "src1",
    "raw_data": "{\"date\":\"\",\"value\":18.7,\"time\":\"5:30\"}",
    "my_struct": {
        "date": "",
        "time": "5:30",
        "value": 18.7
    }
}

I have a problem with querying the structure - Athena scans all data stored in the S3 bucket. The query that I use is:
Run time: 2.36 seconds, Data scanned: 1.88 MB (same size as select * from simulated_data )

SELECT my_struct.time
FROM simulated_data
WHERE stream_src='src1'

Surprisingly, the problem does not exist with JSON_extract
Run time: 2.02 seconds, Data scanned: 1.12 MB

SELECT json_extract(raw_data, '$.time')
FROM simulated_data
WHERE stream_src='src1'

What is the reason for that? Why reading field of structure leads to scanning the whole data set? Answering those questions probably will give me a light on why is JSON_extract faster than selecting data from the struct.

posta 5 anni fa3804 visualizzazioni
3 Risposte
0
Risposta accettata

There's some quirks with Presto (underlying engine) and nested types (array/map/struct). This is mentioned briefly here - https://docs.aws.amazon.com/athena/latest/ug/other-notable-limitations.html

"When you query columns with complex data types (array, map, struct), and are using Parquet for storing data, Athena currently reads an entire row of data, instead of selectively reading only the specified columns as expected. This is a known issue."

If you dig on the PrestoDB git repo (Athena is based on an older version of this afaik e.g. 0.176 or something like that) there's various issues about predicate pushdown/optimizations of reading nested fields e.g. this issue and the linked issues - https://github.com/prestodb/presto/issues/11326 - although I'm not sure on the current state of this.

I'm not sure if this is better for ORC possibly (maybe worth testing if you're willing to change file formats). Otherwise it's basically:

  1. Optimization makes its way into Presto
  2. Athena uses the version of Presto that has that optimization (or just backports it/implements it themselves)
con risposta 5 anni fa
0

Thank you for your answer. I will try with ORC.

Edit:
It seems to be working better with ORC. However, it is worth to notice, that ORC has other limitation - as long as stripe size is below 8MB (default for Presto), it will read the whole file anyway - doesn't matter whether query regarded struct or simple field.

Edited by: offroader on Aug 20, 2019 8:37 PM

con risposta 5 anni fa
0

Yeah, I ran into this as well, even with the example data sets that AWS uses, mentioned here - https://forums.aws.amazon.com/message.jspa?messageID=841338#841338 and here https://forums.aws.amazon.com/thread.jspa?messageID=846250

I think maybe there's some tool or version of tool (old version of Hive/Spark or something?) that must set 8MB as the default or something because I've seen this happen multiple times from other people even when the actual files they're generating are > 8MB e.g. the AWS example files I think were 200MB+ * ~28 files (file per year).

If I look at the ORC Java library for instance the default stripe size is 64MB - https://orc.apache.org/docs/hive-config.html

Edited by: rruppmgp on Aug 21, 2019 6:52 AM

con risposta 5 anni fa

Accesso non effettuato. Accedi per postare una risposta.

Una buona risposta soddisfa chiaramente la domanda, fornisce un feedback costruttivo e incoraggia la crescita professionale del richiedente.

Linee guida per rispondere alle domande