There are some quirks with Presto (the underlying engine) and nested types (array/map/struct). This is mentioned briefly here - https://docs.aws.amazon.com/athena/latest/ug/other-notable-limitations.html
"When you query columns with complex data types (array, map, struct), and are using Parquet for storing data, Athena currently reads an entire row of data, instead of selectively reading only the specified columns as expected. This is a known issue."
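To make the quoted limitation concrete, here is a toy model in Python of how much data gets read with and without column pruning. This is only a sketch: the column names and sizes are made up for illustration, and `bytes_scanned` is a hypothetical helper, not anything from Athena or Presto.

```python
def bytes_scanned(column_sizes, selected, pruning_works):
    """Toy estimate of data read for a query that selects some columns.

    column_sizes: dict of column name -> stored size (any unit).
    selected: columns the query actually references.
    pruning_works: False models the known issue quoted above, where the
    entire row is read whenever a complex-typed column is involved.
    """
    if pruning_works:
        # Columnar read: only the referenced columns are fetched.
        return sum(column_sizes[name] for name in selected)
    # Known issue: every column of the row is fetched.
    return sum(column_sizes.values())


# Hypothetical table: two simple columns and one large struct column (sizes in MB).
column_sizes = {"id": 10, "ts": 20, "payload": 500}

simple_query = bytes_scanned(column_sizes, ["id"], pruning_works=True)
struct_query = bytes_scanned(column_sizes, ["payload"], pruning_works=False)
```

In this made-up table, a query on `id` alone reads 10 MB, while a query that touches the struct reads all 530 MB, even if it only needs one field inside `payload`.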
If you dig through the PrestoDB git repo (afaik Athena is based on an older version of it, e.g. 0.176 or something like that), there are various issues about predicate pushdown and optimizing reads of nested fields, e.g. this issue and the issues linked from it - https://github.com/prestodb/presto/issues/11326 - although I'm not sure of the current state of this.
I'm not sure whether ORC handles this any better (it may be worth testing if you're willing to change file formats). Otherwise you're basically waiting for two things:
- Optimization makes its way into Presto
- Athena uses the version of Presto that has that optimization (or just backports it/implements it themselves)
Thank you for your answer. I will try with ORC.
Edit:
It seems to work better with ORC. However, it is worth noting that ORC has another limitation: as long as the stripe size is below 8MB (default for Presto), it will read the whole file anyway, regardless of whether the query touches a struct or a simple field.
Edited by: offroader on Aug 20, 2019 8:37 PM
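The stripe-size behaviour described above can be sketched as a simple check. This is a rough model in Python, assuming the 8MB figure from the post and assuming the whole-file read kicks in only when every stripe is below it; `reads_whole_file` is a hypothetical helper and not part of Presto or Athena.

```python
MB = 1024 * 1024

# 8MB figure taken from the post above; treated here as an assumption.
SMALL_STRIPE_LIMIT = 8 * MB


def reads_whole_file(stripe_sizes, limit=SMALL_STRIPE_LIMIT):
    """Return True if, under this model, the reader fetches the whole file.

    stripe_sizes: list of stripe sizes in bytes for one ORC file.
    Assumption: the whole file is read only when all stripes are
    smaller than the limit.
    """
    if not stripe_sizes:
        return False  # no stripes, nothing to read
    return all(size < limit for size in stripe_sizes)


small_stripes = reads_whole_file([4 * MB, 6 * MB, 5 * MB])
large_stripes = reads_whole_file([64 * MB, 64 * MB, 64 * MB])
```

Under these assumptions, a file written with small stripes is read in full whatever columns the query selects, while a file with 64MB stripes is not, so the stripe size used by the writer matters as much as the file format.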
Yeah, I ran into this as well, even with the example data sets that AWS uses, mentioned here - https://forums.aws.amazon.com/message.jspa?messageID=841338#841338 and here https://forums.aws.amazon.com/thread.jspa?messageID=846250
I think there must be some tool or tool version (an old version of Hive/Spark or something?) that sets 8MB as the default, because I've seen this happen to other people multiple times, even when the files they're generating are larger than 8MB. For example, I think the AWS example files were 200MB+ each across ~28 files (one file per year).
If I look at the ORC Java library, for instance, the default stripe size is 64MB - https://orc.apache.org/docs/hive-config.html
Edited by: rruppmgp on Aug 21, 2019 6:52 AM