Athena Partition Projection and Column Stats
How do column stats work with Athena partition projection on a partitioned table? I'm assuming they don't: partition projection avoids any lookup in Glue, so stats for the partition columns are not fetched either. Or am I wrong, and Athena still tries to use table-level (as opposed to partition-level) column statistics, in which case it's still worth periodically analyzing the partitioned table to get some table stats?
Hello,
When you enable partition projection on a table, Athena ignores any partition metadata in the AWS Glue Data Catalog or external Hive metastore for that table. However, with a DESCRIBE query you can still list the columns, including partition columns, of the named table, and examine the attributes of a complex column. You can also list table properties using a SHOW TBLPROPERTIES query.
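As a rough illustration of checking table properties programmatically, the sketch below reads the table's parameters from the Glue Data Catalog and tests the `projection.enabled` property. The helper names `projection_enabled` and `fetch_table_parameters` are made up for this example, and it assumes the catalog parameters correspond to what SHOW TBLPROPERTIES would list. The client is passed in so the logic can be tried without AWS access.

```python
def projection_enabled(table_parameters):
    """Return True if the table properties switch on partition projection."""
    # Hypothetical helper: treats anything other than "true" as disabled.
    value = table_parameters.get("projection.enabled", "false")
    return str(value).strip().lower() == "true"


def fetch_table_parameters(glue_client, database, table):
    """Read the table's properties from the Glue Data Catalog.

    glue_client is expected to behave like boto3.client("glue").
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    # "Parameters" may be absent when the table has no properties set.
    return response["Table"].get("Parameters", {})
```

With real credentials this could be called as `projection_enabled(fetch_table_parameters(boto3.client("glue"), "my_database", "my_table"))`, where the database and table names are placeholders.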
REFERENCES:
https://docs.aws.amazon.com/athena/latest/ug/describe-table.html
https://docs.aws.amazon.com/athena/latest/ug/show-tblproperties.html
Not quite the question I'm asking. Table and partition stats as returned by:
https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForPartition.html
https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html
I'm assuming that data is used for cost-based optimization within Athena. If partition projection skips the call to the Glue Catalog and hence does not collect column stats, how does this impact Athena's cost-based optimization for query planning? Would query planning then lack this information and produce less efficient join queries? Without that information the optimizer could not do as good a job of automatic join reordering, which would hurt performance.
So, does partition projection prevent Athena from accessing column stats, and therefore have a negative impact on cost-based optimization and join performance?
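For anyone wanting to see what statistics the catalog actually holds, here is a minimal sketch that calls the two Glue APIs linked above through a boto3-style client. The function names and the returned dictionary shape are choices made for this example. It does not settle whether Athena consults these statistics when partition projection is enabled; it only shows what is stored.

```python
def table_column_stats(glue_client, database, table, columns):
    """Return {column name: StatisticsData} for table-level column statistics."""
    response = glue_client.get_column_statistics_for_table(
        DatabaseName=database,
        TableName=table,
        ColumnNames=list(columns),
    )
    # Columns without stored statistics are simply missing from the list.
    return {
        entry["ColumnName"]: entry["StatisticsData"]
        for entry in response.get("ColumnStatisticsList", [])
    }


def partition_column_stats(glue_client, database, table, partition_values, columns):
    """Return {column name: StatisticsData} for one partition's column statistics."""
    response = glue_client.get_column_statistics_for_partition(
        DatabaseName=database,
        TableName=table,
        PartitionValues=list(partition_values),
        ColumnNames=list(columns),
    )
    return {
        entry["ColumnName"]: entry["StatisticsData"]
        for entry in response.get("ColumnStatisticsList", [])
    }
```

An empty result for a column means no statistics are stored for it at that level, so there would be nothing for an optimizer to read regardless of how the partitions are resolved.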