Skip to content

Partition projection on existing table is significantly slower than a new table with identical configuration

0

I'm experiencing a significant performance difference between two Athena tables that have identical partition projection configurations but were created differently.

Setup

I have two tables pointing to the same S3 location (s3://my-bucket/impressions/), both using Parquet format with minute-level partitions (year/month/day/hour/minute).

  • Table A (impressions): Originally created with traditional Glue Catalog partitions. Partition projection was later enabled via ALTER TABLE SET TBLPROPERTIES.
  • Table B (impressions_pp_test): Created from scratch with partition projection enabled from the beginning. No legacy partition metadata in Glue Catalog.

Both tables have identical projection configuration:

  • 'projection.enabled'='true',
  • 'projection.year.type'='integer',
  • 'projection.year.range'='2024,2030',
  • 'projection.month.type'='integer',
  • 'projection.month.range'='1,12',
  • 'projection.month.digits'='2',
  • ... (same for day, hour, minute)
  • 'storage.location.template'='s3://my-bucket/impressions/year=${year}/month=${month}/day=${day}/hour=${hour}/minute=${minute}'

Problem

Running the same query on both tables produces very different planning times:

TablePlanning TimeExecution TimeData Scanned
Table A (impressions)~60-85 seconds<1 second0.04 KB
Table B (impressions_pp_test)~15 seconds<1 second0.04 KB

The query like:

SELECT count(*) FROM <table>
  WHERE year = 2026 AND month = 03 AND day = 09
  AND hour = 02 AND minute BETWEEN 40 AND 44;

Question

According to the documentation:

"When partition projection is enabled, Athena ignores any partition metadata registered to the table in the AWS Glue Data Catalog or Hive metastore."

If Athena completely ignores Glue Catalog partition metadata when projection is enabled, why is Table A (with legacy partitions in the catalog) significantly slower in planning than Table B (no legacy partitions)?

Is there any additional overhead from legacy catalog metadata even when partition projection is active? Or is there another factor I'm missing?

asked 2 months ago80 views
2 Answers
1

The performance difference you're experiencing is likely due to Athena still needing to interact with the AWS Glue Data Catalog during the query planning phase, even when partition projection is enabled. While partition projection allows Athena to calculate partition locations without retrieving partition metadata, the table itself and its properties are still stored in and retrieved from the Glue Catalog.

When you have a table with a large number of pre-existing partitions registered in the Glue Catalog (Table A), there may be additional overhead during the initial catalog lookup and table metadata retrieval, even though Athena won't use those specific partition entries for the query. The catalog still contains all that legacy partition metadata, which could impact the speed of catalog operations.

In contrast, Table B was created fresh with partition projection from the start and never accumulated partition metadata in the catalog, resulting in a cleaner, more efficient catalog lookup.

To potentially improve Table A's performance, you could consider:

  1. Dropping the legacy partition metadata from the Glue Catalog while keeping partition projection enabled. Since partition projection calculates partitions dynamically, you don't need the old partition entries.

  2. Recreating the table with partition projection enabled from the start (similar to Table B), which would give you a clean slate without legacy metadata.

  3. Verifying your partition projection configuration ensures you're including predicates on all partition keys in your queries and that the ranges specified are reasonable, as unnecessarily broad ranges can cause performance issues.

The key insight is that while partition projection bypasses partition metadata retrieval for query execution, the overall table metadata and catalog interaction can still be affected by the presence of large amounts of legacy partition data in the Glue Catalog.
Sources
Optimize data - Amazon Athena
Optimize queries with AWS Glue partition indexing and filtering - Amazon Athena

answered 2 months ago
EXPERT
reviewed 2 months ago
0

Update: I found an additional factor. The two tables had different projection.year.range settings:

  • Fast table (impressions_pp_test): 'projection.year.range'='2026,2027' (2 years)
  • Slow table (impressions_pp): 'projection.year.range'='2024,2060' (37 years)

After narrowing the slow table's range to '2024,2027', planning time dropped from ~76 seconds to ~9 seconds — even though the query already had WHERE year = 2026.

This suggests that partition projection computes the full projection space from the range configuration before applying WHERE clause pruning. The wider the range, the more combinations are generated upfront (year × month × day × hour × minute), which significantly impacts planning time for minute-level partitioned tables.

I couldn't find any documentation mentioning this behavior. Is this expected? Are there any best practices for setting projection ranges on tables with high-granularity partitions?

answered 2 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.