Dynamic(?) Partition Projection issue between Athena and Glue

Hi folks,

I have a partitioned table in Athena that uses dynamic partition projection, enabled with the following table properties:

projection.enabled                    true
projection.account.type               injected
projection.region.type                injected
projection.source.type                injected
projection.calendarday.type           date
projection.calendarday.format         yyyy/MM/dd
projection.calendarday.interval       1
projection.calendarday.interval.unit  DAYS
projection.calendarday.range          2020/01/01,NOW
storage.location.template             s3://mybucket/${account}/${source}/${region}/${calendarday}/
classification                        json
EXTERNAL                              TRUE

However, I don't see any partition records under AWS Glue -> Table details -> Partitions tab.

Despite this, I'm able to query data from the table in Athena. For example:

SELECT *
FROM "raw"."cloudtrail_logs"
WHERE account = 'xxxxxx'
  AND source = 'CloudTrail'
  AND region = 'us-west-2'
  AND parse_datetime(calendarday, 'yyyy/MM/dd') = DATE('2024-08-15')
  -- AND calendarday = '2024/08/01' (this also works)
  AND eventName = 'GetDashboard'
  AND eventSource = 'quicksight.amazonaws.com'
LIMIT 10

The first four WHERE conditions correspond to the partition keys: account, source, region, and calendarday.

What confuses me is that the actual S3 path is structured as s3://mybucket/account/source/region/2024/08/01, where 2024 is the year, 08 the month, and 01 the day. Athena seems to recognize the nested date folders correctly and concatenate them into the calendarday value.

However, when I try to build a Glue script to read data from the same database and table:

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.dynamicframe import DynamicFrame
import sys

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)

params = {
    'raw_database': 'raw',  
    'raw_table': 'cloudtrail_logs'    
}

datasource_dynamic = glueContext.create_dynamic_frame.from_catalog(
    database=params['raw_database'],
    table_name=params['raw_table'],
    push_down_predicate="account='xxxxxxx' AND source='CloudTrail' AND region='us-west-2' AND calendarday='2024/08/01'",
    transformation_ctx="datasource_dynamic"
)

total_records = datasource_dynamic.count()
print(f"Total records: {total_records}")

I receive the following error:

An error occurred while calling o100.getDynamicFrame. User's pushdown predicate: account='XXXXX' AND source='CloudTrail' AND region='us-west-2' AND calendarday='2024/08/01' can not be resolved against partition columns: []

I've tried specifying both calendarday='2024/08/01' and calendarday='2024-08-01' in my script, and I also ran MSCK REPAIR TABLE in Athena, but neither approach works. Note that the table metadata in Glue was created manually, not generated by a crawler.

I did try manually moving the data into a Hive-style path such as s3://mybucket/account='XXXXX'/source='CloudTrail'/region='us-west-2'/calendarday='2024-08-01'. After running MSCK REPAIR TABLE, that worked, but it isn't what I want: I would have to create a calendarday folder for every day ('2024-08-01', '2024-08-02', '2024-08-03', etc.). I want my Glue script to recognize 2024/08/01 as the calendarday partition value automatically, without needing the key=value format. Is this achievable?

Any ideas on what might be causing this issue or how to resolve it?

1 Answer

The issue you're experiencing stems from the difference in how Athena and AWS Glue handle partition projection.

When you use partition projection in Athena, it's important to understand that Athena ignores any partition metadata registered in the AWS Glue Data Catalog. Instead, Athena uses the table properties you've defined to determine partition values and locations. This is why you're able to query the data successfully in Athena despite not seeing any partitions in the AWS Glue console.

However, AWS Glue jobs don't use Athena's partition projection. They rely on the actual partition metadata stored in the AWS Glue Data Catalog. Since your table is using partition projection and there are no partitions registered in Glue, your Glue job is unable to resolve the partition columns, leading to the error you're seeing.

To resolve this issue, you have a few options:

  1. If you want to continue using partition projection with Athena while also being able to query the data with Glue jobs, you'll need to manually add the partitions to the AWS Glue Data Catalog. This can be done using the AWS Glue API or by running MSCK REPAIR TABLE in Athena after adjusting your S3 path to use the key=value format (as you've already discovered).

  2. If you prefer not to use the key=value format in your S3 path, you could create a custom classifier in AWS Glue to recognize your current S3 path structure. Then use a Glue crawler to populate the partition metadata in the Data Catalog.

  3. Another option is to modify your Glue job to read directly from S3 instead of using the Glue Data Catalog. This way, you can specify the S3 path and partition structure explicitly in your Glue job code.
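For option 1, registering the partitions via the Glue API can be scripted with boto3 rather than done by hand, and it works with the existing yyyy/MM/dd folder layout (no key=value renaming needed). A rough sketch, assuming the JSON SerDe and the bucket/paths from the question; the partition values must be listed in the same order as the table's partition keys (account, source, region, calendarday):

```python
from datetime import date, timedelta


def build_partition_inputs(account, source, region, start, end,
                           bucket="mybucket"):
    """Build one PartitionInput per day in [start, end] for BatchCreatePartition.

    The Location mirrors the table's storage.location.template:
    s3://<bucket>/<account>/<source>/<region>/yyyy/MM/dd/
    """
    inputs = []
    day = start
    while day <= end:
        calendarday = day.strftime("%Y/%m/%d")
        inputs.append({
            # Values are positional: same order as the table's partition keys.
            "Values": [account, source, region, calendarday],
            "StorageDescriptor": {
                "Location": f"s3://{bucket}/{account}/{source}/{region}/{calendarday}/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                },
            },
        })
        day += timedelta(days=1)
    return inputs


def register_partitions(database, table, partition_inputs):
    """Register partitions in the Glue Data Catalog in batches."""
    import boto3  # imported here so the builder above needs no AWS access

    glue = boto3.client("glue")
    # BatchCreatePartition accepts at most 100 PartitionInput entries per call.
    for i in range(0, len(partition_inputs), 100):
        glue.batch_create_partition(
            DatabaseName=database,
            TableName=table,
            PartitionInputList=partition_inputs[i:i + 100],
        )
```

A nightly job (or an S3-event-triggered Lambda) running this keeps the catalog in sync, so Athena's projection and Glue jobs can both work against the same unmodified S3 layout.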

It's worth noting that the partition projection feature in Athena is designed to improve query performance by avoiding the need to retrieve partition metadata from the Glue Data Catalog. However, this optimization is specific to Athena and isn't recognized by other services like AWS Glue.

In your specific case, where you want Glue to recognize 2024/08/01 as the calendarday partition column automatically without the key=value format, you might need to implement a custom solution in your Glue job. This could involve parsing the S3 path to extract the date components and constructing the calendarday value programmatically.
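As a sketch of that approach: the job can bypass the catalog, build the day's S3 prefix itself, read it with create_dynamic_frame.from_options, and recover calendarday from the object key. The bucket name and the path regex below are assumptions based on the layout described in the question:

```python
import re
from datetime import date

# Layout assumed from the question:
#   s3://mybucket/<account>/<source>/<region>/yyyy/MM/dd/<file>
DAY_RE = re.compile(r"^[^/]+/[^/]+/[^/]+/(\d{4}/\d{2}/\d{2})/")


def partition_path(bucket, account, source, region, day):
    """S3 prefix for a single day, matching storage.location.template."""
    return f"s3://{bucket}/{account}/{source}/{region}/{day:%Y/%m/%d}/"


def calendarday_from_key(key):
    """Recover the calendarday value ('yyyy/MM/dd') from an S3 object key."""
    m = DAY_RE.match(key)
    return m.group(1) if m else None


# Inside the Glue job, instead of from_catalog (untested sketch):
# frame = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options={
#         "paths": [partition_path("mybucket", "xxxxxxx", "CloudTrail",
#                                  "us-west-2", date(2024, 8, 1))],
#         "recurse": True,
#     },
#     format="json",
# )
```

Because the path itself selects the partition, no pushdown predicate is needed; if the calendarday value is required as a column, it can be derived in Spark from input_file_name() with a regex equivalent to DAY_RE.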
Sources
- Use partition projection with Amazon Athena - Amazon Athena
- Troubleshoot Athena partition projection issues | AWS re:Post
- Troubleshoot Athena queries that have new partitions | AWS re:Post

answered a month ago
