Hello,
I have a Glue table based on parquet files in S3 that is partitioned by 3 columns:
I want to select data for a list of given event dates, so something like
"(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22')
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02')
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"
How can I enforce and/or verify that these filters will actually be pushed down to the partition level / S3?
I am using this code to create the dynamic frame:
dyf1 = glueContext.create_dynamic_frame.from_catalog(
database='dbname', table_name='tablename',
push_down_predicate= "(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"
)
When I run dyf1.toDF().explain()
, I only see "Scan ExistingRDD
", but nothing about predicate pushdown. But then again, the reason might be that I specified the pushdown in the Glue dynamic frame, and Spark might not know anything about it...
I also tried to use .filter()
and .where()
on the Spark DataFrame and then ran the explain() again, but then the explain output always shows something like
Filter ...
Scan ExistingRDD...
which makes me think like it's scanning the full data and not doing predicate pushdown...
How can I verify if a pushdown is really happening?
Can I assume it is if the glueContext.create_dynamic_frame.from_catalog() with the
push_down_predicate` parameter is not throwing an error?
Thanks,
Mark