Can I verify that Glue is using push_down_predicate?

Question

Hello,

I have a Glue table based on parquet files in S3 that is partitioned by 3 columns:
- event_date_year
  - event_date_month 
    - event_date_day

I want to select data for a list of given event dates, so something like

"(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') 
    OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') 
    OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"

How can I enforce and/or verify that these filters will actually be pushed down to the partition level / S3?
I am using this code to create the dynamic frame:

dyf1 = glueContext.create_dynamic_frame.from_catalog(
    database='dbname', table_name='tablename',
    push_down_predicate=(
        "(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') "
        "OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') "
        "OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"
    ),
)

When I run `dyf1.toDF().explain()`, I only see `Scan ExistingRDD`, but nothing about predicate pushdown. Then again, the reason might be that I specified the pushdown in the Glue dynamic frame, so Spark might not know anything about it...

I also tried to use `.filter()` and `.where()` on the Spark DataFrame and then ran the explain() again, but then the explain output always shows something like
- `Filter ...`
  - `Scan ExistingRDD...`
which makes me think it's scanning the full data and not doing predicate pushdown...

How can I verify if a pushdown is really happening?
Can I assume it is happening if `glueContext.create_dynamic_frame.from_catalog()` with the `push_down_predicate` parameter does not throw an error?

Thanks, 
Mark

Answer

That's a good question. The only way I can think of is to use a filter that doesn't match any data and compare how long the read takes with and without the pushdown. This assumes the table has enough partitions for the difference to be noticeable; if it doesn't, the pushdown doesn't really matter anyway.
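
A rough sketch of that timing comparison, in case it helps. The helper names `timed_count` and `compare_pushdown` are made up for illustration; the only Glue calls assumed are `create_dynamic_frame.from_catalog()` with its `push_down_predicate` parameter (as in the question) and `count()` on the resulting dynamic frame. The Glue context is passed in, so nothing here is specific to one job.

```python
import time


def timed_count(load_frame):
    """Call load_frame(), count its rows, and return (row_count, seconds)."""
    start = time.perf_counter()
    row_count = load_frame().count()  # count() forces the read to actually run
    return row_count, time.perf_counter() - start


def compare_pushdown(glue_context, database, table_name, predicate):
    """Time the same catalog read with and without push_down_predicate."""
    reader = glue_context.create_dynamic_frame
    with_pushdown = timed_count(
        lambda: reader.from_catalog(
            database=database,
            table_name=table_name,
            push_down_predicate=predicate,
        )
    )
    without_pushdown = timed_count(
        lambda: reader.from_catalog(database=database, table_name=table_name)
    )
    return {"with_pushdown": with_pushdown, "without_pushdown": without_pushdown}
```

Inside the job you could call it as `compare_pushdown(glueContext, 'dbname', 'tablename', "event_date_year = '1900'")`, where `'1900'` is just a placeholder for a value that matches no partition. If the predicate is pushed down, the first read should return 0 rows much faster than the unfiltered read. Keep in mind that the unfiltered read scans the whole table, and that a single timing can be noisy, so it may be worth repeating the runs a few times.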
