Can I verify that Glue is using push_down_predicate?

Hello,

I have a Glue table based on parquet files in S3 that is partitioned by 3 columns:

  • event_date_year
    • event_date_month
      • event_date_day

I want to select data for a list of given event dates, so something like

"(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"

How can I enforce and/or verify that these filters will actually be pushed down to the partition level / S3? I am using this code to create the dynamic frame:

predicate = (
    "(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') "
    "OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') "
    "OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"
)
dyf1 = glueContext.create_dynamic_frame.from_catalog(
    database='dbname', table_name='tablename', push_down_predicate=predicate
)
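
For a longer list of event dates, the predicate string could be generated instead of typed by hand. This is only a minimal sketch: the helper name build_partition_predicate is made up for illustration, and it assumes the partition values are zero-padded strings as in the example above.

```python
from datetime import date


def build_partition_predicate(dates):
    """Build an OR-joined predicate over the three partition columns.

    Hypothetical helper, not part of the Glue API. Assumes partition values
    are stored as zero-padded strings such as '2024', '03' and '22'.
    """
    clauses = [
        f"(event_date_year = '{d.year:04d}' "
        f"AND event_date_month = '{d.month:02d}' "
        f"AND event_date_day = '{d.day:02d}')"
        for d in dates
    ]
    return " OR ".join(clauses)


# The three dates from the question.
predicate = build_partition_predicate(
    [date(2024, 3, 22), date(2024, 4, 2), date(2024, 4, 4)]
)
print(predicate)
```

The resulting string can be passed as the push_down_predicate argument in the same way as the hand-written one.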

When I run dyf1.toDF().explain(), I only see "Scan ExistingRDD", but nothing about predicate pushdown. Then again, the reason might be that I specified the pushdown on the Glue dynamic frame, and Spark might not know anything about it...

I also tried to use .filter() and .where() on the Spark DataFrame and then ran the explain() again, but then the explain output always shows something like

  • Filter ...
    • Scan ExistingRDD...

which makes me think it's scanning the full data and not doing predicate pushdown...

How can I verify whether a pushdown is really happening? Can I assume it is as long as glueContext.create_dynamic_frame.from_catalog() with the push_down_predicate parameter does not throw an error?

Thanks, Mark

Mark
asked 3 months ago, 181 views
1 Answer

That's a good question. The only way I can think of is to use a filter that doesn't return any data and compare the time it takes to run with and without the pushdown predicate. This assumes the table has enough partitions for the difference to show; if it doesn't, pushdown doesn't really matter.
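
A rough sketch of that timing comparison might look like the following. The function names time_count and compare_pushdown are made up for illustration, and the predicate you pass in is assumed to match no partition at all (for example a year that does not exist in the table). The code expects an existing GlueContext from a running Glue job, and count() is used only to force Spark to actually execute the read.

```python
import time


def time_count(load_frame):
    """Call load_frame(), count its rows, and return (row_count, seconds)."""
    start = time.perf_counter()
    rows = load_frame().count()  # count() forces the lazy read to run
    return rows, time.perf_counter() - start


def compare_pushdown(glue_context, database, table_name, predicate):
    """Time the same no-match filter with and without push_down_predicate.

    Hypothetical helper: `predicate` should match no partition, so both
    variants return 0 rows and only the elapsed time differs.
    """
    # Variant 1: the predicate is handed to Glue when the frame is created.
    pushed = time_count(
        lambda: glue_context.create_dynamic_frame.from_catalog(
            database=database,
            table_name=table_name,
            push_down_predicate=predicate,
        )
    )
    # Variant 2: no pushdown; the same condition is applied as a Spark filter.
    filtered = time_count(
        lambda: glue_context.create_dynamic_frame.from_catalog(
            database=database,
            table_name=table_name,
        ).toDF().filter(predicate)
    )
    return {"pushdown": pushed, "spark_filter": filtered}
```

Inside a Glue job this could be called as compare_pushdown(glueContext, 'dbname', 'tablename', "event_date_year = '1900'"). If the pushdown variant is much faster on a table with many partitions, that suggests the partitions were pruned before any data was read. Timings are noisy, so it is worth repeating the run a few times.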

AWS
EXPERT
answered 3 months ago
