Can I verify that Glue is using push_down_predicate?

0

Hello,

I have a Glue table based on parquet files in S3 that is partitioned by 3 columns:

  • event_date_year
    • event_date_month
      • event_date_day

I want to select data for a list of given event dates, so something like

"(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"

How can I enforce and/or verify that these filters will actually be pushed down to the partition level / S3? I am using this code to create the dynamic frame:

dyf1 = glueContext.create_dynamic_frame.from_catalog(
    database='dbname', table_name='tablename', 
    push_down_predicate=  "(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"
)

When I run dyf1.toDF().explain() , I only see "Scan ExistingRDD", but nothing about predicate pushdown. But then again, the reason might be that I specified the pushdown in the Glue dynamic frame, and Spark might not know anything about it...

I also tried to use .filter() and .where() on the Spark DataFrame and then ran the explain() again, but then the explain output always shows something like

  • Filter ...
    • Scan ExistingRDD... which makes me think like it's scanning the full data and not doing predicate pushdown...

How can I verify if a pushdown is really happening? Can I assume it is if the glueContext.create_dynamic_frame.from_catalog() with the push_down_predicate` parameter is not throwing an error?

Thanks, Mark

Mark
posta un mese fa150 visualizzazioni
1 Risposta
1

That's a good question, the only way I can think of is using a filter that doesn't return any data and compare the time it takes to run with and without pushdown (assuming the table has enough partitions so it makes a difference, if you don't then it doesn't really matter).

profile pictureAWS
ESPERTO
con risposta un mese fa

Accesso non effettuato. Accedi per postare una risposta.

Una buona risposta soddisfa chiaramente la domanda, fornisce un feedback costruttivo e incoraggia la crescita professionale del richiedente.

Linee guida per rispondere alle domande