Can I verify that Glue is using push_down_predicate?


Hello,

I have a Glue table based on parquet files in S3 that is partitioned by 3 columns:

  • event_date_year
    • event_date_month
      • event_date_day

I want to select data for a list of given event dates, so something like

"(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"

How can I enforce and/or verify that these filters will actually be pushed down to the partition level / S3? I am using this code to create the dynamic frame:

dyf1 = glueContext.create_dynamic_frame.from_catalog(
    database='dbname', table_name='tablename',
    push_down_predicate=(
        "(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') "
        "OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') "
        "OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"
    )
)
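For a longer list of event dates, the predicate string could be generated instead of typed by hand. This is only a sketch: the helper name build_partition_predicate is made up for illustration, and it assumes the partition values are zero-padded strings exactly as in the example above.

```python
from datetime import date


def build_partition_predicate(event_dates):
    """Build an OR-joined predicate over the three partition columns.

    Hypothetical helper; assumes partition values are zero-padded strings
    such as '2024', '03', '22'.
    """
    if not event_dates:
        # An empty predicate would mean "no filter", so refuse it explicitly.
        raise ValueError("at least one event date is required")
    clauses = []
    for d in event_dates:
        clauses.append(
            f"(event_date_year = '{d:%Y}' "
            f"AND event_date_month = '{d:%m}' "
            f"AND event_date_day = '{d:%d}')"
        )
    return " OR ".join(clauses)


predicate = build_partition_predicate(
    [date(2024, 3, 22), date(2024, 4, 2), date(2024, 4, 4)]
)
print(predicate)
```

The resulting string can then be passed as push_down_predicate=predicate in the from_catalog() call.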

When I run dyf1.toDF().explain(), I only see "Scan ExistingRDD" and nothing about predicate pushdown. Then again, the reason might be that I specified the pushdown on the Glue dynamic frame, so Spark might not know anything about it...

I also tried to use .filter() and .where() on the Spark DataFrame and then ran the explain() again, but then the explain output always shows something like

  • Filter ...
    • Scan ExistingRDD...

which makes me think it is scanning the full data and not doing predicate pushdown...

How can I verify whether a pushdown is really happening? Can I assume it is if the glueContext.create_dynamic_frame.from_catalog() call with the push_down_predicate parameter does not throw an error?

Thanks, Mark

Mark
Asked a month ago, 145 views
1 answer

That's a good question. The only way I can think of is to use a filter that doesn't return any data and compare how long the job takes with and without the pushdown. This assumes the table has enough partitions for the difference to be measurable; if it doesn't, the pushdown doesn't really matter.
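A rough sketch of that comparison, in case it helps. The helper names timed and compare_pushdown are made up for illustration, glue_context is assumed to be an already initialised GlueContext inside a Glue job, and the "without pushdown" run applies the same condition as a Spark DataFrame filter after the whole table has been loaded.

```python
import time


def timed(label, action):
    """Run action(), print the wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s, result={result}")
    return result, elapsed


def compare_pushdown(glue_context, database, table_name, predicate):
    """Count rows matching predicate with and without push_down_predicate."""

    def with_pushdown():
        dyf = glue_context.create_dynamic_frame.from_catalog(
            database=database,
            table_name=table_name,
            push_down_predicate=predicate,
        )
        return dyf.count()

    def without_pushdown():
        dyf = glue_context.create_dynamic_frame.from_catalog(
            database=database, table_name=table_name
        )
        # Same condition, but applied by Spark only after the table is loaded.
        return dyf.toDF().filter(predicate).count()

    _, with_seconds = timed("with pushdown", with_pushdown)
    _, without_seconds = timed("without pushdown", without_pushdown)
    return with_seconds, without_seconds


# Inside a Glue job, with a predicate that matches no partition, e.g.:
# compare_pushdown(glueContext, 'dbname', 'tablename', "event_date_year = '1900'")
```

Both runs should report a count of 0 for such a predicate; if the pushdown is effective, the first run should finish much faster on a table with many partitions. Timings vary between runs, so it is worth repeating the comparison a few times.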

AWS
Expert
Answered a month ago
