Can I verify that Glue is using push_down_predicate?

Hello,

I have a Glue table based on parquet files in S3 that is partitioned by 3 columns:

  • event_date_year
    • event_date_month
      • event_date_day

I want to select data for a list of given event dates, so something like

"(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') 
OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"

How can I enforce and/or verify that these filters will actually be pushed down to the partition level / S3? I am using this code to create the dynamic frame:

dyf1 = glueContext.create_dynamic_frame.from_catalog(
    database='dbname',
    table_name='tablename',
    push_down_predicate=(
        "(event_date_year = '2024' AND event_date_month = '03' AND event_date_day = '22') "
        "OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '02') "
        "OR (event_date_year = '2024' AND event_date_month = '04' AND event_date_day = '04')"),
)
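
For a longer list of dates, the predicate string could be generated rather than typed by hand. This is a minimal sketch, assuming the partition values are stored as zero-padded strings as in the example above; `build_partition_predicate` is a hypothetical helper name, not part of the Glue API.

```python
from datetime import date


def build_partition_predicate(event_dates):
    """Build an OR-ed partition predicate from a list of datetime.date values.

    Assumes partition values are zero-padded strings ('2024', '03', '22').
    Returns an empty string for an empty list.
    """
    clauses = [
        f"(event_date_year = '{d.year:04d}' "
        f"AND event_date_month = '{d.month:02d}' "
        f"AND event_date_day = '{d.day:02d}')"
        for d in event_dates
    ]
    return " OR ".join(clauses)


predicate = build_partition_predicate(
    [date(2024, 3, 22), date(2024, 4, 2), date(2024, 4, 4)]
)
print(predicate)
```

The resulting string can then be passed as `push_down_predicate=predicate` in the `from_catalog()` call.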

When I run dyf1.toDF().explain(), I only see "Scan ExistingRDD" and nothing about predicate pushdown. Then again, the reason might be that I specified the pushdown on the Glue dynamic frame, so Spark might not know anything about it.

I also tried using .filter() and .where() on the Spark DataFrame and ran explain() again, but the output always shows something like

  • Filter ...
    • Scan ExistingRDD...

which makes me think it is scanning the full data and not doing predicate pushdown.

How can I verify whether a pushdown is really happening? Can I assume it is if glueContext.create_dynamic_frame.from_catalog() with the push_down_predicate parameter does not throw an error?

Thanks, Mark

Mark
asked 3 months ago, 176 views
1 Answer

That's a good question. The only way I can think of is to use a filter that doesn't return any data and compare the time it takes to run with and without pushdown. This assumes the table has enough partitions for the difference to be noticeable; if it doesn't, pushdown doesn't really matter.
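
The timing comparison could be sketched roughly as follows. This is only an outline under assumptions: it presumes an existing `glueContext`, reuses the `dbname` and `tablename` placeholders from the question, and relies on a partition value (for example `'1900'`) that is assumed not to exist in the table. `time_call` and `compare_pushdown` are hypothetical helper names, not Glue APIs.

```python
import time


def time_call(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare_pushdown(glue_context, database, table_name, empty_predicate):
    """Time a row count with the predicate pushed down and with it applied afterwards.

    empty_predicate should match no partition, e.g. "event_date_year = '1900'"
    (assumed not to exist in the table).
    """
    def with_pushdown():
        dyf = glue_context.create_dynamic_frame.from_catalog(
            database=database,
            table_name=table_name,
            push_down_predicate=empty_predicate,
        )
        return dyf.count()

    def without_pushdown():
        dyf = glue_context.create_dynamic_frame.from_catalog(
            database=database, table_name=table_name
        )
        # Same condition, but applied by Spark after the frame is created.
        return dyf.toDF().where(empty_predicate).count()

    rows_pushed, secs_pushed = time_call(with_pushdown)
    rows_full, secs_full = time_call(without_pushdown)
    return {
        "pushdown": (rows_pushed, secs_pushed),
        "no_pushdown": (rows_full, secs_full),
    }
```

In a Glue job this might be called as `compare_pushdown(glueContext, 'dbname', 'tablename', "event_date_year = '1900'")`. A clearly shorter time for the pushdown run suggests that partitions were pruned before reading; single runs are noisy, so repeating the measurement a few times is advisable.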

AWS
EXPERT
answered 2 months ago
