
PUSHDOWN PREDICATE WITH SQL QUERY


Hello there! I am trying to create a kind of push_down_predicate from a SQL query, using the concepts from https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-pushdown.html.

I created this script, which I hope will filter the table based on the current day as an integer (yyyymmdd). But there are some exceptions where I need the day before, or two days before (if it is Sunday or Monday).

sample_query = """select * from table_needed where anomesdia = (
	case
		when DAYOFWEEK(current_date()) = 1 then YEAR(date_sub(current_date(), interval 1 DAY)) * 10000 + MONTH(date_sub(current_date(), interval 1 DAY)) * 100 + DAY(date_sub(current_date(), interval 1 DAY))
		when DAYOFWEEK(current_date()) = 2 then YEAR(date_sub(current_date(), interval 2 DAY)) * 10000 + MONTH(date_sub(current_date(), interval 2 DAY)) * 100 + DAY(date_sub(current_date(), interval 2 DAY))
		else YEAR(current_date()) * 10000 + MONTH(current_date()) * 100 + DAY(current_date())
	end
)
"""

final_table = glueContext.create_dynamic_frame.from_catalog(
    database="database",
    table_name="table_needed",
    push_down_predicate=sample_query,
    additional_options={
        "hashfield": "anomesdia",
        "enablePartitioningForSampleQuery": True
    }
)

Will it work?

asked 2 years ago · 934 views
3 Answers

The only thing you're missing is to test it... :D

Remember that AWS Glue uses a lazy evaluation approach, which means it doesn’t immediately execute the transformations when you define them. Instead, it records the transformations in the DynamicFrame and executes them when an action is called. So, without an action (like writing the data out or converting to a DataFrame), it’s hard to say if your script will work as expected.
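As a plain-Python analogy (not Glue itself, just a sketch of the same lazy-evaluation idea using a generator), no work happens until something consumes the result:

```python
# Lazy evaluation analogy: the "transformation" is only recorded when
# defined; it actually runs when an "action" consumes the pipeline.
log = []

def transform(rows):
    for r in rows:
        log.append(r)          # side effect proves when the work happens
        yield r * 2

pipeline = transform(range(3))  # transformation defined, nothing executed yet
assert log == []                # no rows processed so far
result = list(pipeline)         # the "action": forces evaluation
assert log == [0, 1, 2]
assert result == [0, 2, 4]
```

In Glue, the equivalent "action" would be something like calling `final_table.count()` or writing the DynamicFrame out with a sink, at which point any error in your predicate would surface.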

EXPERT
answered 2 years ago

Hello,

Here are a few things you may need to notice:

  1. 'sample_query' is a full SQL query, whereas push_down_predicate should be a simple filter condition (e.g., "column_name = value"), not a complete SQL statement.
  2. When you use 'sample_query' together with 'enablePartitioningForSampleQuery', your sampleQuery must end with 'AND' so that AWS Glue can append its partitioning conditions.
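For this question's case, one option (a minimal sketch, assuming the `anomesdia` partition column holds yyyymmdd integers as in the original script) is to compute the target date in Python and pass only the simple condition as the predicate:

```python
from datetime import date, timedelta

def partition_value(today: date) -> int:
    """Return the anomesdia value (yyyymmdd int) to read, mirroring the
    question's CASE logic: Sunday reads Saturday, Monday reads Saturday,
    any other day reads today."""
    # Python weekday(): Monday=0 ... Sunday=6 (unlike SQL DAYOFWEEK, Sunday=1)
    if today.weekday() == 6:        # Sunday -> one day back
        target = today - timedelta(days=1)
    elif today.weekday() == 0:      # Monday -> two days back
        target = today - timedelta(days=2)
    else:
        target = today
    return target.year * 10000 + target.month * 100 + target.day

predicate = f"anomesdia = {partition_value(date.today())}"

# Pass the simple condition instead of a full SQL query, e.g.:
# final_table = glueContext.create_dynamic_frame.from_catalog(
#     database="database",
#     table_name="table_needed",
#     push_down_predicate=predicate,
# )
```

This keeps all the weekday logic on the Python side, so Glue only ever sees a plain partition filter.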

Thank you

AWS
answered 2 years ago

The method you use to add pushdown predicates to the create_dynamic_frame.from_catalog calls depends on the data source. For data catalog tables that point to partitioned data on S3, you can use the push_down_predicate option to filter based on the table partitions (see "Get started managing partitions for Amazon S3 tables backed by the AWS Glue Data Catalog" on the AWS Big Data blog for a great intro to S3 table partitioning). For example, let's say that you have a table containing movie data from IMDB which is partitioned based on the movie's genre. Your Glue data catalog table might have information like this under the "Partitions" tab:

[Image: list of partitions for a movie table]

If you wanted to only fetch movies from this table where the primary_genre partition is "Action" or "Comedy", you would write a push_down_predicate option which references this partition:

dyf = glueContext.create_dynamic_frame.from_catalog(
    database='imdb',
    table_name='tv_shows_parquet',
    push_down_predicate='primary_genre=="Action" OR primary_genre=="Comedy"'
)

For data catalog tables that point to JDBC data sources, you use the sampleQuery option to specify a SQL query that is used to fetch data into the dynamic frame. You can reduce the number of rows by specifying one or more WHERE clauses in the sample query. Additionally, you can reduce the number of columns that are fetched by listing specific columns in the SELECT clause of the query. Using the same example as above, here's how you would write the pushdown predicate for a JDBC table:

dyf = glueContext.create_dynamic_frame.from_catalog(
    database='imdb',
    table_name='tv_shows_jdbc',
    additional_options = { 
        "enablePartitioningForSampleQuery": True,
        "hashfield": "id",
        "hashpartitions": 20,
        "sampleQuery": "SELECT * FROM tv_shows WHERE primary_genre in ('Action', 'Comedy') AND"
    }
)

Note that the sample query will use the table and column names from the source database, not the table and column names in the Glue Data Catalog!

You can find more information about both types of pushdown predicates in the "Optimizing reads with pushdown in AWS Glue ETL" chapter of the Glue User Guide.

AWS
EXPERT
answered 6 months ago
