The only thing missing is to test it... :D
Remember that AWS Glue uses a lazy evaluation approach, which means it doesn’t immediately execute the transformations when you define them. Instead, it records the transformations in the DynamicFrame and executes them when an action is called. So, without an action (like writing the data out or converting to a DataFrame), it’s hard to say if your script will work as expected.
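The same lazy pattern can be illustrated with a plain Python generator (this is only an analogy, not Glue's own machinery): defining the transformation does no work, and nothing runs until something consumes the result, just as a DynamicFrame's recorded transformations only run when an action fires.

```python
log = []

def transform(rows):
    # Side effect records when a row is actually processed.
    for r in rows:
        log.append(r)
        yield r * 2

lazy = transform([1, 2, 3])   # transformation defined: no work done yet
assert log == []              # nothing has executed
result = list(lazy)           # "action": forces evaluation of the pipeline
assert log == [1, 2, 3]
assert result == [2, 4, 6]
```

In Glue terms, the `list()` call plays the role of an action such as writing the frame to a sink or calling `count()`.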
Hello,
Here are a few things you may need to notice:
- Your 'sampleQuery' is a full SQL query, whereas 'push_down_predicate' should be a simple filter condition (e.g., "column_name = 'value'"), not a complex SQL query.
- When you use 'sampleQuery' together with 'enablePartitioningForSampleQuery', your sample query must end with 'AND' so that AWS Glue can append its partitioning conditions.
Thank you
The method you use to add pushdown predicates to the create_dynamic_frame.from_catalog calls depends on the data source. For Data Catalog tables that point to partitioned data on S3, you can use the push_down_predicate option to filter based on the table partitions (see "Get started managing partitions for Amazon S3 tables backed by the AWS Glue Data Catalog" on the AWS Big Data blog for a great intro to S3 table partitioning). For example, let's say that you have a table containing movie data from IMDB which is partitioned based on the movie's genre. Your Glue Data Catalog table would list its partition keys (for example, primary_genre) under the "Partitions" tab.
If you wanted to only fetch movies from this table where the primary_genre partition is "Action" or "Comedy", you would write a push_down_predicate option which references this partition:
```python
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='imdb',
    table_name='tv_shows_parquet',
    push_down_predicate='primary_genre == "Action" OR primary_genre == "Comedy"'
)
```
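If you need to build a predicate like this from a list of partition values, a small helper keeps the quoting straight (this is a hypothetical convenience function, not part of the Glue API; it assumes the values themselves contain no quote characters):

```python
def partition_predicate(column, values):
    # Hypothetical helper: builds a Spark SQL "IN" predicate string
    # suitable for the push_down_predicate option.
    quoted = ", ".join("'{}'".format(v) for v in values)
    return "{} in ({})".format(column, quoted)

pred = partition_predicate("primary_genre", ["Action", "Comedy"])
# pred == "primary_genre in ('Action', 'Comedy')"
```

You would then pass `pred` as the `push_down_predicate` argument.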
For data catalog tables that point to JDBC data sources, you use the sampleQuery option to specify a SQL query that is used to fetch data into the dynamic frame. You can reduce the number of rows by specifying one or more WHERE clauses in the sample query. Additionally, you can reduce the number of columns that are fetched by listing specific columns in the SELECT clause of the query. Using the same example as above, here's how you would write the pushdown predicate for a JDBC table:
```python
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='imdb',
    table_name='tv_shows_jdbc',
    additional_options={
        "enablePartitioningForSampleQuery": True,
        "hashfield": "id",
        "hashpartitions": 20,
        "sampleQuery": "SELECT * FROM tv_shows WHERE primary_genre in ('Action', 'Comedy') AND"
    }
)
```
Note that the sample query will use the table and column names from the source database, not the table and column names in the Glue Data Catalog!
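Because the trailing AND is easy to forget, you could guard it with a small helper before passing the query in (a hypothetical sketch, not a Glue API; the simple suffix check is good enough for queries that don't already end in a word containing "AND"):

```python
def sample_query_for_partitioning(base_query):
    # Hypothetical helper: when enablePartitioningForSampleQuery is set,
    # AWS Glue appends its own partitioning conditions to the query,
    # so the query text must end with AND.
    q = base_query.rstrip()
    return q if q.upper().endswith("AND") else q + " AND"

q = sample_query_for_partitioning(
    "SELECT * FROM tv_shows WHERE primary_genre in ('Action', 'Comedy')"
)
# q now ends with " AND", ready for Glue to append its partition filter
```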
You can find more information about both types of pushdown predicates in the "Optimizing reads with pushdown in AWS Glue ETL" chapter of the Glue User Guide.