One solution is to use the filter method on a DataFrame:
# Script generated for node AWS Glue Data Catalog AKS
from pyspark.sql.functions import col

AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="test",
    table_name="_test_rw_omuc_baag__test1_app",
    additional_options={
        "jobBookmarkKeys": ["seps"],
        "jobBookmarkKeysSortOrder": "asc",
        "enableUpdateCatalog": False
    },
    transformation_ctx="AWSGlueDataCatalog_node"
)

# Convert the DynamicFrame to a DataFrame
df_app = AWSGlueDataCatalog_node.toDF()

# Apply the filter on Modtime
df_filtered_app = df_app.filter(col("Modtime") > "2019-01-01 00:00:00")
which is a good way to do it. A more efficient alternative is push_down_predicate, provided your data is already partitioned. For example:
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="test",
    table_name="_test_rw_omuc_baag__test1_app",
    push_down_predicate="Modtime > '2019-01-01 00:00:00'",
    additional_options={
        "jobBookmarkKeys": ["seps"],
        "jobBookmarkKeysSortOrder": "asc",
        "enableUpdateCatalog": False
    },
    transformation_ctx="AWSGlueDataCatalog_node"
)
You can find more details on push_down_predicate in the documentation[1]. In general, push_down_predicate applies the filter directly to the partition metadata in the Data Catalog, so partitions that do not match the predicate are never read, which is more efficient.
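To make the idea concrete, here is a conceptual sketch in plain Python (not the Glue API): with a pushdown predicate, Glue evaluates the condition against partition metadata and skips non-matching partitions entirely. The partition values and the "dt" partition column below are hypothetical.

```python
# Conceptual sketch only -- not the Glue API. With push_down_predicate,
# Glue compares the predicate against partition metadata in the Data Catalog
# and never scans partitions that fail the predicate.
# Hypothetical partition values for a table partitioned on a "dt" column:
partitions = ["2018-11-01", "2018-12-15", "2019-03-01", "2020-07-04"]
cutoff = "2019-01-01"

# Only these partitions would actually be read from storage:
partitions_read = [p for p in partitions if p > cutoff]
print(partitions_read)  # → ['2019-03-01', '2020-07-04']
```

Compare this with the plain DataFrame filter above, which reads every partition first and discards rows afterwards.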
Reference:
You can add a filter to your Glue Python script by filtering the underlying DataFrame and converting the result back to a DynamicFrame. After creating the DynamicFrame from the catalog, add the following lines:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col

# Apply the filter and wrap the result back into a DynamicFrame
filtered_dynamic_frame = DynamicFrame.fromDF(
    AWSGlueDataCatalog_node.toDF().filter(col("Modtime") > "2019-01-01 00:00:00"),
    glueContext,
    "filtered_dynamic_frame"
)

# Convert the filtered DynamicFrame to a DataFrame
df_app = filtered_dynamic_frame.toDF()
This code first converts the DynamicFrame to a DataFrame, applies the filter using Spark SQL functions, and then converts it back to a DynamicFrame. The filter condition col("Modtime") > "2019-01-01 00:00:00" keeps only the records where Modtime is after January 1, 2019.
Make sure to import the necessary functions at the beginning of your script. After applying this filter, you can proceed with your existing code for repartitioning and writing the data.
Remember to adjust the column name "Modtime" if it's different in your actual data. Also, ensure that the Modtime column in your data is in a format that can be compared with the string date you're providing.
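One reason the string comparison above works: if Modtime is stored as a zero-padded, year-first string ("YYYY-MM-DD HH:MM:SS"), lexicographic order matches chronological order. A quick plain-Python sanity check (the sample rows are made up):

```python
# If Modtime is a zero-padded "YYYY-MM-DD HH:MM:SS" string, comparing it
# as a string gives the same result as comparing timestamps, because the
# fields are ordered most-significant first. Sample data is hypothetical.
rows = [
    {"id": 1, "Modtime": "2018-12-31 23:59:59"},
    {"id": 2, "Modtime": "2019-06-15 12:30:00"},
    {"id": 3, "Modtime": "2020-01-01 00:00:00"},
]
cutoff = "2019-01-01 00:00:00"
kept = [r["id"] for r in rows if r["Modtime"] > cutoff]
print(kept)  # → [2, 3]
```

If Modtime is an actual timestamp column instead, Spark compares it against the string literal by casting, so either representation works; casting explicitly with to_timestamp is safer when the stored format varies.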
Sources
DynamicFrame class - AWS Glue
Filter class - AWS Glue

Thanks, it's working!