How to provide filter with date in glue python

I want to filter the table _test_rw_omuc_baag__test1_app on Modtime > '2019-01-01 00:00:00' in my Glue Python script, but I am not sure how to do it. I generate this Glue Python script from CDK code; I am not using the Glue GUI to generate it.

Below is my code:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Script generated for node AWS Glue Data Catalog AKS
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="test",
    table_name="_test_rw_omuc_baag__test1_app",
    additional_options={
        "jobBookmarkKeys": ["seps"],
        "jobBookmarkKeysSortOrder": "asc",
        "enableUpdateCatalog": False
    },
    transformation_ctx="AWSGlueDataCatalog_node"
)

# Convert DynamicFrame to DataFrame 
df_app = AWSGlueDataCatalog_node.toDF()

# Repartition the DataFrame to control output files APP
df_repartitioned_app = df_app.repartition(10)  

# Check for empty partitions and write only if data is present
if not df_repartitioned_app.rdd.isEmpty():
    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "|") \
        .save(output_path_app)
asked a year ago · 472 views
2 Answers
Accepted Answer

One solution is to use the filter method on the DataFrame:

# Script generated for node AWS Glue Data Catalog AKS
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="test",
    table_name="_test_rw_omuc_baag__test1_app",
    additional_options={
        "jobBookmarkKeys": ["seps"],
        "jobBookmarkKeysSortOrder": "asc",
        "enableUpdateCatalog": False
    },
    transformation_ctx="AWSGlueDataCatalog_node"
)
# Convert DynamicFrame to DataFrame
df_app = AWSGlueDataCatalog_node.toDF()
# Apply filter for Modtime
from pyspark.sql.functions import col
df_filtered_app = df_app.filter(col("Modtime") > '2019-01-01 00:00:00')

This works well. A more efficient option, if your data is already partitioned, is to use push_down_predicate. Here is an example:

AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="test",
    table_name="_test_rw_omuc_baag__test1_app",
    push_down_predicate="Modtime > '2019-01-01 00:00:00'",
    additional_options={
        "jobBookmarkKeys": ["seps"],
        "jobBookmarkKeysSortOrder": "asc",
        "enableUpdateCatalog": False
    },
    transformation_ctx="AWSGlueDataCatalog_node"
)

You can find more details on push_down_predicate in the documentation[1]. In general, push_down_predicate applies the filter directly to the partition metadata in the Data Catalog, so non-matching partitions are never read, which is more efficient. Note that it only works on partition columns.
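The efficiency gain comes from pruning at the partition level: Glue evaluates the predicate against each partition's key values and skips non-matching partitions entirely, instead of reading every file and then discarding rows. A toy sketch of the idea in plain Python (the partition key, bucket, and file names are made up for illustration):

```python
# Hypothetical catalog partitions, keyed by a partition column "year".
partitions = {
    "year=2017": ["s3://bucket/tbl/year=2017/part-0.csv.gz"],
    "year=2018": ["s3://bucket/tbl/year=2018/part-0.csv.gz"],
    "year=2019": ["s3://bucket/tbl/year=2019/part-0.csv.gz"],
    "year=2020": ["s3://bucket/tbl/year=2020/part-0.csv.gz"],
}

# push_down_predicate-style pruning: decide from metadata alone which
# partitions to read; files in skipped partitions are never opened.
def prune(partitions, min_year):
    return [f
            for key, files in partitions.items()
            if int(key.split("=")[1]) >= min_year
            for f in files]

print(prune(partitions, 2019))  # only the 2019 and 2020 files remain
```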

Reference:

[1] - https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-pushdowns

AWS
answered a year ago
AWS EXPERT · reviewed 9 months ago
  • thanks, it's working!

You can add a filter to your Glue Python script by converting the DynamicFrame to a DataFrame, filtering it with the filter method, and wrapping the result back into a DynamicFrame. After creating the DynamicFrame from the catalog, add the following lines:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col

# Apply filter to DynamicFrame
filtered_dynamic_frame = DynamicFrame.fromDF(
    AWSGlueDataCatalog_node.toDF().filter(col("Modtime") > "2019-01-01 00:00:00"),
    glueContext,
    "filtered_dynamic_frame"
)

# Convert filtered DynamicFrame to DataFrame 
df_app = filtered_dynamic_frame.toDF()

This code first converts the DynamicFrame to a DataFrame, applies the filter using Spark SQL functions, and then converts it back to a DynamicFrame. The filter condition col("Modtime") > "2019-01-01 00:00:00" will keep only the records where the Modtime is after January 1, 2019.

Make sure to import the necessary functions at the beginning of your script. After applying this filter, you can proceed with your existing code for repartitioning and writing the data.

Remember to adjust the column name "Modtime" if it's different in your actual data. Also, ensure that the Modtime column in your data is in a format that can be compared with the string date you're providing.
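On that last point: a "YYYY-MM-DD HH:MM:SS" string happens to compare correctly even as a plain string, because ISO-8601 timestamps sort lexicographically, but any other format silently gives wrong results and should be parsed first (in Spark, e.g. with to_timestamp). A plain-Python illustration of the pitfall (the sample values are made up):

```python
from datetime import datetime

cutoff_str = "2019-01-01 00:00:00"

# ISO-8601 timestamps sort lexicographically, so string comparison is safe:
print("2019-06-15 08:30:00" > cutoff_str)   # True: correctly after the cutoff

# A non-ISO format breaks this: "12/31/2019" starts with '1' < '2', so
# string comparison wrongly says it is before the 2019-01-01 cutoff.
raw = "12/31/2019 23:59:59"
print(raw > cutoff_str)                      # False, which is wrong

# Parsing to real timestamps first gives the correct answer:
parsed = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S")
print(parsed > datetime(2019, 1, 1))         # True
```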

Sources
DynamicFrame class - AWS Glue
Filter class - AWS Glue

answered a year ago
