How to add file pattern in AWS Glue ETL job python script

Hello, I want to add a file pattern to an AWS Glue ETL job Python script so that it generates files in the S3 bucket matching the pattern dostrp*.csv.gz, but I could not find a way to specify this file pattern in the script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

outputbucketname = args['target_BucketName']

# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node188777777 = glueContext.create_dynamic_frame.from_catalog(
    database="xxxx",
    table_name="xxxx",
    transformation_ctx="AWSGlueDataCatalog_node188777777",
)

# Script generated for node Amazon S3
AmazonS3_node55566666 = glueContext.write_dynamic_frame.from_options(
    frame=AWSGlueDataCatalog_node188777777,
    connection_type="s3",
    format="csv",
    format_options={"separator": "|"},
    connection_options={"path": outputbucketname, "compression": "gzip", "partitionKeys": []},
    transformation_ctx="AmazonS3_node55566666",
)

job.commit()
asked a year ago · 254 views
1 Answer
1
Accepted Answer

Hi RahulD,

To add a file pattern in your AWS Glue ETL job's Python script, you can modify the connection_options to include a custom file name. AWS Glue doesn't directly support wildcard patterns for output filenames, but you can achieve a similar effect by building the file name dynamically in the script.

Here's how you can modify your script to include a file pattern like dostrp*.csv.gz:

import datetime

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
filename = f"dostrp{timestamp}.csv.gz"
AmazonS3_node55566666 = glueContext.write_dynamic_frame.from_options(
    frame=AWSGlueDataCatalog_node188777777, 
    connection_type="s3", 
    format="csv", 
    format_options={"separator": "|"}, 
    connection_options={
        "path": f"{outputbucketname}/{filename}", 
        "compression": "gzip", 
        "partitionKeys": []
    }, 
    transformation_ctx="AmazonS3_node5566677777"
)

If this approach doesn’t fully meet your needs, please provide additional details about your requirements. I'm here to help.

EXPERT
answered a year ago
AWS EXPERT
reviewed 9 months ago
  • Thanks Vitor. I tried the logic you mentioned, but unfortunately, although the Glue job run succeeded, it didn't generate the file in the S3 bucket. When I remove this logic, it does generate the file in the S3 bucket.

    I tried all of the following variants of the filename variable, but none of them worked:

    filename = f"dostrp{timestamp}.csv.gz"
    filename = f"dostrp{timestamp}.gz"
    filename = f"dostrp{timestamp}"

    I had already provided the options below to generate a gzip file in CSV format. Does that have any impact, or what might be the issue? format="csv", connection_options={"path": outputbucketname, "compression": "gzip"}

    Currently, without your logic, it generates a file named like this: run-1723993970803-part-r-00005.gz

  • Hi @Vitor, also, when I unzip the .gz file from the S3 bucket, the extracted file has the generic "file" type rather than CSV. Can you please tell me what might be the issue?

  • Try this approach for custom file naming:

    import datetime

    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    filename = f"dostrp{timestamp}.csv"
    output_path = f"{outputbucketname}/{filename}.gz"
    
    AmazonS3_node55566666 = glueContext.write_dynamic_frame.from_options(
        frame=AWSGlueDataCatalog_node188777777,
        connection_type="s3",
        format="csv",
        format_options={"separator": "|"},
        connection_options={"path": output_path, "compression": "gzip", "partitionKeys": []},
        transformation_ctx="AmazonS3_node5566677777"
    )

    If the issue persists, please provide more details.

  • Thanks, now it works as expected :)
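Editor's note: since Spark itself names the objects it writes (the run-...-part-r-00005.gz files mentioned above), another common workaround is to let Glue write the output normally and then rename the resulting part file with boto3 after the write. Below is a minimal sketch, assuming the frame was coalesced to a single partition; `build_output_key` and `rename_part_file` are hypothetical helper names, not Glue APIs, and the bucket/prefix values would come from your job arguments:

```python
import datetime
import fnmatch


def build_output_key(prefix: str = "dostrp") -> str:
    """Build a timestamped object key matching the dostrp*.csv.gz pattern."""
    ts = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    return f"{prefix}{ts}.csv.gz"


def rename_part_file(bucket: str, src_prefix: str, prefix: str = "dostrp") -> str:
    """Copy the single Spark part file under src_prefix to a dostrp*.csv.gz key,
    then delete the original. Assumes exactly one part file was written.
    boto3 is imported inside the function so the key helper above stays
    runnable without AWS credentials."""
    import boto3

    s3 = boto3.client("s3")
    objs = s3.list_objects_v2(Bucket=bucket, Prefix=src_prefix)["Contents"]
    part_key = next(o["Key"] for o in objs if "part-" in o["Key"])
    new_key = f"{src_prefix.rstrip('/')}/{build_output_key(prefix)}"
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": part_key},
        Key=new_key,
    )
    s3.delete_object(Bucket=bucket, Key=part_key)
    return new_key


# The key helper alone can be checked locally:
key = build_output_key()
print(fnmatch.fnmatch(key, "dostrp*.csv.gz"))  # True
```

The copy-then-delete step is needed because S3 has no native rename; this also guarantees the final object name matches the requested pattern regardless of how Spark names its part files.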
