Reading multiple CSV files whose names start with a specific string from an S3 bucket in AWS Glue

1

Hello all, I have multiple CSV files in an S3 bucket. All the files have the same schema, and the names of all these CSV files start with the string "DUP". I want to build an AWS Glue job that can read all the files whose names start with "DUP" from the S3 bucket. I have created a crawler that extracts the schema of these files and stores it in the Glue catalog. Is there any component available in Glue that I can use to read all these files, process them one by one, and store the processed files in another folder of the S3 bucket? I want a single Glue job that can do that. Any answer or suggestion will be highly appreciated. Thank you.

  • Is there a reason you want to read the files one at a time? Generally it would be inefficient to do so in Spark.

1 Answer
0

Hi.

You can use the Glue DynamicFrame API. A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially [1]. You can read the CSV files, process them, and store the processed files in another folder in the S3 bucket. Here is an example of how to achieve this in an AWS Glue job.

  1. Create a new AWS Glue job and specify a data source for your input files using the create_dynamic_frame.from_catalog method. Provide the catalog database and table names where your crawler has stored the schema information.
  2. Filter the dynamic frame to only include records from files whose names start with "DUP" using the 'filter' transformation. Note that filter operates on records, not on files, so the frame needs a column that carries each record's source file name.
  3. Perform any necessary transformations or processing on the dynamic frame using the transformation functions available in Glue.
  4. Finally, use a write_dynamic_frame method (for example, write_dynamic_frame.from_options) to write the processed dynamic frame to the desired location in your S3 bucket.
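
The steps above can be sketched in a Glue PySpark script roughly as follows. This is a minimal sketch, not a tested job: the helper is_wanted_file, the function run_job, and the column name source_file are hypothetical names introduced for illustration. It converts the DynamicFrame to a Spark DataFrame and tags each record with Spark's input_file_name(), on the assumption that this function is populated for your source (it can come back empty when Glue groups many small files together). Step 3 is left as a placeholder.

```python
import posixpath

FILE_PREFIX = "DUP"  # file-name prefix from the question


def is_wanted_file(path: str, prefix: str = FILE_PREFIX) -> bool:
    """Return True when the file name (last path component) starts with prefix."""
    return posixpath.basename(path).startswith(prefix)


def run_job(database: str, table_name: str, output_path: str) -> None:
    # Glue and Spark imports live inside the function so the helper above
    # can be exercised outside a Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext
    from pyspark.sql.functions import col, input_file_name, udf
    from pyspark.sql.types import BooleanType

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Step 1: read the table that the crawler registered in the catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )

    # Step 2: tag each record with its source file, then keep "DUP" files.
    # Assumption: input_file_name() returns a non-empty path for this source.
    tagged = source.toDF().withColumn("source_file", input_file_name())
    wanted = udf(is_wanted_file, BooleanType())
    filtered = tagged.filter(wanted(col("source_file")))

    # Step 3: placeholder for the real processing; here only the helper
    # column is dropped again.
    processed = filtered.drop("source_file")

    # Step 4: write the result as CSV to another S3 folder.
    glue_context.write_dynamic_frame.from_options(
        frame=DynamicFrame.fromDF(processed, glue_context, "processed"),
        connection_type="s3",
        connection_options={"path": output_path},
        format="csv",
    )
```

In the job script you would then call run_job with your own catalog database, table name, and an output location such as a different prefix in the same bucket. Because Spark reads all matching files in parallel, this processes them together rather than one at a time.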

With this setup, you can read all the CSV files starting with "DUP", process them, and store the processed files in another folder within the same S3 bucket. Remember to set up appropriate IAM roles and permissions so that your Glue job can access the necessary resources in your AWS environment. I hope this helps! Let me know if you have any further questions.

Thank you

References
[1] DynamicFrame Class: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.htm

AWS
answered 9 months ago
