Reading multiple CSV files whose names start with a specific string from an S3 bucket in AWS Glue

Hello all. I have multiple CSV files in an S3 bucket. All of the files have the same schema, and all of their names start with the string "DUP". I want to build an AWS Glue job that reads every file in the bucket whose name starts with "DUP". I have created a crawler that extracts the schema of these files and stores it in the Glue catalog. Is there a component in Glue that I can use to read all of these files, process them one by one, and store the processed files in another folder of the S3 bucket? I want a single Glue job that does all of this. Any answer or suggestion will be highly appreciated. Thank you.

  • Is there a reason you want to read the files one at a time? Generally it would be inefficient to do so in Spark.

1 Answer

Hi.

You can use the Glue DynamicFrame API. A DynamicFrame is similar to a Spark DataFrame, except that each record is self-describing, so no schema is required initially [1]. You can read the CSV files, process them, and store the processed files in another folder in the S3 bucket. Here is an outline of how to achieve this in an AWS Glue job:

  1. Create a new AWS Glue job and specify a data source for your input files using the create_dynamic_frame.from_catalog method. Provide the catalog database and table names where your crawler has stored the schema information.
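  2. Filter the dynamic frame so that it only includes records from files whose names start with "DUP", using the Filter transformation. Filter evaluates individual records rather than files, so the source file name has to be available as a field on each record.
  3. Perform any necessary transformations or processing on the dynamic frame using the transformation functions available in Glue.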
  4. Finally, use the write_dynamic_frame method to write the processed dynamic frame to the desired location in your S3 bucket.
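
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation, and it swaps in a different technique for steps 1 and 2: instead of reading from the catalog and filtering records afterwards, it selects the files at read time by passing a glob path to Spark's CSV reader, so all matching files are read in one pass. The helper `dup_glob`, the function `run_job`, the `dropDuplicates` processing step, and the assumption that every file has a header row are illustrative and not part of the original answer.

```python
def dup_glob(input_folder, name_prefix="DUP"):
    """Build a glob path matching every object in input_folder
    whose name starts with name_prefix (hypothetical helper)."""
    return f"{input_folder.rstrip('/')}/{name_prefix}*"


def run_job(input_folder, output_folder, name_prefix="DUP"):
    # Glue and Spark imports live inside the function so that dup_glob
    # can be used and tested outside a Glue environment.
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Steps 1-2 (swapped-in technique): read only the matching files.
    # Assumption: every CSV file has a header row.
    df = spark.read.csv(dup_glob(input_folder, name_prefix), header=True)

    # Step 3: placeholder processing; dropping exact duplicate rows is
    # only an illustration, replace it with the real transformations.
    processed_df = df.dropDuplicates()

    # Convert back to a DynamicFrame so the Glue writer can be used.
    processed = DynamicFrame.fromDF(processed_df, glue_context, "processed")

    # Step 4: write the processed data to another folder as CSV.
    glue_context.write_dynamic_frame.from_options(
        frame=processed,
        connection_type="s3",
        connection_options={"path": output_folder},
        format="csv",
    )
```

Inside a Glue job script this could be invoked as `run_job("s3://my-bucket/input", "s3://my-bucket/processed/")`, where both paths are hypothetical. Because Spark writes its output as a set of part files, the result is a processed dataset in the target folder rather than one output file per input file.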

With this setup, you can read all the CSV files whose names start with "DUP", process them, and store the processed files in another folder within the same S3 bucket. Remember to set up appropriate IAM roles and permissions so that your Glue job can access the necessary resources in your AWS environment. I hope this helps! Let me know if you have any further questions.

Thank you

References
[1] DynamicFrame Class: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.htm

AWS
answered 9 months ago
