Glue S3 clean staging area


Hi,

I'm writing an ETL job that needs to clean up its S3 staging area before inserting data; the staging area is also registered as a Glue Data Catalog table. Ideally the ETL should:

1. clean the S3 output folder
2. execute the transformation
3. write to the S3 output folder (the one cleaned in step 1 and associated with the Data Catalog table)
4. update the Glue Data Catalog

The solution should be as visual as possible rather than relying on custom code. Besides writing a custom transform inside Glue Studio or a job with a Python script, my idea was to use a Step Functions state machine to orchestrate the whole process (with some limitations). Any other ideas?

Thanks

EDIT - possible solution

If the data format is supported by Glue DataBrew, one possible approach is to use it: DataBrew has built-in capabilities to purge the output folder, and by default every run creates a new folder and updates the corresponding Glue Data Catalog entry to point only to the newly created folder. The orchestration could also be easily managed with Step Functions, without the need to write custom code.
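To illustrate the no-code orchestration idea, here is a minimal Step Functions state machine sketch that starts a DataBrew job and waits for it to finish via the service integration's `.sync` pattern. The job name `my-databrew-job` is a placeholder; check that the DataBrew integration is available in your region before relying on it.

```json
{
  "Comment": "Sketch: run a DataBrew job synchronously, no custom code",
  "StartAt": "RunDataBrewJob",
  "States": {
    "RunDataBrewJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::databrew:startJobRun.sync",
      "Parameters": { "Name": "my-databrew-job" },
      "End": true
    }
  }
}
```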

1 Answer

Hi Paolo,

I understand you would like to accomplish this ETL job with the visual editor as much as possible. Unfortunately, at this time there is no native purge "Action" that can be used to clear out the S3 staging area. However, a "Custom Transform" would be a good fit, and the GlueContext includes two methods you can use to easily purge the staging data: purge_table and purge_s3_path. This can all be done within the job itself, so I don't think a Step Functions state machine is necessary.
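As a rough sketch of that Custom Transform, the body below purges the staging prefix with purge_s3_path before returning the incoming DynamicFrameCollection unchanged. The bucket and prefix names are placeholders, and `retentionPeriod: 0` (hours) is assumed here so that all objects are deleted regardless of age:

```python
def build_staging_path(bucket, prefix):
    """Build the s3:// URI for the staging prefix (pure helper, no AWS calls)."""
    return "s3://{}/{}".format(bucket, prefix.strip("/"))

def MyTransform(glueContext, dfc):
    """Custom Transform body for Glue Studio: purge staging, pass data through.

    glueContext and dfc (a DynamicFrameCollection) are supplied by the
    generated Glue job script; no extra imports are needed inside the node.
    """
    # Delete every object under the (placeholder) staging prefix before
    # the downstream write node runs. retentionPeriod is in hours;
    # 0 means "purge everything now".
    glueContext.purge_s3_path(
        build_staging_path("my-bucket", "staging/output/"),
        options={"retentionPeriod": 0},
    )
    return dfc
```

Place this node upstream of the S3 target node so the purge happens before the new output is written.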

Let me know if you have any other questions.

Ref: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-purge_table

Loc D., AWS ProServe

answered a year ago
  • Hi, yes, that was my first approach, but since the users have no coding experience I looked at the Step Functions solution. I also thought of creating a single Glue job with a custom script to purge S3 paths, so the job could be reused inside a Glue workflow; however, it isn't possible to set dynamic input parameters other than mapping them at the workflow level, and the workflow might have to manage many paths to purge, with poor manageability in the long term. Thanks for the feedback.
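On the dynamic-parameter point, one workaround is to pass all paths in a single comma-separated job parameter and split it inside the job, so one script serves many workflows. The parameter name `--purge_paths` is an assumption; in a real Glue job you would typically read it with getResolvedOptions from awsglue.utils, but the parsing logic is the same and is shown here in plain Python:

```python
import sys

def parse_purge_paths(argv):
    """Return the list of s3:// URIs passed via a comma-separated
    --purge_paths job argument (hypothetical parameter name)."""
    for i, arg in enumerate(argv):
        if arg == "--purge_paths" and i + 1 < len(argv):
            return [p.strip() for p in argv[i + 1].split(",") if p.strip()]
    return []

# Inside the Glue job, each returned path would then be handed to
# glueContext.purge_s3_path in a loop.
paths = parse_purge_paths(sys.argv)
```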
