Glue ETL job from DynamoDB to S3, loading only selected data


I have a working lab setup in which a Glue job extracts all data from a single DynamoDB table to S3 in JSON format. I built it with the simplest possible setup, using the AWS Glue DynamoDB connector, entirely through the Glue visual editor. I plan to run the job daily to refresh the data. The job is set up with Glue 3.0 and Python 3. Two questions:

  1. I assume I need to purge/delete the S3 objects written by the previous ETL run each night. Can this be done within Glue, or do I need to handle it outside of Glue?
  2. I would like to update the job so that only DynamoDB records with a specific key/value (status <> 'completed') are sent to S3, so that I am not loading all of the DynamoDB data into my target. I don't mind if the job has to read the whole table during extract and then filter it during the transform phase. If there is a way to select the data during the extract phase, even better.
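
To make the two questions concrete, here is a rough sketch of the kind of script I imagine, not a verified solution. It assumes the job is converted from the visual editor to a script, that `GlueContext.purge_s3_path` and the `Filter` transform behave as documented, and that Glue records allow dict-style access to `status`. The names `is_not_completed`, `filter_and_write`, `glue_context`, `source_frame` and `target_path` are placeholders of my own, and keeping items that have no `status` attribute is my assumption.

```python
def is_not_completed(record):
    """Return True for records whose status is anything other than 'completed'."""
    try:
        return record["status"] != "completed"
    except KeyError:
        # Assumption: items with no status attribute should still be exported.
        return True


def filter_and_write(glue_context, source_frame, target_path):
    """Purge the previous output, then write only the non-completed records."""
    # Imported here so the predicate above can be tried outside of Glue.
    from awsglue.transforms import Filter

    # Question 1: remove objects left under the target prefix by the last run.
    # retentionPeriod is given in hours; 0 is intended to keep nothing.
    glue_context.purge_s3_path(target_path, options={"retentionPeriod": 0})

    # Question 2: the whole table is still read; rows are dropped in transform.
    filtered = Filter.apply(frame=source_frame, f=is_not_completed)

    glue_context.write_dynamic_frame.from_options(
        frame=filtered,
        connection_type="s3",
        connection_options={"path": target_path},
        format="json",
    )
    return filtered


# Local check of the predicate with plain dicts standing in for table items.
sample = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "completed"},
    {"id": 3},
]
kept = [item["id"] for item in sample if is_not_completed(item)]
```

This version filters in the transform phase, so it does not reduce what is read from DynamoDB. Whether the filter can be pushed into the extract phase is still the open part of question 2.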

If anyone could advise with a simple example, I would appreciate it. I have looked around a little but haven't found much good educational material, so I am happy to take suggestions there as well (other than the AWS documentation, which I already have; I need some initial direction, a reference, or a hands-on 101).

Asked a year ago · 149 views
No answers
