What's the best way to filter out duplicated records in a Glue ETL Job with bookmarking enabled?


I have an ETL pipeline that loads JSON data from a source bucket, runs an ETL job with bookmarking enabled, and writes Parquet to a target bucket.

I'd like to ensure that the target bucket never contains duplicate records. What's the best way to achieve that?

My JSON records have a unique "requestid" field.

1 Answer

If I understand correctly, you are concerned that some of the new files (the old ones are filtered out by the bookmark) contain duplicates. Is my assumption correct?

If that is correct, there are a few approaches you could take:

  1. If you want to remove duplicates based on the entire row (all columns match), you can use the Spark distinct() function.
  2. If you want to remove rows that are duplicated on your key, and you don't care which of the matching rows is kept, you can use the Spark function dropDuplicates().
  3. If you have another column that identifies the row you want to keep (for example, a datetime column with last_updated values), you can use the Spark window function row_number(), partitioning by your key and ordering by the datetime column descending, and then keep only the rows with rownum = 1.

For examples of the first two points, you can look here.

Hope this helps.

AWS
EXPERT
answered 2 years ago
  • @fabioaws OK, that makes sense. I also realized I could just filter out dupes within Athena? I built my first fully automated ETL pipeline, so I'm still grokking a few concepts. I understand now that with bookmarking the input data is only a subset, so I can't filter against the full target data source without querying it and then running .distinct()/filter() etc.? Seems like that may be overkill or inefficient compared with filtering it out within Athena or QuickSight itself?

    Thanks!
