What's the best way to filter out duplicated records in a Glue ETL Job with bookmarking enabled?


I have an ETL pipeline that loads JSON data from a source bucket, runs a Glue ETL job with bookmarking enabled, and writes the result as Parquet to a target bucket.

I'd like to ensure that the target bucket never contains duplicate records. What's the best way to achieve that?

My JSON records have a "requestid" field that's unique.

1 Answer

If I understand correctly, you are concerned that some of the new files (the old ones are filtered out by the bookmark) contain duplicates. Is my assumption correct?

If that is correct, there are a few different approaches:

  1. If you want to remove duplicates based on the entire row (all columns match), you can use the Spark distinct() function.
  2. If you want to remove rows that are duplicated on your key, and you do not care which of the matching rows is kept, you can use the Spark dropDuplicates() function.
  3. If you have another column that identifies the row you want to keep (for example, a datetime column with last_updated values), you can use the Spark window function row_number(), partitioned by your key and ordered by the datetime column descending, and then keep only the rows with rownum = 1.

For examples of the first two points, you can look here.

Hope this helps.

answered 6 months ago
  • @fabioaws OK, that makes sense. I also realized I could just filter out dupes within Athena? I built my first fully automated ETL pipeline, so I'm still grokking a few concepts. I understand now that with bookmarking the input data is only a subset, so I can't filter against the full target data source without querying it and then running .distinct()/filter() etc.? That seems like it may be overkill or inefficient compared with filtering duplicates out within Athena or QuickSight itself?

