What's the best way to filter out duplicated records in a Glue ETL Job with bookmarking enabled?


I have an ETL pipeline that loads JSON data from a source bucket, runs a Glue ETL job with bookmarking enabled, and writes the output as Parquet to a target bucket.

I'd like to ensure that the target bucket never contains duplicate records. What's the best way to achieve that?

My JSON records have a "requestid" field that's unique.

1 Answer

If I understand correctly, you're concerned that some of the new files (the old ones are filtered out by the bookmark) contain duplicates. Is my assumption correct?

If that is correct, you have a couple of different approaches:

  1. If you want to remove duplicates based on the entire row (all columns match), you can use the Spark distinct() function.
  2. If you want to remove rows that are duplicated on your key, but you don't care which specific row is kept, you can use the Spark dropDuplicates() function.
  3. If you have another column that identifies the row you want to keep (for example a datetime column with last_updated values), you can use the Spark window function row_number(), partitioning by your key and ordering by the datetime column descending, then keep only the rows with rownum = 1. See the sketch after this list.
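
Here is a minimal PySpark sketch of those three options. The DataFrame name df and the columns requestid / last_updated are illustrative placeholders based on your description, not a confirmed schema:

```python
# Minimal sketch of the three deduplication options above.
# df, requestid and last_updated are placeholders based on the question.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("dedup-options").getOrCreate()
df = spark.read.json("s3://source-bucket/prefix/")  # placeholder path

# 1. Remove rows that are duplicated across every column
deduped_all_columns = df.distinct()

# 2. Remove rows that share the same requestid, keeping an arbitrary one
deduped_by_key = df.dropDuplicates(["requestid"])

# 3. Keep only the most recent row per requestid, ordered by last_updated
w = Window.partitionBy("requestid").orderBy(col("last_updated").desc())
latest_per_key = (
    df.withColumn("rownum", row_number().over(w))
      .filter(col("rownum") == 1)
      .drop("rownum")
)
```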

For an example of the first two points, you can look here.

Hope this helps.

AWS
Expert
Answered 2 years ago
  • @fabioaws OK, that makes sense. I also realized I could just filter out dupes within Athena? This is my first fully automated ETL pipeline, so I'm still grokking a few concepts. I understand now that with bookmarking the input data is only a subset, so I can't deduplicate against the full target data source without querying it and then running .distinct()/filter(), etc.? It seems that may be overkill or inefficient compared to filtering it out within Athena or QuickSight itself?

    Thanks!
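
For reference, a rough sketch of what that "query the target, then deduplicate" approach might look like; the bucket paths and the requestid column below are only placeholders, not the actual pipeline:

```python
# Rough sketch only: paths, bucket names, and the requestid column are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-against-target").getOrCreate()

new_records = spark.read.json("s3://source-bucket/new/")    # bookmarked subset
existing = spark.read.parquet("s3://target-bucket/data/")   # full target dataset

# Keep only new records whose requestid is not already present in the target,
# then append them to the target bucket
to_write = new_records.join(existing.select("requestid"), on="requestid", how="left_anti")
to_write.write.mode("append").parquet("s3://target-bucket/data/")
```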
