1 Answer
If I understand correctly, you are concerned that some of the new files (the old ones are filtered out by the bookmark) contain duplicates. Is my assumption correct?
If so, you could take a couple of different approaches:
- If you want to remove duplicates based on the entire row (all columns match), you can use the Spark distinct() function.
- If you want to remove rows that are duplicated based on a matching key, and you don't need to keep a specific row, you can use the Spark dropDuplicates() function.
- If you have another column that you can use to decide which row to keep (for example a datetime column with last_updated values), you can use the Spark window function row_number(), partitioning by your key and ordering by the datetime column descending, and then keep only the rows with rownum = 1 (see the sketch after this list).
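To make the three options more concrete, here is a minimal PySpark sketch. The input path, the key column (order_id), and the datetime column (last_updated) are placeholders you would swap for your own schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Placeholder input: replace with the DataFrame your Glue job already builds.
df = spark.read.parquet("s3://my-bucket/incoming/")

# 1) Remove rows where every column matches another row exactly.
deduped_all_cols = df.distinct()

# 2) Remove rows that share the same key, keeping an arbitrary one.
deduped_by_key = df.dropDuplicates(["order_id"])

# 3) Keep only the most recent row per key, decided by a datetime column.
w = Window.partitionBy("order_id").orderBy(F.col("last_updated").desc())
latest_per_key = (
    df.withColumn("rownum", F.row_number().over(w))
      .filter(F.col("rownum") == 1)
      .drop("rownum")
)
```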
For example, for the first two points you can look here.
Hope this helps!
@fabioaws OK, that makes sense. I also realized I could just filter out dupes within Athena? I built my first fully automated ETL pipeline, so I'm still grokking a few concepts. I understand now that with bookmarking the input data is only a subset, so I can't filter against the full target data source without querying it and then running .distinct()/filter(), etc.? It seems like that may be overkill or inefficient vs. filtering it out within Athena or QuickSight itself?
Thanks!