What's the best way to filter out duplicated records in a Glue ETL Job with bookmarking enabled?
I have an ETL pipeline that loads JSON data from a source bucket, runs a Glue ETL job with bookmarking enabled, and writes the output as Parquet to a target bucket.
I'd like to ensure that the target bucket never contains duplicate records. What's the best way to achieve that?
My JSON records have a "requestid" field that is unique.
asked 6 months ago · 288 views
1 Answer
If I understand correctly, you are afraid that some of the new files (the old ones are filtered out by the bookmark) contain duplicates. Is my assumption correct?
If that is correct, you have a couple of different approaches:
- If you want to remove duplicates based on the entire row (all columns match), you can use the Spark distinct() function.
- If you want to remove rows that are duplicated based on your key, and you don't care which specific row is kept, you can use the Spark dropDuplicates() function.
- If you have another column you could use to pick the row you want to keep (for example a datetime column with last_updated values), you can use the Spark window function row_number(), partitioning by your key and ordering by the datetime column descending, and then keep only the rows with rownum = 1.
A sketch of all three approaches is shown below.
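Here is a minimal PySpark sketch of the three options. The requestid key comes from your question; the last_updated column and the sample rows are just assumptions for illustration:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("glue-dedupe-sketch").getOrCreate()

# Stand-in for the bookmark-filtered batch; schema and values are made up.
df = spark.createDataFrame(
    [("r1", "2024-01-02", "a"),
     ("r1", "2024-01-01", "b"),
     ("r2", "2024-01-01", "c")],
    ["requestid", "last_updated", "payload"],
)

# 1) Remove rows that are identical across ALL columns.
fully_distinct = df.distinct()

# 2) Remove rows sharing a requestid, keeping an arbitrary one.
by_key = df.dropDuplicates(["requestid"])

# 3) Keep the newest row per requestid, ranked by last_updated.
w = Window.partitionBy("requestid").orderBy(col("last_updated").desc())
latest = (df.withColumn("rownum", row_number().over(w))
            .filter(col("rownum") == 1)
            .drop("rownum"))
```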
For an example of the first two points, you can look here.
Hope this helps!
@fabioaws OK, that makes sense. I also realized I could just filter out dupes within Athena? I built my first fully automated ETL pipeline, so I'm still grokking a few concepts. I understand now that with bookmarking the input data is only a subset, so I can't filter against the full target data source without querying it and then running .distinct()/filter(), etc. That seems like it may be overkill or inefficient versus filtering it out within Athena or QuickSight itself?
Thanks!
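If the goal is for the target bucket itself to never contain a duplicate (rather than deduplicating at query time in Athena), one option is to anti-join each bookmark-filtered batch against the keys already written to the target before appending. A rough PySpark sketch, with hypothetical bucket paths substituted for your real ones:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anti-join-dedupe").getOrCreate()

# Hypothetical locations; substitute your real buckets/prefixes.
# Assumes the target already contains at least one Parquet file.
new_batch = spark.read.json("s3://my-source-bucket/incoming/")
existing_keys = spark.read.parquet("s3://my-target-bucket/data/").select("requestid")

# Keep only records whose requestid is not already in the target,
# then append the truly-new rows as Parquet.
truly_new = new_batch.join(existing_keys, on="requestid", how="left_anti")
truly_new.write.mode("append").parquet("s3://my-target-bucket/data/")
```

As the comment suggests, rescanning the full target on every run gets more expensive as the data grows, so deduplicating at read time in Athena or QuickSight can indeed be the simpler and cheaper option for smaller pipelines.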