Hi Fred,
As you may know, the Glue connection to DynamoDB is an abstraction over the DynamoDB BatchWriteItem API call, which writes in batches of up to 25 items per request. Because you are using postal_code as the partition key, you will receive this exception whenever two items in the same batch of 25 share the same postal_code.
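To make the failure mode concrete, here is a minimal pure-Python sketch that splits a list of items into groups of 25 and reports any group containing a repeated key. The helper names (`batches`, `duplicate_keys_per_batch`) are made up for illustration, and the way Glue actually orders and groups rows is not something this sketch reproduces; it only shows why a duplicate inside one batch is the problem, while the same key in two different batches is not rejected.

```python
from collections import Counter
from itertools import islice

BATCH_SIZE = 25  # BatchWriteItem accepts at most 25 put/delete requests per call


def batches(items, size=BATCH_SIZE):
    """Yield consecutive chunks of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def duplicate_keys_per_batch(items, key="postal_code"):
    """Return {batch_index: [key values that occur more than once in that batch]}."""
    problems = {}
    for index, chunk in enumerate(batches(items)):
        counts = Counter(item[key] for item in chunk)
        repeated = sorted(k for k, n in counts.items() if n > 1)
        if repeated:
            problems[index] = repeated
    return problems
```

Running something like this over a sample of your rows (in the order they would be written) can help confirm whether duplicates are really the cause before changing the job.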
Before writing out to the sink, you could convert your DynamicFrame to a DataFrame and call dropDuplicates on the postal_code column (or distinct, if the duplicated rows are identical in every column). You must then convert back to a DynamicFrame to use DynamoDB as a sink.
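deduped_df = df.dropDuplicates(["postal_code"])  # keep one row per postal_code, all columns retained
deduped_df.select("postal_code", "other", "other1").show()  # "other" and "other1" stand in for your remaining columns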
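The full round trip described above (DynamicFrame to DataFrame, deduplicate, back to DynamicFrame) might look roughly like the sketch below. The function name `dedupe_dynamic_frame` and the frame name "deduped" are placeholders of mine, not anything from your job; `toDF`, `dropDuplicates`, and `DynamicFrame.fromDF` are the standard Glue and PySpark calls. Note that dropDuplicates keeps an arbitrary row for each postal_code, so if it matters which row survives you would need to sort or aggregate first.

```python
def dedupe_dynamic_frame(dyf, glue_context, key="postal_code", name="deduped"):
    """Return a new DynamicFrame with at most one record per `key` value."""
    # Imported lazily: awsglue is only available inside the Glue job runtime.
    from awsglue.dynamicframe import DynamicFrame

    df = dyf.toDF()                        # DynamicFrame -> Spark DataFrame
    deduped_df = df.dropDuplicates([key])  # one (arbitrary) row per key value
    # DataFrame -> DynamicFrame, so it can be written to the DynamoDB sink
    return DynamicFrame.fromDF(deduped_df, glue_context, name)
```

In a job you would call it just before the write step, for example `clean = dedupe_dynamic_frame(my_dyf, glueContext)`, and pass `clean` to the DynamoDB sink instead of the original frame.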
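Another thing worth checking is that the CSV header rows are not being read in as data, since repeated header rows could also produce duplicates in the same batch. If your files begin with a header line, set the following format option when reading the CSV so that the first line is treated as column names rather than as a record:
'withHeader': True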
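For context, this is roughly where that option goes when reading CSV from S3 with a Glue context. It is a sketch that assumes your files do start with a header line, in which case setting `withHeader` to True makes Glue treat the first line as column names instead of a data row. The function name `read_csv_with_header` and the `s3_path` argument are placeholders for illustration.

```python
def read_csv_with_header(glue_context, s3_path):
    """Read CSV files under `s3_path`, treating the first line of each file as a header."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format="csv",
        # Without this, the header line is read as an ordinary record,
        # so every file contributes the same "postal_code" header row.
        format_options={"withHeader": True},
    )
```

If the job reads from a Data Catalog table instead, the header handling is normally defined on the table or its classifier rather than passed here.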
Hi Leeroy, thanks for the pointer. We realized that our deduplication stage didn't work well under certain conditions. We reinforced it, and our Glue process works well now. Best regards, Fred
