Hi Fred,
As you may know, the Glue connection to DynamoDB is an abstraction over the BatchWriteItem API call, which writes in batches of 25 items per request. Because you are using postal_code as the partition key, if two items in a batch of 25 contain the same postal_code, you will receive this exception.
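To illustrate the failure mode, here is a minimal sketch that calls BatchWriteItem directly through boto3; the table name and items are placeholders. DynamoDB rejects the whole batch before writing anything, because two requests target the same key:

import boto3

dynamodb = boto3.client("dynamodb")

# Both PutRequests use the same partition key value, so the call fails
# with a ValidationException ("Provided list of item keys contains duplicates")
dynamodb.batch_write_item(
    RequestItems={
        "my_table": [  # placeholder table name
            {"PutRequest": {"Item": {"postal_code": {"S": "12345"}}}},
            {"PutRequest": {"Item": {"postal_code": {"S": "12345"}}}},
        ]
    }
)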
Before writing out to the sink, you could convert your DynamicFrame to a DataFrame and call either distinct or dropDuplicates on the postal_code column. You must then convert back to a DynamicFrame to make use of DynamoDB as a sink.
# Keep one row per postal_code so no batch contains duplicate keys
deduped_df = df.dropDuplicates(["postal_code"])
deduped_df.show()
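A minimal sketch of the remaining round trip, assuming you already have a GlueContext named glueContext; the table name is a placeholder:

from awsglue.dynamicframe import DynamicFrame

# Convert the deduplicated DataFrame back to a DynamicFrame
deduped_dyf = DynamicFrame.fromDF(deduped_df, glueContext, "deduped_dyf")

# Write to the DynamoDB sink ("my_table" is a placeholder table name)
glueContext.write_dynamic_frame.from_options(
    frame=deduped_dyf,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "my_table"},
)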
Another thing worth checking is that you are not reading CSV header rows in as data; if several input files each contain a header row, those identical rows can land in the same batch and collide on the partition key. You can tell Glue to treat the first line of each file as a header with the following format option:
'withHeader': True
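For example, when creating the DynamicFrame from S3 (the bucket path below is a placeholder):

datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},  # placeholder path
    format="csv",
    format_options={"withHeader": True},
)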
Hi Leeroy, thanks for the pointer. We found that our deduplication stage didn't handle certain conditions correctly; after reinforcing it, our Glue process now runs without errors. Best regards, Fred