Loading data into DynamoDB with Glue Studio fails with "Provided list of item keys contains duplicates"


Hi,

I have to load a CSV file containing around 200,000 postal codes, stored in S3, into a DynamoDB table. I'm trying two approaches: one based on Lambda, one based on Glue Studio.

In the Glue Studio case, the graph is pretty simple and consists of two nodes: the first one loads the data from S3, and the second one is a custom node that specifies the write options. The primary key is the postal code, with no sort key. The write call is as follows:

glueContext.write_dynamic_frame_from_options(
    frame=dfc,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "postalCodes"}
)

However, the run consistently fails after inserting the first 50 entries, with the error: Provided list of item keys contains duplicates. Yet the original list doesn't contain any duplicates.

So, what have I missed?

Thanks in advance,

Fred

asked 3 years ago, 2493 views
2 Answers
Accepted Answer

Hi Fred,

As you may know, the Glue connection to DynamoDB is an abstraction over the DynamoDB BatchWriteItem API call, which writes up to 25 items per request. Since you are using postal_code as the partition key, you will receive this exception whenever two items in the same batch of 25 contain the same postal_code.
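To make the batching behaviour concrete, here is a minimal pure-Python sketch that splits a list of keys into chunks of 25 and reports the keys repeated inside a single chunk. The helper names (`batches`, `duplicate_keys_per_batch`) are hypothetical, and the chunking is only an illustration: inside a real Glue job the rows are distributed across Spark partitions, so the actual batches will not necessarily follow file order.

```python
from collections import Counter

BATCH_SIZE = 25  # BatchWriteItem accepts at most 25 put/delete requests per call


def batches(keys, size=BATCH_SIZE):
    """Yield consecutive chunks of at most `size` keys."""
    chunk = []
    for key in keys:
        chunk.append(key)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def duplicate_keys_per_batch(keys):
    """Map batch index -> keys that occur more than once inside that batch."""
    report = {}
    for index, chunk in enumerate(batches(keys)):
        repeated = [key for key, count in Counter(chunk).items() if count > 1]
        if repeated:
            report[index] = repeated
    return report


# 30 distinct postal codes followed by one repeat of the last one:
# the repeat lands in the second batch together with the original.
keys = [f"{i:05d}" for i in range(30)] + ["00029"]
print(duplicate_keys_per_batch(keys))
```

Note that a key repeated in two different batches does not show up in this report: in that case the second write simply replaces the first item, and no exception is raised.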

Before writing to the sink, you could convert your DynamicFrame to a DataFrame and call either distinct or dropDuplicates on the postal_code column. You must then convert back to a DynamicFrame to use DynamoDB as a sink.

# df is the Spark DataFrame obtained from the DynamicFrame with toDF()
deduplicated = df.dropDuplicates(["postal_code"])  # keep one row per postal_code
deduplicated.show()
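The full round trip (DynamicFrame to DataFrame, deduplicate, back to DynamicFrame) could look roughly like the sketch below. The function name `drop_duplicate_keys`, its parameters, and the frame name "deduplicated" are assumptions for illustration; `toDF`, `dropDuplicates` and `DynamicFrame.fromDF` are the standard Glue and Spark calls.

```python
def drop_duplicate_keys(glue_context, frame, key="postal_code"):
    """Return a DynamicFrame holding one row per value of `key`."""
    # Imported lazily: the awsglue package is only available inside a Glue job.
    from awsglue.dynamicframe import DynamicFrame

    # DynamicFrame -> Spark DataFrame, then keep a single row per key value.
    deduplicated = frame.toDF().dropDuplicates([key])

    # Convert back so the result can be passed to the DynamoDB sink.
    return DynamicFrame.fromDF(deduplicated, glue_context, "deduplicated")
```

In a Glue Studio custom node this would be called on the input frame just before the write, for example `frame = drop_duplicate_keys(glueContext, frame)`. Keep in mind that dropDuplicates keeps an arbitrary row for each key, so if rows sharing a postal code differ in other columns you may want to decide explicitly which one to keep.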

Another thing worth checking is that the CSV header line is not being read in as a data row, since this could also produce duplicates in the same batch. The withHeader format option controls this when reading the CSV. It defaults to False, which treats the first line as data, so for a file that starts with a header line set:

'withHeader': True
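For context, this is roughly where the option goes when the CSV is read in a script rather than through the Glue Studio source node. As far as I understand the Glue CSV format options, `withHeader` set to True marks the first line of each file as a header, while the default of False reads it as an ordinary data row. The function name `read_csv_with_header` and the S3 path in the usage line are hypothetical.

```python
def read_csv_with_header(glue_context, s3_path):
    """Read the CSV files under `s3_path`, treating each first line as a header."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format="csv",
        # True: the first line names the columns and is not loaded as a row.
        format_options={"withHeader": True},
    )
```

Usage inside a Glue job would be along the lines of `frame = read_csv_with_header(glueContext, "s3://my-bucket/postal-codes/")`.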

AWS EXPERT
answered 3 years ago
EXPERT
reviewed 4 months ago

Hi Leeroy, thanks for the pointer. We realized that our deduplication stage didn't work properly under certain conditions. We reinforced it, and our Glue process works well now. Best regards, Fred

answered 3 years ago
