Writing Data Directly from EMR into DynamoDB

A customer has explored writing data directly from EMR into DynamoDB. It works, with one issue: to run the DynamoDBStorageHandler, they have to switch the Hive execution engine from Tez to MR. We can do this, but it would mean running two EMR clusters. Can we confirm that their understanding is correct? They believe it is, because their tests failed on Tez but did not fail on MR. Also, is there a better way to do this that does not require two EMR clusters?

AWS (EXPERT), asked 4 years ago, 1409 views
1 Answer
Accepted Answer

The emr-dynamodb-connector's DynamoDBStorageHandler is what EMR uses when Hive, Spark, or MR interacts with DynamoDB tables. It derives a maxParallelTasks property from the dynamodb.throughput.write.percent config, among other parameters. maxParallelTasks is tied to the MR engine, since the connector was implemented when Hive on MR was the default. If the Tez or Spark engine is chosen for Hive, the connector still functions and the writes still happen, but the IOPS can no longer be controlled or estimated; the write rate then depends mainly on the input size and the EMR cluster capacity. This could lead to throttling on the table (not sure if those are the errors you refer to).
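For reference, a minimal sketch of how a Hive table is typically mapped to DynamoDB through the storage handler, with the write percentage set for the session. The Hive table name, DynamoDB table name, and column names are hypothetical, and 0.5 is only an illustrative value:

```sql
-- Fraction of the table's write capacity the job should aim to use
-- (0.5 is an illustrative value, not a recommendation).
SET dynamodb.throughput.write.percent=0.5;

-- Hypothetical Hive table mapped onto a hypothetical DynamoDB table "Orders".
CREATE EXTERNAL TABLE hive_ddb_orders (
  order_id STRING,
  amount   DOUBLE
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,amount:Amount"
);
```

Anything written to hive_ddb_orders goes to the DynamoDB table through the connector, which is where the throughput setting comes into play.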

As long as Hive on MR is available, the current suggestion for controlling write throughput is to use the MR engine and set it at runtime (add "set hive.execution.engine=mr;" before any statement in the Hive script that interacts with DynamoDB). This does not require two EMR clusters. That would change only if the customer moves to a later EMR release that no longer ships Hive with MR; all EMR releases today include it, so there is no issue as long as they continue to use those.
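As a rough sketch of the single-cluster approach, a Hive script can switch engines between statements. All table and column names here are hypothetical, and hive_ddb_orders is assumed to be a Hive table already mapped to DynamoDB through the DynamoDBStorageHandler:

```sql
-- Statements that do not touch DynamoDB can keep running on Tez.
SET hive.execution.engine=tez;

INSERT OVERWRITE TABLE staged_orders
SELECT order_id, SUM(amount) AS amount
FROM raw_orders
GROUP BY order_id;

-- Switch to MR before the statement that writes to DynamoDB,
-- so the connector can control the write rate.
SET hive.execution.engine=mr;

INSERT OVERWRITE TABLE hive_ddb_orders
SELECT order_id, amount
FROM staged_orders;
```

The engine setting applies per session from the point where it is set, so only the DynamoDB write needs to run on MR; the rest of the workload can stay on Tez in the same cluster.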

PS: Even with Hive on MR, EMR cluster capacity can be a secondary factor in how quickly the writes are made, but the maximum set by the dynamodb.throughput.write.percent config is always respected.

AWS, answered 3 years ago
