The emr-dynamodb-connector's DynamoDBStorageHandler is what EMR uses when Hive/Spark/MapReduce interacts with DynamoDB tables. It derives a maxParallelTasks property from the dynamodb.throughput.write.percent config, among other parameters. However, maxParallelTasks is a function of the MR engine (the connector was implemented when Hive on MR was the default), so if Tez or Spark is chosen as the Hive execution engine, the connector still works and writes still succeed, but the IOPS can no longer be controlled or estimated: the write rate then depends mostly on the input size and the EMR cluster's capacity. This can lead to throttling on the table (I'm not sure if those are the errors you're referring to).
As long as Hive on MR is available, the current suggestion for controlling write throughput is to use the MR engine and set it at runtime (add "set hive.execution.engine=mr;" before anything in the Hive script that interacts with DynamoDB). This does not require two EMR clusters unless the customer later moves to an EMR release that no longer ships Hive with MR support (every EMR release available today includes it, so continuing on those releases poses no issue).
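Putting the two settings together, a Hive script that writes to DynamoDB at a controlled rate might look like the sketch below. The table name, column names, and the 0.5 throughput value are illustrative assumptions, not part of the original answer; adjust them to your own table and provisioned capacity:

```sql
-- Force the MapReduce engine so maxParallelTasks (and the throughput cap) is honored
SET hive.execution.engine=mr;

-- Illustrative value: use at most 50% of the table's provisioned write capacity
SET dynamodb.throughput.write.percent=0.5;

-- Hypothetical external table mapped to a DynamoDB table named "orders"
CREATE EXTERNAL TABLE IF NOT EXISTS ddb_orders (
  order_id STRING,
  amount   DOUBLE
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "orders",
  "dynamodb.column.mapping" = "order_id:order_id,amount:amount"
);

-- With the MR engine active, this write is rate-limited per the percent setting above
INSERT OVERWRITE TABLE ddb_orders
SELECT order_id, amount FROM staging_orders;
```

The key point is ordering: the two SET statements must come before any statement that touches the DynamoDB-backed table, otherwise the default engine for that session is used.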
PS: Even with Hive on MR interacting with DynamoDB, EMR cluster capacity remains a secondary factor in how quickly the writes complete, but the maximum write rate will always be respected per the dynamodb.throughput.write.percent config you set.