A collection of problems (and solutions) we’ve faced while integrating DynamoDB and Hive.
Writing to DynamoDB depends on # Reducers and # Write Capacity
When we started working with DynamoDB, we found the Reducer phase running for a long time, i.e., hours instead of the expected minutes, without consuming the table’s entire Write Capacity.
Each AMI comes with a pre-configured number of reducers; you can check the default values here: Configure Hadoop Settings with a Bootstrap Action.
We had to change the mapred.tasktracker.reduce.tasks.maximum variable. For us it made sense to double the number of reducers, since most of our jobs were simple and exported information to DynamoDB.
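As a sketch of how this can be done at cluster launch, the setting can be passed through the configure-hadoop bootstrap action described in the AWS documentation linked above. The exact CLI flags and the value 4 below are illustrative assumptions and may differ across CLI and AMI versions:

```
# Hypothetical example: raise the per-node reducer limit at cluster
# launch via the configure-hadoop bootstrap action (value is illustrative).
elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.tasktracker.reduce.tasks.maximum=4"
```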
In our Hive scripts we set the number of reducers to a prime number, as recommended in this post: Optimizing for Large EMR Clusters.
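The original snippet was lost here; a minimal sketch of such a setting in a Hive script, using the MRv1 property name and an arbitrary illustrative prime:

```
-- Force a prime number of reducers for this job.
-- 23 is an arbitrary example, not our production value.
SET mapred.reduce.tasks=23;
```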
Disable Speculative Execution when talking to external resources

Hadoop uses Speculative Execution to gain performance, starting several copies of the same map or reduce task in parallel and keeping the one that finishes first.
This behavior is desirable as long as your mappers or reducers are not connecting to external resources that could be starved, e.g., connecting directly to a MySQL data source, since the duplicated tasks also duplicate the load on that resource.
To prevent this behavior, disable speculative execution for both map and reduce tasks:
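A minimal sketch using the standard Hadoop MRv1 properties (the original snippet appears to have been lost in extraction):

```
-- Disable speculative execution so each task runs exactly once,
-- avoiding duplicate connections/writes to external resources.
SET mapred.map.tasks.speculative.execution=false;
SET mapred.reduce.tasks.speculative.execution=false;
```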
These ideas came from reading a lot of articles.