A collection of problems (and solutions) that we’ve faced to implement DynamoDB and Hive.
Writing to DynamoDB depends of # Reducers and # Write Capacity
When we start working with DynamoDB we found the Reducer phase run for a long time, ie: hours instead of expected minutes, without consuming the entire Dynamo’s Write Capacity.
Each AMI comes with a pre-configured quantity of reducers, you can check the default values here: Configure Hadoop Settings with a Bootstrap Action
We had to change
mapred.tasktracker.reduce.tasks.maximum variable, for us it made sense to double the number of reducers, once most of our jobs were simple and exports information to DynamoDB.
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \ --args "-m,mapred.tasktracker.reduce.tasks.maximum=4"
On our Hive script we define the number of reducer to a prime number, as recommended at this post: Optimizing for Large EMR Clusters
-- Force more reducer to write quicker to DynamoDB -- should not exceed our Write Capacity SET mapred.reduce.tasks=11; SET hive.exec.reducers.max=11; -- tell EMR to use 100% of the available write throughput (by default it will use 50%). -- Note that this can adversely affect the performance of other applications simultaneously using DynamoDB SET dynamodb.throughput.write.percent=1.0;
Hadoop uses Speculative Execution to gain performance, starting several maps or reducers tasks in parallel, keeping the one run finish first.
This behavior is desired if your mapper or reducer is not connecting with external resources that could starve, ie: Connecting directly on a MySQL datasource.
To prevent this behavior:
hive> SET mapred.map.tasks.speculative.execution = false; hive> SET mapred.reduce.tasks.speculative.execution = false;
Those ideas came from reading a lot of articles, more