Mutable Ideas

Notes and ideas about Java, Scala, Big Data, NoSQL, Quality and Software Deploy

Hive & DynamoDB Pitfalls

A collection of problems (and solutions) that we faced while integrating DynamoDB and Hive.

Writing to DynamoDB depends on the number of reducers and the write capacity

When we started working with DynamoDB, we found that the reduce phase ran for a long time, i.e. hours instead of the expected minutes, without consuming DynamoDB's entire write capacity.

Each AMI comes with a pre-configured number of reducers; you can check the default values here: Configure Hadoop Settings with a Bootstrap Action

We had to change the mapred.tasktracker.reduce.tasks.maximum variable; for us it made sense to double the number of reducers, since most of our jobs were simple and exported data to DynamoDB.

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-m,mapred.tasktracker.reduce.tasks.maximum=4"
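To confirm the new value took effect, you can print any Hadoop variable from the Hive CLI; running SET with a property name but no value echoes its current setting:

```
hive> SET mapred.tasktracker.reduce.tasks.maximum;
```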

In our Hive script we set the number of reducers to a prime number, as recommended in this post: Optimizing for Large EMR Clusters

-- Force more reducers to write quicker to DynamoDB;
-- should not exceed our write capacity
SET mapred.reduce.tasks=11;
SET hive.exec.reducers.max=11;

-- tell EMR to use 100% of the available write throughput (by default it will use 50%). 
-- Note that this can adversely affect the performance of other applications simultaneously using DynamoDB
SET dynamodb.throughput.write.percent=1.0;
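As a back-of-the-envelope check (the 1100 WCU figure below is a made-up example, not a number from this post), the write throughput each reducer can consume is roughly the provisioned capacity, times the throughput percent, divided by the number of reducers:

```shell
# Hypothetical numbers: 1100 provisioned WCU, 100% throughput
# (dynamodb.throughput.write.percent=1.0), 11 reducers.
WRITE_CAPACITY=1100
THROUGHPUT_PERCENT=100
REDUCERS=11
PER_REDUCER=$(( WRITE_CAPACITY * THROUGHPUT_PERCENT / 100 / REDUCERS ))
# Each reducer gets roughly this many writes/second to work with.
echo "$PER_REDUCER"
```

If that per-reducer share is far below what a single reducer can push, adding reducers (up to the capacity limit) is what lets the job actually saturate the table's write capacity.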

Protecting resources

Hadoop uses Speculative Execution to gain performance, starting several duplicate map or reduce tasks in parallel and keeping the one that finishes first.

This behavior is only desirable if your mappers and reducers are not connecting to external resources that could starve, e.g. connecting directly to a MySQL data source.

To prevent this behavior:

hive> SET mapred.map.tasks.speculative.execution=false;
hive> SET mapred.reduce.tasks.speculative.execution=false;
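If you prefer to disable speculative execution cluster-wide rather than per script, the same configure-hadoop bootstrap action used earlier should accept these properties as well (a sketch, following the -m flag convention shown above):

```
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-m,mapred.map.tasks.speculative.execution=false,-m,mapred.reduce.tasks.speculative.execution=false"
```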

The source

Those ideas came from reading a lot of articles, more