# Integrating MySQL/RDS with Hive

Sometimes the dataset I had to handle just wasn't that big, and the original data wasn't available anywhere but MySQL/RDS. To link Hive to other storage systems, you need a StorageHandler.

After some digging, the best library I was able to find came from the WSO2 project: hive_0.8.1.wso2v8.jar. Once you have downloaded the JAR, include it in your Hive script or set it up as an external lib:

ADD JAR /home/hadoop/lib/hive_0.8.1.wso2v8.jar;
ADD JAR /home/hadoop/lib/mysql-connector-java-5.1.25.jar;

LIST JARS;

Once the libraries are available to Hive, you can link a Hive external table to your actual MySQL table with something like:

CREATE EXTERNAL TABLE `hivetest_facebook` (
  `page_id` STRING,
  `post_id` STRING,
  `created_time` BIGINT,
  `type` INT,
  `message` STRING,
  `permalink` STRING
) STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
  'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
  'mapred.jdbc.url' = 'jdbc:mysql://localhost/hivetest',
  'mapred.jdbc.username' = 'hivetest',
  'mapred.jdbc.password' = 'mypass!',
  'mapred.jdbc.output.table.name' = 'hivetest_facebook',
  'hive.jdbc.update.on.duplicate'= 'true',
  'hive.jdbc.primary.key.fields' = 'page_id,created_time,post_id'
);
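For reference, the MySQL side of this mapping could look roughly like the sketch below. This is an assumption based on the Hive schema above (the column types and sizes are guesses); the key point is that the primary key matches 'hive.jdbc.primary.key.fields'.

-- Hypothetical MySQL DDL matching the Hive table above (types are assumptions)
CREATE TABLE hivetest_facebook (
  page_id      VARCHAR(64)  NOT NULL,
  post_id      VARCHAR(64)  NOT NULL,
  created_time BIGINT       NOT NULL,
  type         INT,
  message      TEXT,
  permalink    VARCHAR(255),
  PRIMARY KEY (page_id, created_time, post_id)
);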

Now we can INSERT into MySQL:

INSERT INTO TABLE hivetest_facebook 
SELECT page_id, post_id, created_time, type, message, permalink
FROM facebook_hdfs
WHERE created_time > 1392235405;
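Here facebook_hdfs is just a regular Hive table over the raw data on HDFS/S3. A minimal sketch of what it might look like, assuming tab-delimited files (the location and delimiter are hypothetical):

-- Hypothetical source table over raw files; adjust format and location
CREATE EXTERNAL TABLE facebook_hdfs (
  page_id STRING,
  post_id STRING,
  created_time BIGINT,
  type INT,
  message STRING,
  permalink STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://mybucket/facebook/';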

Or SELECT from MySQL, which is typically where we pull information to JOIN against some other dataset:

SELECT * 
FROM facebook_activity fb 
INNER JOIN clients c ON fb.page_id = c.page_id
LIMIT 10;
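In a setup like this, clients can itself be another JDBC-backed external table that is only read from. A minimal sketch, assuming a clients table with a page_id column exists in the same MySQL database (the table and its columns are hypothetical):

-- Hypothetical read-side table backed by MySQL
CREATE EXTERNAL TABLE clients (
  `page_id` STRING,
  `client_name` STRING
) STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
  'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
  'mapred.jdbc.url' = 'jdbc:mysql://localhost/hivetest',
  'mapred.jdbc.username' = 'hivetest',
  'mapred.jdbc.password' = 'mypass!',
  'hive.jdbc.primary.key.fields' = 'page_id',
  'mapred.jdbc.input.table.name' = 'clients'
);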

Check out the complete list of TBLPROPERTIES configuration parameters:

| Property name | Required | Detail |
| --- | --- | --- |
| `mapred.jdbc.driver.class` | Yes | The class name of the JDBC driver to use. It must be available on Hive's classpath. |
| `mapred.jdbc.url` | Yes | The connection URL for the database. |
| `mapred.jdbc.username` | No | The database username, if required. |
| `mapred.jdbc.password` | No | The database password, if required. |
| `hive.jdbc.table.create.query` | No | Not needed if the table already exists in the database. Otherwise, provide the SQL query that creates the table. |
| `mapred.jdbc.output.table.name` | No | The name of the table in the database. It does not have to match the Hive table name. If you specified a table-creation query, the handler picks the table name from that query; otherwise, specify this when the meta table name differs from the database table name. |
| `hive.jdbc.primary.key.fields` | Yes | The primary key fields of the database table, if it has any. |
| `hive.jdbc.update.on.duplicate` | No | Either "true" or "false". If "true", the storage handler updates records with duplicate keys; otherwise it inserts all data. |
| `hive.jdbc.output.upsert.query` | No | Can be used to optimize update operations. The default implementation runs an insert or update statement after a select, costing two database round trips; a DB-specific upsert statement reduces this to one. An example for MySQL: `INSERT INTO productSummary (product, itemsSold) VALUES (?,?) ON DUPLICATE KEY UPDATE itemsSold=?` (see the sketch after this table). |
| `hive.jdbc.upsert.query.values.order` | No | Mandatory if you use an upsert query. A sample value for the query above is `product,itemsSold,itemsSold` (the value order for each question mark). |
| `hive.jdbc.input.columns.mapping` | No | Mandatory if the field names in the meta table and the database table differ. Provide the database field names, comma-separated, in the same order as the meta table fields. Example: `productNames,noOfItemsSold` maps to meta table fields `product,itemsSold`. |
| `mapred.jdbc.input.table.name` | No | Used when reading from a database table. Needed if the meta table name and the database table name differ. |
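To see how the last few properties fit together, here is a sketch of the upsert optimization using the productSummary example from the table above. Everything except the upsert query and values order (which come from the documentation above) is an assumption:

-- Hypothetical table wiring up the one-round-trip upsert optimization
CREATE EXTERNAL TABLE productSummary (
  `product` STRING,
  `itemsSold` INT
) STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
  'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
  'mapred.jdbc.url' = 'jdbc:mysql://localhost/hivetest',
  'mapred.jdbc.username' = 'hivetest',
  'mapred.jdbc.password' = 'mypass!',
  'mapred.jdbc.output.table.name' = 'productSummary',
  'hive.jdbc.primary.key.fields' = 'product',
  'hive.jdbc.update.on.duplicate' = 'true',
  'hive.jdbc.output.upsert.query' = 'INSERT INTO productSummary (product, itemsSold) VALUES (?,?) ON DUPLICATE KEY UPDATE itemsSold=?',
  'hive.jdbc.upsert.query.values.order' = 'product,itemsSold,itemsSold'
);

With this in place, the handler issues a single round trip per row instead of a select followed by an insert or update.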

## A note about Costs vs. Performance

Some preliminary tests we ran with AWS DynamoDB showed that costs can grow really fast, especially if you have to write a lot to Dynamo and the data doesn't distribute well, i.e. there is little variation in the HashKey.

Performance doesn't fare any better: the DynamoDB StorageHandler doesn't support batch operations, forcing the reducers to write items one by one.

In this scenario, we found it was cheaper to just go with RDS.

I'm not saying RDS can replace DynamoDB, because they are totally different beasts, but in several cases the resulting dataset fits perfectly well in a SQL database. I agree with most of Chris Stucchio's arguments in "Don't use Hadoop - your data isn't that big".

We still plan to learn more about a couple of StorageHandlers that write to ElasticSearch and Cassandra.

## References

JDBC Storage Handler for Hive