Sometimes the dataset I had to handle just wasn’t that big, or the original data wasn’t available in any medium other than MySQL/RDS. To link Hive to other storage systems, a StorageHandler is necessary.
After some digging, the best library I was able to find came from the WSO2 Project: hive_0.8.1.wso2v8.jar. Once you download the JAR, include it in your Hive script or set it up as an external library:
ADD JAR /home/hadoop/lib/hive_0.8.1.wso2v8.jar;
ADD JAR /home/hadoop/lib/mysql-connector-java-5.1.25.jar;
LIST JARS;
Once the libraries are available to Hive, you can link a Hive external table to your real table with something like:
CREATE EXTERNAL TABLE `hivetest_facebook` (
  `page_id` STRING,
  `post_id` STRING,
  `created_time` BIGINT,
  `type` INT,
  `message` STRING,
  `permalink` STRING
)
STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
  'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
  'mapred.jdbc.url' = 'jdbc:mysql://localhost/hivetest',
  'mapred.jdbc.username' = 'hivetest',
  'mapred.jdbc.password' = 'mypass!',
  'mapred.jdbc.output.table.name' = 'hivetest_facebook',
  'hive.jdbc.update.on.duplicate' = 'true',
  'hive.jdbc.primary.key.fields' = 'page_id,created_time,post_id'
);
Now we are able to INSERT into MySQL:
INSERT INTO TABLE hivetest_facebook
SELECT page_id, post_id, created_time, type, message, permalink
FROM facebook_hdfs
WHERE created_time > 1392235405;
And to SELECT from MySQL, where we typically fetch information to JOIN with some other dataset:
SELECT *
FROM facebook_activity fb
INNER JOIN clients c ON fb.page_id = c.page_id
LIMIT 10;
Check out the complete list of TBLPROPERTIES configuration parameters:
| Parameter | Required | Description |
|---|---|---|
| mapred.jdbc.driver.class | Yes | The class name of the JDBC driver to use. This should be available on Hive's classpath. |
| mapred.jdbc.url | Yes | The connection URL for the database. |
| mapred.jdbc.username | No | The database username, if required. |
| mapred.jdbc.password | No | The database password, if required. |
| hive.jdbc.table.create.query | No | If the table already exists in the database, you don't need this. Otherwise, provide the SQL query for creating the table in the database. |
| mapred.jdbc.output.table.name | No | The name of the table in the database. It does not have to match the name of the Hive table. If you specified the SQL query for creating the table, the handler will pick the table name from that query. Otherwise, specify this when your meta table name differs from the database table name. |
| hive.jdbc.primary.key.fields | Yes | The primary keys of the database table, if it has any. |
| hive.jdbc.update.on.duplicate | No | Expected values are either "true" or "false". If "true", the storage handler will update records with duplicate keys. Otherwise it will insert all data. |
| hive.jdbc.output.upsert.query | No | Can be used to optimize the update operation. The default implementation issues an insert or update statement after a select statement, so there are two database round trips. This can be reduced to one by using a DB-specific upsert statement. An example query for MySQL is `INSERT INTO productSummary (product, itemsSold) VALUES (?,?) ON DUPLICATE KEY UPDATE itemsSold=?`. |
| hive.jdbc.upsert.query.values.order | No | Mandatory if you are using an upsert query. Sample values for the query above would be 'product,itemsSold,itemsSold' (the values order for each question mark). |
| hive.jdbc.input.columns.mapping | No | Mandatory if the field names in the meta table and the database table differ. Provide the database table's field names, comma-separated, in the same order as the meta table's field names. Example: productNames,noOfItemsSold will map to a meta table with product,itemsSold field names. |
| mapred.jdbc.input.table.name | No | Used when reading from a database table. Needed when the meta table name and the database table name differ. |
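To make the upsert optimization concrete, here is a sketch of a write-side table definition using `hive.jdbc.output.upsert.query` and `hive.jdbc.upsert.query.values.order`. The `productSummary` table and its `product`/`itemsSold` columns come from the example in the table above; the Hive table name `hivetest_summary` and the connection details are hypothetical, reused from the earlier example:

```sql
-- Hypothetical Hive table that writes to a MySQL productSummary table.
CREATE EXTERNAL TABLE `hivetest_summary` (
  `product` STRING,
  `itemsSold` INT
)
STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
  'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
  'mapred.jdbc.url' = 'jdbc:mysql://localhost/hivetest',
  'mapred.jdbc.username' = 'hivetest',
  'mapred.jdbc.password' = 'mypass!',
  'mapred.jdbc.output.table.name' = 'productSummary',
  'hive.jdbc.primary.key.fields' = 'product',
  -- DB-specific upsert: one round trip instead of select + insert/update
  'hive.jdbc.output.upsert.query' = 'INSERT INTO productSummary (product, itemsSold) VALUES (?,?) ON DUPLICATE KEY UPDATE itemsSold=?',
  -- one entry per "?" in the upsert query, in order
  'hive.jdbc.upsert.query.values.order' = 'product,itemsSold,itemsSold'
);
```

With this in place, each row written by the reducer costs a single database round trip instead of two.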
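The read-side parameters can be sketched the same way. This hypothetical table reads from a database table whose name and column names (`productNames`, `noOfItemsSold`, taken from the mapping example in the table above) differ from the Hive side; connection details are reused from the earlier example:

```sql
-- Hypothetical read-side table: the database columns are named
-- differently, so we map them positionally to the Hive columns.
CREATE EXTERNAL TABLE `hivetest_summary_in` (
  `product` STRING,
  `itemsSold` INT
)
STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
  'mapred.jdbc.driver.class' = 'com.mysql.jdbc.Driver',
  'mapred.jdbc.url' = 'jdbc:mysql://localhost/hivetest',
  'mapred.jdbc.username' = 'hivetest',
  'mapred.jdbc.password' = 'mypass!',
  -- database table name differs from the Hive table name
  'mapred.jdbc.input.table.name' = 'productSummary',
  -- database column names, in the same order as the Hive columns above
  'hive.jdbc.input.columns.mapping' = 'productNames,noOfItemsSold'
);
```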
A note about Costs vs. Performance
Some preliminary tests we ran with AWS DynamoDB showed that costs can grow really fast, especially if you have to write a lot to Dynamo and the data isn’t easy to distribute, i.e., there is little variation in the hash key.
It doesn’t help performance that the DynamoDB StorageHandler doesn’t support batch operations, forcing the reducer to apply items one by one.
In this scenario we found it was cheaper to just go with RDS.
I’m not saying RDS can replace DynamoDB, because they are totally different beasts, but in several cases the resulting dataset fits perfectly well in a SQL database. I agree with Chris Stucchio on most of his arguments: Don’t use Hadoop - your data isn’t that big.
We still plan to learn more about a couple of StorageHandlers that output to Elasticsearch and Cassandra.