Making Hadoop 2.6 + Spark-Cassandra driver play nice together

We have been using Spark's standalone deploy mode for more than a year now, but recently I tried Azure’s HDInsight, which runs on Hadoop 2.6 (YARN deploy mode).

After provisioning the servers, all the small tests worked fine: I was able to run spark-shell and read from and write to Blob Storage. Then I tried to write to a DataStax Cassandra cluster, which consistently returned an error message: Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {10.0.1.4}:9042

Continue Reading »

Hadoop and small files

Hadoop is really bad at handling small files: the framework is heavyweight and was not designed to work with them.

We process social signals, which means tons of small JSON files from the Twitter, Facebook, and Google Plus APIs. To improve overall performance, we mainly use two techniques:

Continue Reading »