
Mutable Ideas

Tag: spark

Installing Datastax Analytics (Cassandra and Spark) with Azure Templates

Last week I had the opportunity to share Socialmetrix's experience installing and configuring Datastax Analytics clusters on Azure. Datastax offers a commercial solution as a single bundle containing Cassandra, Spark and Solr, already integrated. The talks were given at the Argentina Big Data Meetup, hosted by Jampp, and at the Nardoz Meetup, hosted by Medallia.

Making Hadoop 2.6 + Spark-Cassandra driver play nice together

We had been using the Spark Standalone deployment for more than a year, but recently I tried Azure's HDInsight, which runs on Hadoop 2.6 (YARN deployment).

After provisioning the servers, all the small tests worked fine: I was able to run Spark-Shell and read and write to Blob Storage, until I tried to write to a Datastax Cassandra cluster, which constantly returned the error message: Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {10.0.1.4}:9042
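For context, here is a minimal sketch of the kind of write that triggers that connection attempt, using the Spark-Cassandra connector. The keyspace, table and column names are hypothetical placeholders, and the host simply mirrors the address in the error above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Sketch of a Spark job writing to Cassandra via the Spark-Cassandra connector.
// Keyspace/table/columns below are hypothetical; the host mirrors the error log.
object CassandraWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-write-test")
      .set("spark.cassandra.connection.host", "10.0.1.4")

    val sc = new SparkContext(conf)

    // A few sample rows: (id, payload)
    val rows = sc.parallelize(Seq((1, "first"), (2, "second")))

    // The connector opens a native connection to Cassandra at this point;
    // on the HDInsight/YARN setup described above, this is where the
    // "Failed to open native connection" exception surfaced.
    rows.saveToCassandra("metrics", "events", SomeColumns("id", "payload"))

    sc.stop()
  }
}
```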

Reading compressed data with Spark using unknown file extensions

This post could also be called Reading .gz.tmp files with Spark. At Socialmetrix we have several pipelines writing logs to AWS S3; sometimes Apache Flume fails in the last phase, renaming the final archive from .gz.tmp to .gz, so those files cannot be read through the SparkContext.textFile API. This post presents our workaround to process those files.
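One possible workaround, sketched below (not necessarily the exact approach from the post): since textFile chooses the decompression codec by file extension, read the files as binary and decompress them explicitly, ignoring the .gz.tmp suffix. The S3 path is a hypothetical placeholder.

```scala
import java.util.zip.GZIPInputStream
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: read gzip-compressed files whose extension Spark does not recognize
// (.gz.tmp) by loading them as binary streams and decompressing manually.
object ReadGzTmp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-gz-tmp"))

    // Hypothetical bucket/path; each matched file becomes (path, stream).
    val lines = sc.binaryFiles("s3n://my-bucket/flume-logs/*.gz.tmp")
      .flatMap { case (_, stream) =>
        // Decompress the stream ourselves instead of relying on the extension.
        // Note: this materializes each file's lines in memory, so it suits
        // reasonably sized log archives.
        val in = new GZIPInputStream(stream.open())
        Source.fromInputStream(in, "UTF-8").getLines().toList
      }

    println(s"Recovered ${lines.count()} log lines")
    sc.stop()
  }
}
```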

Vagrant + Spark + Zeppelin a toolbox to the Data Analyst (or Data Scientist)

Recently I built an environment to help me teach Apache Spark. My initial thought was to use Docker, but I ran into some issues, especially on older machines, so to avoid further blockers I decided to build a Vagrant image and complement the package with Apache Zeppelin as the UI. The Vagrant box is built on Debian Jessie, with Oracle Java, Apache Spark 1.4.1 and Zeppelin (built from the master branch).