A data science toolkit inside a docker image, build it once, run everywhere

If you never heard about Jupyter Notebook, I highly recommend you to check it out. It have been my primary platform to build reports and data driven case studies. On this post I’d like to show how I create a simple and isolated environment with a Bash script and Docker to run JupyterLab. Recently Jupyter Notebook received a major overhauling and become JupyterLab - currently in beta, but the new platform looks fresh and very powerful. Continue Reading »

Reasons to fall in love for Postgres

I’ve been working on analytics/big data field for 10+ years, during this time I’ve been working mostly with MySQL, MongoDB, Redis and Cassandra. Just a couple of years ago I started to really pay attention to Postgres, and my regret is not getting into it earlier… On this post I try to enumerate a few features I’m using and why I think you should try it too, before jumping into the architectural and operational complexity of multiple NoSQL. Continue Reading »

Instalando Datastax Analytics (Cassandra y Spark) con Azure Templates

La última semana tuve la oportunidad de contar la experiencia de Socialmetrix instalando y configurando clusters de Datastax Analytics en Azure. Datastax brinda una solución comercial en un bundle, conteniendo Cassandra, Spark y Solr integrados. Las charlas se dieron en Argentina Big Data Meetup. Hosted by Jampp y el Nardoz Meetup. Hosted by Medallia

Continue Reading »

Where to Find Datasets to Learn Big Data & Data Science

Sometimes you just need data to learn how a algorithm works, to run a stress test or just to have a excuse to spin up several machines in a cluster and see how it crush the data. More often than not, it is incredibly hard to obtain data, and a few colleagues I’ve talked about had similar problem, so this post is a collection of links and references for datasets I know have been open source. Please contribute =)

Continue Reading »

Making Hadoop 2.6 + Spark-Cassandra driver play nice together

We have been using Spark Standalone deploy for more than one year now, but recently I tried to use Azure’s HDInsight which runs on Hadoop 2.6 (YARN deploy).

After provisioning the servers, all small tests worked fine, I have been able to run Spark-Shell, read and write to Blob Storage, until I tried to write to Datastax Cassandra cluster which constantly returned a error message: Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {10.0.1.4}:9042

Continue Reading »

Vagrant + Spark + Zeppelin a toolbox to the Data Analyst (or Data Scientist)

Recently I built an environment to help me to teach Apache Spark, my initial thoughts were to use Docker but I found some issues specially when using older machines, so to avoid more blockers I decided to build a Vagrant image and also complement the package with Apache Zeppelin as UI. This Vagrant will build on Debian Jessie, with Oracle Java, Apache Spark 1.4.1 and Zeppelin (from the master branch).

Continue Reading »