Mutable Ideas

Notes and ideas about Java, Scala, Big Data, NoSQL, Quality and Software Deploy

Vagrant + Spark + Zeppelin: a Toolbox for the Data Analyst (or Data Scientist)

Recently I built an environment to help me teach Apache Spark. My initial thought was to use Docker, but I ran into some issues, especially on older machines, so to avoid further blockers I decided to build a Vagrant image instead, and to complement the package with Apache Zeppelin as the UI. This Vagrant box is built on Debian Jessie, with Oracle Java, Apache Spark 1.4.1, and Zeppelin (built from the master branch).

Getting Started

There are a few steps to build this VM:

1- Follow these instructions to install Vagrant. After completing the installation, check your Vagrant version; version 1.5 or later is required:

$ vagrant version
Installed Version: 1.7.4
Latest Version: 1.7.4

2- Clone this repository:

$ git clone https://github.com/arjones/vagrant-spark-zeppelin.git

3- Run vagrant to build the image:

$ cd vagrant-spark-zeppelin
$ vagrant up

Depending on your internet speed, the build can take up to an hour. It takes this long because there is no binary distribution of Zeppelin, so all dependencies have to be downloaded and it must be built from source. Be patient, I’m sure you will enjoy the final result.

User Interfaces Available

This VM exposes Spark UI (port 4040):

Spark UI DAG Viewer - Join Job

And Zeppelin UI (port 8080), with several screenshots available below:

Running examples

When your Vagrant build is finished, you can open Zeppelin’s UI directly and explore a few notebooks.

Analyzing Access Logs

zeppelin access log example

This example uses Spark Core + DataFrames (SparkSQL) to analyze access logs (Apache Web Server logs) released by Andy Baio. I downloaded the original dataset and split it into monthly files, which increases the parallelism available to Spark and also makes for a more realistic example. You can have fun recreating the same analysis Andy Baio did ;)
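As a rough sketch of the kind of line parsing such a notebook performs before handing rows to SparkSQL (the notebook itself runs in Scala inside Zeppelin; this standalone Java version only illustrates the Combined Log Format structure, and the class and field names are my own):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for Apache combined-log lines, similar in spirit to the
// map step an access-log notebook runs before registering a DataFrame.
public class AccessLogParser {
    public static class LogRecord {
        public final String ip, timestamp, method, path;
        public final int status;
        public final long bytes;
        public LogRecord(String ip, String timestamp, String method,
                         String path, int status, long bytes) {
            this.ip = ip; this.timestamp = timestamp; this.method = method;
            this.path = path; this.status = status; this.bytes = bytes;
        }
    }

    // Combined Log Format: ip - - [timestamp] "METHOD path PROTO" status bytes "referer" "agent"
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) \\S+\" (\\d{3}) (\\d+|-).*");

    public static Optional<LogRecord> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return Optional.empty();
        // A "-" byte count means no body was sent; treat it as zero.
        long bytes = m.group(6).equals("-") ? 0L : Long.parseLong(m.group(6));
        return Optional.of(new LogRecord(m.group(1), m.group(2), m.group(3),
                m.group(4), Integer.parseInt(m.group(5)), bytes));
    }
}
```

Lines that don’t match the pattern (corrupted entries are common in real logs) come back as an empty `Optional`, so a filter step can drop them before analysis.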

Consuming Tweets in Realtime with Spark Streaming

zeppelin spark streaming

This example is composed of two notebooks: the first connects to Twitter’s stream, and the second consumes the RDDs and displays the results.
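The per-batch work of the consumer side essentially boils down to pulling hashtags out of tweet text and counting them. Here is a Spark-free sketch of that logic (the names are mine, and the real notebook operates on streaming RDDs in Scala rather than on a plain list):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper mirroring the consumer notebook's per-batch work:
// extract hashtags from tweet text and tally how often each one appears.
public class HashtagCount {
    private static final Pattern TAG = Pattern.compile("#\\w+");

    // Lower-case first so "#Spark" and "#spark" count as the same tag.
    public static List<String> extract(String text) {
        Matcher m = TAG.matcher(text.toLowerCase());
        List<String> tags = new ArrayList<>();
        while (m.find()) tags.add(m.group());
        return tags;
    }

    public static Map<String, Long> count(List<String> tweets) {
        return tweets.stream()
                     .flatMap(t -> extract(t).stream())
                     .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }
}
```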

In order to run this sample, you’ll need Twitter application credentials; you must create your own at http://apps.twitter.com.

Create your application:

Twitter's Create an App

On the tab Keys and Access Tokens press the Create my access token button:

Twitter's Api Keys

Copy all keys:

Twitter's Access Tokens

And paste them to the notebook:

Twitter's Access Tokens

Troubleshooting

Build Problem

A couple of times when I tried to provision Zeppelin, Maven failed to build because some dependency could not be downloaded. You can finish the build manually: just log in to the box and run the commands below:

$ vagrant ssh

vagrant@debian-jessie:~$ sudo su
root@debian-jessie:/root# cd /usr/zeppelin/
root@debian-jessie:/usr/zeppelin# export MAVEN_VERSION=3.3.1
root@debian-jessie:/usr/zeppelin# export MAVEN_HOME=/usr/apache-maven-$MAVEN_VERSION
root@debian-jessie:/usr/zeppelin# ${MAVEN_HOME}/bin/mvn clean package \
  -Pspark-1.4 \
  -Dhadoop.version=2.4.0 \
  -Phadoop-2.4 -DskipTests

Once the build finishes successfully, you can start Zeppelin manually:

vagrant@debian-jessie:~$ sudo /etc/init.d/zeppelin-daemon.sh start

Restarting Streaming/Registering Jars

If you need to restart the streaming context or register new external dependencies (jar files), you will get an error from Zeppelin. AFAIK there is no way to reset the context from the UI, so you’ll need to log in to the box and restart it:

$ vagrant ssh
vagrant@debian-jessie:~$ sudo /etc/init.d/zeppelin-daemon.sh restart

Using Spark-Shell

Zeppelin is really cool for visualizations and for throwing some SQL at the data to see what happens, but when I have to write more than a few lines of code I miss autocompletion and the speed of a console; that’s where spark-shell comes to the rescue. Log in to the box and run spark-shell from the console; the example below also loads some external libraries into the context (joda-time in this sample).

$ vagrant ssh
vagrant@debian-jessie:~$ spark-shell --packages 'joda-time:joda-time:2.8.2,org.joda:joda-convert:1.7'
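The joda-time packages loaded above are handy for parsing the timestamps in the access logs. As a dependency-free illustration of the same job, the JDK’s own java.time API (available since Java 8) can parse the access-log timestamp format without any `--packages` flag; this sketch is my own and is not taken from the VM’s notebooks:

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Parses the timestamp format used inside Apache access-log brackets,
// e.g. "01/Aug/2015:10:00:00 -0300". The Locale matters: month
// abbreviations like "Aug" are English.
public class LogTime {
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    public static OffsetDateTime parse(String ts) {
        return OffsetDateTime.parse(ts, FMT);
    }
}
```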

Let’s build a better version together!

I would like to add more example notebooks, but I lack the time to build meaningful ML examples. If you would like to share your notebooks or datasets, please send me a pull request and let’s build a better tool for the community together!
