Recently I built an environment to help me teach Apache Spark. My initial thought was to use Docker, but I ran into some issues, especially on older machines, so to avoid further blockers I decided to build a Vagrant image instead and to complement the package with Apache Zeppelin as the UI.
This Vagrant image is built on Debian Jessie, with Oracle Java, Apache Spark 1.4.1, and Zeppelin (built from source).
There are a few steps to build this VM:
1- Follow these instructions to install Vagrant. After completing the installation, check your Vagrant version; it must be >= 1.5:
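A quick way to verify the installed version from a terminal:

```shell
# Print the installed Vagrant version; it should report 1.5 or newer.
vagrant --version
```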
2- Clone this repository:
3- Run vagrant to build the image:
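Steps 2 and 3 together look roughly like this; the repository URL and directory name below are placeholders, use the ones from this post:

```shell
# Clone the repository with the Vagrantfile (URL is a placeholder)
git clone <repository-url> spark-vagrant
cd spark-vagrant

# Build and boot the VM; provisioning downloads Spark and builds Zeppelin
vagrant up
```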
Depending on your internet speed, the build can take up to an hour. It takes so long because there is no binary distribution of Zeppelin, so all of its dependencies have to be downloaded and it has to be built from source. Be patient, I'm sure you will enjoy the final result.
User Interfaces Available
This VM exposes Spark UI (port 4040):
And the Zeppelin UI (port 8080), with several screenshots available below:
When your Vagrant build is finished, you can connect directly to Zeppelin's UI and check out a few notebooks.
Analyzing Access Logs
This example uses Spark Core + DataFrames (Spark SQL) to analyze the access logs (Apache Web Server logs) released by Andy Baio. I downloaded the original dataset and split it into monthly files, which increases the parallelism available to Spark and also makes for a more realistic example. You can have fun recreating the analysis Andy Baio did ;)
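The parsing step of a notebook like this can be sketched as follows; this is not the original notebook's code, and the case-class and field names are my own choice. The regex covers Apache's Common Log Format:

```scala
// Minimal sketch of parsing access-log lines before loading them into
// a DataFrame. Matches the Common Log Format, e.g.:
//   127.0.0.1 - - [10/Oct/2015:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
case class LogEntry(ip: String, timestamp: String, method: String,
                    path: String, status: Int, bytes: Long)

val logPattern =
  """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+).*""".r

def parseLine(line: String): Option[LogEntry] = line match {
  case logPattern(ip, ts, method, path, status, bytes) =>
    Some(LogEntry(ip, ts, method, path, status.toInt,
                  if (bytes == "-") 0L else bytes.toLong))
  case _ => None // skip malformed lines instead of failing the job
}
```

Inside Zeppelin you could then feed the parsed entries into a DataFrame, e.g. `sc.textFile("...").flatMap(parseLine).toDF()`, and query it with SQL.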
Consuming Tweets in Realtime with Spark Streaming
In order to run this sample, you'll need Twitter application credentials; you can create your own at http://apps.twitter.com.
Create your application:
On the Keys and Access Tokens tab, press the Create my access token button:
Copy all keys:
And paste them into the notebook:
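Spark's Twitter receiver is built on twitter4j, which reads its OAuth credentials from system properties, so the paste step boils down to something like this sketch (the placeholder values are where your copied keys go):

```scala
// Hand the Twitter credentials to twitter4j via system properties.
// Replace the placeholder values with the keys copied from apps.twitter.com.
Map(
  "twitter4j.oauth.consumerKey"       -> "YOUR_CONSUMER_KEY",
  "twitter4j.oauth.consumerSecret"    -> "YOUR_CONSUMER_SECRET",
  "twitter4j.oauth.accessToken"       -> "YOUR_ACCESS_TOKEN",
  "twitter4j.oauth.accessTokenSecret" -> "YOUR_ACCESS_TOKEN_SECRET"
).foreach { case (key, value) => System.setProperty(key, value) }
```

With the properties set, `TwitterUtils.createStream(ssc, None)` from `spark-streaming-twitter` can open the stream without passing credentials explicitly.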
A couple of times when I tried to provision Zeppelin, Maven failed to build due to a dependency that could not be downloaded. You can finish the build manually; just log into the box and run the commands below:
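A sketch of what that manual build looks like; the Zeppelin path and the Maven profile below are assumptions, adjust them to wherever the provisioner cloned Zeppelin on your box:

```shell
# Log into the VM
vagrant ssh

# Go to the Zeppelin sources (path is an assumption)
cd /usr/local/zeppelin

# Retry the build; the spark profile flag is an assumption for Spark 1.4
mvn clean package -Pspark-1.4 -DskipTests
```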
When the build finishes successfully, you can start Zeppelin manually:
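Zeppelin ships with a daemon script for this; the install path below is an assumption:

```shell
# Start the Zeppelin daemon from its install directory
cd /usr/local/zeppelin
bin/zeppelin-daemon.sh start
```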
Restarting Streaming/Registering Jars
If you need to restart the streaming context or register new external dependencies (jar files), you will get an error from Zeppelin. AFAIK there is no way to reset the context from the UI, so you'll need to log into the box and restart it:
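Restarting the daemon recreates the Spark context; again, the install path is an assumption:

```shell
# Bounce Zeppelin so it gets a fresh Spark context
vagrant ssh
cd /usr/local/zeppelin
bin/zeppelin-daemon.sh restart
```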
Zeppelin is really cool for visualizations and for throwing some SQL at the data to see what happens, but when I have to write more than a few lines of code I miss the autocomplete feature and the speed of a console; that's where spark-shell comes to the rescue.
Log into the box and run spark-shell from the console; the example below also loads some external libraries into the context (joda-time in this sample).
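spark-shell accepts a `--jars` flag for adding external libraries to the context; the jar path below is hypothetical, point it at the joda-time jar on your box:

```shell
# Launch spark-shell with joda-time on the classpath (jar path is hypothetical)
spark-shell --jars /opt/jars/joda-time-2.8.2.jar
```

Once inside the shell, `import org.joda.time.DateTime` should resolve against the loaded jar.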
Let’s build a better version together!
I would like to add more notebook examples, but I lack the time to build meaningful ML examples. If you would like to share your notebooks or datasets, please send me a pull request and let's build a better tool for the community together!