Mutable Ideas

Notes and ideas about Java, Scala, Big Data, NoSQL, Quality and Software Deploy

Quick Tips & Tricks I Learned Working With Spark

A small collection of tips & tricks I have learned working with Spark so far; I hope they help you as well. If you have more tricks, please let me know!

Tip 1 - Importing dependencies on Spark-Shell

Most of the time, I want to use spark-shell with my project dependencies, whether that is a Cassandra driver, case classes, or helper methods. I have learned the best way is just to include the whole fat jar (I’m using sbt-assembly) containing the driver program and all its dependencies.

$ spark-shell --jars target/scala-2.10/SparkCassandra-assembly-0.1-SNAPSHOT.jar
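For reference, a fat jar like the one above can be produced with the sbt-assembly plugin. Here is a minimal sketch of the build wiring; the plugin and library versions and the project name are assumptions for illustration, not my actual build:

```scala
// project/plugins.sbt (sketch; plugin version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt (names and versions are illustrative)
import AssemblyKeys._

assemblySettings

name := "SparkCassandra"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // mark Spark itself as "provided": spark-shell already ships these
  // classes, so they can stay out of the fat jar
  "org.apache.spark" %% "spark-core" % "1.0.2" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.0.0"
)
```

Running `sbt assembly` then produces the single jar under `target/scala-2.10/` that the command above passes to `--jars`.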

Tip 2 - Connecting to Cassandra from Spark-Shell

This trick is based on a small piece of the awesome post Installing the Cassandra / Spark OSS Stack by @AlTobey. With it, I can connect to my Cassandra cluster from the Spark shell:

Note: it requires you to bundle the Spark Cassandra Connector inside your driver project.

$ spark-shell --jars target/scala-2.10/SparkCassandra-assembly-0.1-SNAPSHOT.jar
// stop the provided SparkContext; we need to create
// another one with C* driver support
sc.stop()

import org.apache.spark._
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._

// Configure the new Context
val conf = new SparkConf().
  set("spark.cassandra.connection.host", "localhost"). // supports username/password
  setAppName("My SHELL"). // displayed in the Spark UI
  set("spark.ui.port", "4041") // avoid the port warning if another shell/job is running locally

// Create the new SparkContext
val sc = new SparkContext(conf)

// Import models
import _root_.io.smx.ananke.spark.model._

val tweets = sc.cassandraTable[Tweet]("db", "tw_tweet")
tweets.first
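The `Tweet` type above is a model case class from my project. As a hypothetical sketch (all field names here are assumptions, not the real schema), such a model could look like this; the connector maps Cassandra columns to case class fields by name, with snake_case columns matching camelCase fields:

```scala
import java.util.{Date, UUID}

// Hypothetical model for the db.tw_tweet table; field names are
// illustrative. The connector would map columns such as user_name
// and posted_at to the camelCase fields below.
case class Tweet(id: UUID, userName: String, body: String, postedAt: Date)
```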

Tip 3 - Making Spark-Shell quieter

spark-shell is too verbose by default; I prefer a quieter version, so I’ve changed my log4j level from INFO to WARN and it’s much more pleasant now. I installed Spark with Homebrew, so my configuration is located here:

/usr/local/Cellar/apache-spark/1.0.2/libexec/conf/log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1} %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN

Tip 4 - Avoiding namespace clashes

This is actually a Scala trick. We use io.smx as our internal namespace, and Spark has a package named org.apache.spark.io, so after import org.apache.spark._ the bare name io resolves to Spark’s package and clashes with ours. Prefixing the import with _root_ forces the compiler to resolve the path from the top-level package:

import _root_.io.smx.ananke.spark.model._
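The same shadowing can be reproduced with any top-level package. A minimal, self-contained sketch, using java.io as a stand-in for io.smx:

```scala
object RootImportDemo {
  // After `import org.apache.spark._`, the bare name `io` resolves to
  // org.apache.spark.io and shadows every top-level io.* package
  // (including our own io.smx). Prefixing the path with _root_ makes
  // the reference absolute, so it stays unambiguous:
  import _root_.java.io.File // resolved from the root package

  def tmpDir: File = new File("/tmp")
}
```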

Tip 5 - Running initialization commands on Spark-Shell, similar to a .bashrc

When I’m working on a project, I need the same initialization code every time I open the Spark shell, for example connecting the Spark-Cassandra driver. To achieve that, I just use the -i switch:

$ spark-shell -i ~/shell-init.scala --jars /projects/spark-cassandra-driver.jar

This is my shell-init.scala file:

// stop the provided SparkContext; we need to create
// another one with C* driver support
sc.stop()

import org.apache.spark._
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._

// Configure the new Context
val conf = new SparkConf().
  set("spark.cassandra.connection.host", "localhost"). // supports username/password
  setAppName("Spark-C* Shell"). // displayed in the Spark UI
  set("spark.ui.port", "4041") // changing the default UI port, to avoid the warning if another shell/job is running locally

// Create the new SparkContext
val sc = new SparkContext(conf)

Do you know something that makes working with Spark even more pleasant? Get in touch and let’s improve this list!