
# Quick tips & tricks I learned working with Spark

A small collection of tips & tricks I've learned working with Spark so far; I hope it helps you as well. If you have more tricks, please let me know!

## Tip 1 - Importing dependencies into Spark-Shell

Most of the time I want to use spark-shell with my project dependencies, whether that's the Cassandra driver, a case class, or a helper method. I've learned the best way is to simply import the whole fat jar (I'm using sbt-assembly) containing the driver program and all of its dependencies.

$ spark-shell --jars target/scala-2.10/SparkCassandra-assembly-0.1-SNAPSHOT.jar
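
If you haven't built the fat jar yet, a single run of sbt-assembly produces it (this assumes the sbt-assembly plugin is already configured in your build):

$ sbt assembly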

## Tip 2 - Connecting to Cassandra from Spark-Shell

This trick is based on a small piece of the awesome post Installing the Cassandra / Spark OSS Stack by @AlTobey. With it, I can connect to my Cassandra cluster from the Spark shell:

Note: this requires that you bundle the Spark Cassandra Connector inside your driver project.

$ spark-shell --jars target/scala-2.10/SparkCassandra-assembly-0.1-SNAPSHOT.jar
// stop the provided SparkContext; we need to create
// another one with C* driver support
sc.stop

import org.apache.spark._
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._

// Configure the new Context
val conf = new SparkConf().
  set("spark.cassandra.connection.host", "localhost"). // auth goes in spark.cassandra.auth.username/password
  setAppName("My SHELL"). // displayed in the Spark UI
  set("spark.ui.port", "4041") // avoids the port warning if another shell/job is running locally

// Create the new SparkContext
val sc = new SparkContext(conf)

// Import models
import _root_.io.smx.ananke.spark.model._

val tweets = sc.cassandraTable[Tweet]("db", "tw_tweet")
tweets.first
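
For reference, the connector maps Cassandra columns to case-class fields by name, so the Tweet model would look something like this (a hypothetical sketch; the real one lives in io.smx.ananke.spark.model):

// hypothetical model: field names must match the columns of db.tw_tweet
case class Tweet(id: Long, user: String, body: String)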

## Tip 3 - Making Spark-Shell quieter

spark-shell is too verbose for my taste; I prefer a quieter version, so I've changed my log4j root level from INFO to WARN and it's much more pleasant now. I installed Spark with Homebrew, so my configuration is located here:

/usr/local/Cellar/apache-spark/1.0.2/libexec/conf/log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1} %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN
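
If you'd rather not edit the properties file, you can also quiet a running shell from inside the session with the plain log4j API (a per-session alternative, not a Spark feature):

import org.apache.log4j.{Level, Logger}

// raise the root logger threshold for this shell session only
Logger.getRootLogger.setLevel(Level.WARN)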

## Tip 4 - Avoiding namespace clashes

This is actually a Scala trick. We use io.smx as our internal namespace, and Spark has a package org.apache.spark.io, so after a wildcard import of org.apache.spark the name io resolves to Spark's package and clashes with mine. Prefixing the import with _root_ forces Scala to resolve the path from the root package:

import _root_.io.smx.ananke.spark.model._
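
To make the clash concrete, here is a minimal sketch of what happens in the shell:

import org.apache.spark._ // now the name `io` resolves to org.apache.spark.io

// import io.smx.ananke.spark.model._ // fails: smx is not a member of org.apache.spark.io
import _root_.io.smx.ananke.spark.model._ // works: resolution starts at the root package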

## Tip 5 - Running initialization commands on Spark-Shell, similar to .bashrc

When I'm working on a project I need the same initialization code every time I open the Spark shell, for example connecting the Spark Cassandra driver. To achieve that, I just use the -i switch:

$ spark-shell -i ~/shell-init.scala --jars /projects/spark-cassandra-driver.jar

This is my shell-init.scala file:

// stop the provided SparkContext; we need to create
// another one with C* driver support
sc.stop

import org.apache.spark._
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._

// Configure the new Context
val conf = new SparkConf().
  set("spark.cassandra.connection.host", "localhost"). // auth goes in spark.cassandra.auth.username/password
  setAppName("Spark-C* Shell"). // displayed in the Spark UI
  set("spark.ui.port", "4041") // changing the default UI port avoids the warning if another shell/job is running locally

// Create the new SparkContext
val sc = new SparkContext(conf)
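
To avoid retyping the full command, I keep a small alias in my shell profile (the name and paths are just illustrative):

alias spark-c='spark-shell -i ~/shell-init.scala --jars /projects/spark-cassandra-driver.jar'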

Do you know something that makes working with Spark even more pleasant? Get in touch and let's improve this list!