Spark Summit 2014 - Day 1 (Afternoon)

2014-06-30

https://blog.arjon.es/2014/spark-summit-2014-day-1-afternoon/

#spark
#data

Following the Spark Summit 2014 - Afternoon talks notes

# Performing Advanced Analytics on Relational Data with Spark SQL

Michael Armbrust (Databricks) Slides

Shark - uses hive optimizer which wasn’t designed to Spark Ending active development of Shark!
SchemaRDD <–> RDD provides an unified abstraction to work.

SchemaRDD using Scala supports nested Case Class to define tables!
You can leverage on the Hive efforts due to wrappers and translators
Spark SQL relies on Scala Reflection to improve performance
Spark SQL is many times faster Shark

## Demo:

# Using Spark Streaming for High Velocity Analytics on Cassandra

Albert P Tobey, Tupshin Harper (Datastax)

“Your house is burning down - 10 minutes ago” - a good example of RT data needs :)

# Applying the Lambda Architecture with Spark

Jim Scott (MapR Technologies)

Lambda Architecture:

Hadoop: Everyone try to make it solution for everything

You can implement Spark as Lambda Architecture a unified platform

Cisco/MapR implementation of LA. Used to analyze security threats

Spark Stack - use cases of Spark’s usage

# Testing Spark: Best Practices

Anupama Shetty, Neil Marshall (Ooyala)

Ooyala Application: Player Events -> Kafka -> Spark Log Processor (batch + stream) -> CDH5/HDFS

Spark Log Processor: Json/Thrift

## Testing pipeline setup

Watir for player’s usage simulation
Kafka+Zookeeper locally (vagrant)
Spark running on local mode

## Stress Testing

Gatling: Stress test tool: homepage

Run performance tests on Jenkins
Setup baselines for any performance tests w/ different scenarios & users
Document your hardware so you can have information to tweak this as you need

# ElasticSpark: Building 1000 node elastic Spark clusters on Amazon Elastic MapReduce

Manjeet Chayel (Amazon Web Services)

Integrated to AWS environment:

Bootstrap actions to install Spark on EMR (currently Spark 0.8 supported)
Spark logs are shipped to S3
Easy launch, using elastic-mapreduce command line tool
Run Spark and Shark on Amazon Elastic MapReduce

# Streamlining Search Indexing using Elastic Search and Spark

Holden Karau (Databricks) - @holdenkarau

Source code: Elastic Search on Spark

Spotting the differences between offline and online indexing are hard
Writing data to ElasticSearch must use ESOutputFormat for Hadoop and then call myRDD.saveAsHadoopDataset(jobconf)

# Easy JSON Data Manipulation in Spark

Yin Huai (Databricks)

Working with JSON are easy:

1- 1 line of code to load the dataset 1- register the dataset as a table

Interface to use here and in the future: sqlContext.jsonRDD(data)

## Future Work

Better/Easily handling corrupted data
JSON column
SQL DDL commands for defining JSON data sources
Support for semi-structured such as CSV file

Will be available at Spark 1.0.1

# StreamSQL on Spark: Manipulating Streams by “SQL” using Spark

Grace(Jie) Huang, Jerry(Saisai) Shao (Intel)

Open-source framework: A Real-Time Analytical Processing (RTAP) example using Spark/Shark

To manipulate stream data like static data!
Output of the StreamSQL is a Stream itself!
Built on top of SparkStreaming + Catalyst

DEMO:

## Future Work

Time-based window function
More generic physical plan design (rule-based)
Enrich DDL operations. ie: create stream as table
Support more streaming source (Flume)
CLI for Catalyst and StreamSQL

# Spark Job Server: Easy Spark Job Management

Evan Chan, Kelvin Chu (Ooyala, Inc.)

GH repo
Spark Job Server is a vision to provide Spark as a Service internally on Ooyala
Persistence

## Future Plans

HA and Hot Failover for Jobs

Mutable Ideas

Spark Summit 2014 - Day 1 (Afternoon)

# Performing Advanced Analytics on Relational Data with Spark SQL

## Demo:

# Using Spark Streaming for High Velocity Analytics on Cassandra

# Applying the Lambda Architecture with Spark

# Testing Spark: Best Practices

## Testing pipeline setup

## Stress Testing

# ElasticSpark: Building 1000 node elastic Spark clusters on Amazon Elastic MapReduce

# Streamlining Search Indexing using Elastic Search and Spark

# Easy JSON Data Manipulation in Spark

## Future Work

# StreamSQL on Spark: Manipulating Streams by “SQL” using Spark

## Future Work

# Spark Job Server: Easy Spark Job Management

## Future Plans

# That’s all for today! Going to Reception now :)

Read next