Mutable Ideas

Notes and ideas about Java, Scala, Big Data, NoSQL, Quality and Software Deploy

Spark Summit 2014 - Day 2

Following Spark Summit 2014, Day 2. Earlier posts: Day 1 Morning notes | Day 1 Afternoon notes

The Emergence of the Enterprise Data Hub

Mike Olson (Chief Strategy Officer, Cloudera)

  • Cloudera plans to port the Hadoop ecosystem to Spark, as a replacement for MapReduce (M/R).

  • Cloudera will keep supporting Impala alongside the Spark components. IMHO this splits their efforts, though I can understand why they are doing it, business decision aside of course!


The Future of Spark

Patrick Wendell (Databricks)

Goals of project

  • Empower Data scientists and engineers to do their job
  • Expressive & clean API
  • Unified runtime across many environments
  • Powerful standard libraries

API

  • Focus on API stability in Spark 1.0+ (breaking patches are automatically rejected)
    • Minor releases: every 3 months (1.1 in August, then 1.2, 1.3)
    • Maintenance releases are kept active: 1.0.1, 1.0.2, etc.

Future is about libraries

  • Focus on high-level libraries
  • Packaged and distributed w/ Spark to provide full inter-operability

Spark SQL

  • More active process
  • Notion of schema RDDs
  • Current focus:
    • Optimization
    • Language extension (towards SQL92)
    • Integration
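
To make the "schema RDD" idea a bit more concrete, here is a minimal sketch in plain Python. It is not Spark's API: `SchemaRows`, `where` and `select` are hypothetical names made up for illustration. It only shows the core notion from the talk as I understood it, which is that rows travel together with column names, so they can be filtered and projected by name, SQL-style.

```python
from typing import Any, Callable, List, Sequence, Tuple


class SchemaRows:
    """Hypothetical stand-in for a 'schema RDD': plain rows plus column names."""

    def __init__(self, columns: Sequence[str], rows: List[Tuple[Any, ...]]):
        # Every row must have exactly one value per declared column.
        for row in rows:
            if len(row) != len(columns):
                raise ValueError("row width does not match the schema")
        self.columns = list(columns)
        self.rows = list(rows)

    def where(self, column: str, predicate: Callable[[Any], bool]) -> "SchemaRows":
        # Keep only the rows whose value in `column` satisfies the predicate.
        idx = self.columns.index(column)
        return SchemaRows(self.columns, [r for r in self.rows if predicate(r[idx])])

    def select(self, *columns: str) -> "SchemaRows":
        # Project each row onto the requested columns, in the requested order.
        idxs = [self.columns.index(c) for c in columns]
        return SchemaRows(columns, [tuple(r[i] for i in idxs) for r in self.rows])


# Roughly: SELECT name FROM people WHERE age >= 18
people = SchemaRows(["name", "age"], [("Ana", 34), ("Bob", 17), ("Caio", 52)])
adults = people.where("age", lambda age: age >= 18).select("name")
print(adults.rows)
```

Because the schema is known, an engine can in principle reorder or push down these operations, which is presumably where the "Optimization" focus comes in.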

What about Shark?

  • Will be replaced by Spark SQL.
  • JDBC server component preview in 1.0.1
  • Final release in 1.1

Spark Core

  • Allow extension/innovation by defining internal APIs
  • Internal Storage API
  • Spark shuffle API (sort-based, pipeline)
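
For readers unfamiliar with the term, the idea behind a sort-based shuffle can be sketched in a few lines of plain Python. This is a conceptual illustration under my own assumptions, not Spark's implementation: the function names, the integer keys and the modulo partitioner are all hypothetical. Each map task sorts its records by target partition, so it can write one contiguous output plus an index of offsets instead of one bucket per reducer.

```python
from typing import List, Tuple

Record = Tuple[int, str]


def partition_of(key: int, num_partitions: int) -> int:
    # Assumed partitioner for the sketch: integer keys, simple modulo.
    return key % num_partitions


def sort_based_map_output(
    records: List[Record], num_partitions: int
) -> Tuple[List[Record], List[int]]:
    # Stable sort by partition id: each reducer's records become contiguous.
    ordered = sorted(records, key=lambda kv: partition_of(kv[0], num_partitions))
    # Build the index: partition p occupies ordered[index[p]:index[p + 1]].
    index = [0] * (num_partitions + 1)
    for key, _ in ordered:
        index[partition_of(key, num_partitions) + 1] += 1
    for p in range(num_partitions):
        index[p + 1] += index[p]
    return ordered, index


def read_partition(ordered: List[Record], index: List[int], p: int) -> List[Record]:
    # What a reducer for partition p would fetch from this map output.
    return ordered[index[p]:index[p + 1]]


records = [(4, "d"), (1, "a"), (3, "c"), (2, "b"), (6, "f")]
ordered, index = sort_based_map_output(records, num_partitions=3)
print(index)                              # offsets per partition
print(read_partition(ordered, index, 1))  # records destined for reducer 1
```

The point of the design is that the number of map-side outputs no longer grows with the number of reducers; a reducer only needs the index to find its slice.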

Timeline

Spark 1.0.1

  • JSON Support

Spark 1.1

  • Generalized Shuffle Interface
  • MLlib stats algorithms
  • JDBC Server
  • Sort-based shuffle

Spark 1.2

  • Refactor Storage Engine

Spark 1.3+

  • SparkR

The next Spark Summit will be in NYC, in early 2015!


Beyond Analytics — Building Data Products for Data Natives

Monica Rogati - @mrogati (VP of Data at Jawbone)

Data Natives:

  • Beyond digital natives: they expect things to be smart and to adapt seamlessly
  • They expect things to KNOW what they want, e.g. they expect the thermostat to program itself
  • The promise: better, richer, easier lives
  • Not quite there yet!

Data Products:

  • Context and personalization by using data from you, others and the world
  • How data products can drive life changes (eat, sleep, exercise, achieve your goals)

Data science is not about charts and graphs; it is about delivering better experiences.

Analytics + Exploration to Build Data Products

  1. Good Instrumentation
  2. Reliable Data Flow (fault tolerance, scalable)
  3. Data Cleanup
  4. Fast Iteration (if it takes 30 min to get a top distribution, we are not going to check the data)
  5. Good UX

The virtuous cycle of smart interactions: more and better data comes from better UX, e.g. auto-complete in a food app.


Break for Lunch!

Keep reading: Day 2 - Afternoon Notes here