Skip to main content

Mutable Ideas

Spark Summit 2014 - Day 2

Following the Spark Summit 2014 - Day 2

# The Emergence of the Enterprise Data Hub

Mike Olson (Chief Strategy Officer, Cloudera)

  • Cloudera plans to port Hadoop Ecosystem to Spark, as replacement to M/R. img

  • Cloudera will keep support Impala, among Spark components. IMHO, it is split efforts and I can understand why they are doing this, beside biz decision of course! img


# The Future of Spark

Patrick Wendell (Databricks)

## Goals of project

  • Empower Data scientists and engineers to do their job
  • Expressive & clean API
  • Unified runtime across many environments
  • Powerful standard libraries

## API

  • Focus on API stability on Spark 1.0+ (breaking patchs are automatically rejected)
    • Minor: Every 3 months (1.1 August), 1.2, 1.3
    • Maintenance are kept active 1.0.1, 1.0.2, etc

## Future is about libraries

  • Focus on high-level libraries
  • Packaged and distributed w/ Spark to provide full inter-operability

## Spark SQL

  • More active process
  • Notion of schema RDDs
  • Focus now are:
    • Optimization
    • Language extension (towards SQL92)
    • Integration img

## What about Shark?

  • Will be replaced by Spark SQL.
  • JDBC server component preview on 1.0.1
  • Final release to 1.1 img

## Spark Core

  • Allow extension/innovation by defining internal API’s
  • Internal Storage API
  • Spark shuffle API (sort-based, pipeline)

## Timeline

Spark 1.0.1

  • JSON Support

Spark 1.1

  • Generalized Shuffle Interface
  • MLlib stats algorithms
  • JDBC Server
  • Sort-based shuffle

Spark 1.2

  • Refactor Storage Engine

Spark 1.3+

  • SparkR


# Beyond Analytics — Building Data Products for Data Natives

Monica Rogati - @mrogati (VP of Data at Jawbone)

## Data Natives:

  • Beyond digital natives, expect smart and seamlessly adapt
  • Expect things to KNOW what they want, ie: Expect the thermostat programs itself
  • The promise: better, richer, easier lives
  • quite not there yet! img

## Data Products:

Context, Personalization by Using Data, from You, Others and The World

  • How data product can drive life changes (eat, sleep, exercise, achieve your goals)

    Data Science is not about charts and Graphs is about delivery better experiences

## Analytics + Exploration to Build Data Products

  1. Good Instrumentation
  2. Reliable Data Flow (fault tolerance, scalable)
  3. Data Cleanup
  4. Fast Iteration (if it takes 30min to have a top distro, we not gonna check the data)
  5. Good UX

More than that: img

The virtuous cycle of smart interactions: More & better data comes from better UX, ie: Auto-complete for food app. img


# Break for Lunch!