Which skills should I learn to become a Big Data Engineer?

A few days ago I received an email from a student of Universidad Tecnológica Nacional asking me for advice about what kind of skills he needed to acquire to be hired as a Big Data Engineer. I felt it was something worth writing about, and hopefully it can generate a sane debate and help more people.

Disclosure

The frameworks and books I'm referring to here are based on my personal experience working in the Digital Marketing and Analytics industry. If you want to learn bigdata to work on DNA/medical research or stock trading, this post probably offers an incomplete reference.

You may also take this post as a reference if you want to work at Analytics companies similar to Socialmetrix.

I also assume you are a Senior Developer who understands general-purpose programming languages such as Java or Python, can manage Linux boxes (plural), and understands SQL databases, ETL processes, etc.

Big Data field

A lot has been said about the Big Data field: it's a fad, it's a bubble, it's a marketing/business stunt.

I believe bigdata embraces great techniques and tools to solve specific problems (emphasis on specific). Most of the complaints come from the fact that people are trying to use bigdata to solve small/medium data problems. I can relate to the curiosity and the eagerness to try this new technology that everyone is talking about, but the hard truth is that most of the time you don't need big data.

For those cases where bigdata is really necessary, it can be very complex when you need to combine batch processing, real-time processing, different datastores, response times, etc. - last year my colleague Sebastian Montini and I talked at Amazon AWS re:Invent about evolving our bigdata platform - our key learning was "use as few frameworks as possible to reduce the complexity".

Also, a lot of frustration comes from the fact that bigdata is still in its infancy; most of the projects have been around for just a few years and they are isolated pieces you must glue together to create your solution - and you will spend a lot of time gluing things!

Besides all the challenges, bigdata is a super interesting and exciting field to work in, and most of the skills for bigdata are on the list of the 25 Hottest Skills That Got People Hired in 2014 - not bad ;)

Book recommendations

There are hundreds of projects and tools evolving really quickly, so I recommend you learn the concepts and read the papers these projects are based on. These will give you the foundation to solve problems when they arise.

  • FREE: Distributed systems for fun and profit - a really good introduction to the core concepts of distributed systems that any bigdata engineer must understand and reason about.

  • Designing Data-Intensive Applications - a very good introduction to the concepts and a comparison between approaches and technologies.

Projects and Solutions

Zooming in on the systems and technologies you will probably handle in your bigdata application.

Storage

All your data must be available to be processed, so where and how you store your data is very important for later processing. A heads-up: don't treat this as a normal filesystem; consistency is an issue here, especially when one job finishes writing and you want to read the data immediately (a minimal polling sketch follows the list below).

  • HDFS - the most common distributed filesystem for bigdata work.

  • Amazon AWS S3 - to avoid maintaining our own HDFS, we use S3 as our primary storage; it is affordable and most new tools have integrations with it.
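
As a rough sketch of that heads-up, assuming boto3 is installed and credentials are configured (the bucket and prefix names are hypothetical), you can poll for the `_SUCCESS` marker that Hadoop and Spark jobs write when they finish before trying to read their output:

```python
# Minimal sketch: wait for a batch job's output to become visible on S3
# before reading it. Bucket/prefix names are hypothetical; assumes boto3
# and AWS credentials are already configured.
import time

import boto3

s3 = boto3.client("s3")

def wait_for_success_marker(bucket, prefix, timeout=300, interval=5):
    """Poll for the _SUCCESS file Hadoop/Spark jobs write when they finish."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix + "/_SUCCESS")
        if resp.get("KeyCount", 0) > 0:
            return True
        time.sleep(interval)  # output not visible yet, try again
    raise RuntimeError("Output never showed up under s3://%s/%s" % (bucket, prefix))

# Usage with a hypothetical bucket/prefix:
# wait_for_success_marker("my-analytics-bucket", "daily-aggregates/2015-01-31")
```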

For processing

How can you transform the stored data into something meaningful?

  • Apache Hadoop - the icon of bigdata, the most used and best-known framework. It's important to understand MapReduce and why it's a powerful concept: you divide a task into parts and parallelize it among several nodes. The ecosystem around Hadoop is also interesting; I personally used Hive and Scalding, and I think you should at least read about Pig. That said, we are migrating most of our workload from Hadoop to Apache Spark, more on that below.

  • Apache Storm - a great project for realtime processing, with great abstractions and a thriving community. We are dropping support for it due to the amount of gluing necessary to make it work together with Hadoop - it is not a deficiency of the project at all, it is just that we felt Spark reduces the number of abstractions you have to learn and maintain.

  • Apache Spark - leveraging the Hadoop ecosystem, a great foundation from Berkeley AMPLab and a set of built-in libraries, Spark looks to me like a powerful replacement for Hadoop that also reduces development and operational complexity. Using Spark we have been able to ditch Hadoop and Storm and work with the same technology for batch and streaming processing, Hive-SQL support included. Machine Learning experiments are also supported with MLlib. I wrote a few posts about Spark. A minimal batch sketch follows this list.
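
To make the MapReduce idea concrete, here is a minimal PySpark word-count sketch (the S3 paths and application name are hypothetical): the input is split into partitions, each node maps over its share, and the results are reduced by key.

```python
# Minimal PySpark sketch of the MapReduce idea: the input is split into
# partitions, each node maps over its share, and the results are reduced
# by key. Paths and the app name are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

counts = (sc.textFile("s3n://my-analytics-bucket/raw/posts/*.txt")  # read and partition the input
            .flatMap(lambda line: line.split())                     # map: one record per word
            .map(lambda word: (word.lower(), 1))                    # emit (key, 1) pairs
            .reduceByKey(lambda a, b: a + b))                       # reduce: sum counts per word

counts.saveAsTextFile("s3n://my-analytics-bucket/aggregates/word-counts")
sc.stop()
```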

Inter-System Communications

Understand how data travels through your solution and what guarantees you have available for high availability, resilience and latency.

Distributed Datastore

There is space for both SQL and NoSQL in the organization; we use whatever fits the job best. This is not an extensive list - there are tons of datastore engines and I left a lot out; these are the ones we have production experience with.

  • Key-Value / Columnar - Apache Cassandra - a linearly scalable datastore, which we use mostly for time series or user profiles, use cases that work well on a KV database. We compensate for the absence of GROUP BY by creating materialized views with Apache Spark.

  • In-Memory Data Structures - Redis - a very fast datastore with interesting data structures; it can do a lot more than act as a web cache. We use it in our realtime dashboards, updating all metrics every second (a sketch of this follows the list).

  • Document - MongoDB - a very flexible schema-less datastore. We use it mostly to keep track of accounts and users, where the properties available for each record can vary a lot.
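
As a rough illustration of that realtime-dashboard use case, here is a sketch using the redis-py client; the host, key layout and the counters themselves are hypothetical:

```python
# Rough sketch of the realtime-dashboard use case: bump per-second counters
# in Redis as events arrive and let them expire after an hour. Host and key
# names are hypothetical; uses the redis-py client.
import time

import redis

r = redis.Redis(host="localhost", port=6379)

def record_mention(account_id):
    """Increment the mentions counter for the current second."""
    bucket = int(time.time())                      # one counter per second
    key = "dashboard:%s:mentions:%d" % (account_id, bucket)
    r.incr(key)
    r.expire(key, 3600)                            # keep only the last hour

def current_rate(account_id):
    """Read back the counter for the current second (what the dashboard polls)."""
    key = "dashboard:%s:mentions:%d" % (account_id, int(time.time()))
    value = r.get(key)
    return int(value) if value else 0
```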

If you are storing text, sooner or later you will need to search it.

Workflows, Schedulers, Control and “glue”

There is a huge difference between launching a process manually and making it run reliably every day; schedulers come to the rescue for this task. Moving data in and out of your bigdata cluster is another important topic to keep in mind.

  • Oozie - Apache Oozie Workflow Scheduler for Hadoop

  • Azkaban - A batch job scheduler from LinkedIn

  • Flume - efficiently collect, aggregate, and move large amounts of log data; we also use it to sink data from our crawlers into S3.

  • Sqoop - transfer bulk data between Apache Hadoop and structured datastores such as relational databases.

  • Luigi - build complex pipelines of batch jobs. It handles dependency resolution, workflow management and visualization (a minimal task sketch follows this list).
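
To give an idea of what Luigi's dependency resolution looks like, here is a minimal sketch with two dependent tasks; the task names and output paths are hypothetical:

```python
# Minimal Luigi sketch: one task depends on another, and Luigi only runs
# what is missing. Task names and output paths are hypothetical.
import luigi

class ExtractLogs(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("data/raw-%s.txt" % self.date)

    def run(self):
        with self.output().open("w") as out:
            out.write("raw log lines for %s\n" % self.date)  # placeholder extraction

class DailyReport(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractLogs(date=self.date)  # Luigi resolves and runs this first if needed

    def output(self):
        return luigi.LocalTarget("data/report-%s.txt" % self.date)

    def run(self):
        with self.input().open() as raw, self.output().open("w") as out:
            out.write("report based on %d lines\n" % len(raw.readlines()))

if __name__ == "__main__":
    luigi.run()  # e.g. python pipeline.py DailyReport --date 2015-01-31 --local-scheduler
```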

Acknowledgements

Thanks to my colleague Juan Pampliega, who helped me with book references and general ideas.

What is your experience/opinion?

What do you think about this stack? What would you add to or remove from it? I would love to hear from you and learn more about others' experiences!

Gustavo Arjones

Always Learning, Geek, Curious