Sometimes you just need data to learn how an algorithm works, to run a stress test, or just to have an excuse to spin up several machines in a cluster and see how it crushes the data. More often than not, it is incredibly hard to obtain data, and a few colleagues I’ve talked to had similar problems, so this post is a collection of links and references for datasets I know have been open sourced. Please contribute =)
Although the tag cloud seems a somewhat outdated and criticized visualization format, I have no doubt it can be useful sometimes. And if you can create one with only a few keystrokes, it is pretty sweet. Below I’ll show the technique for extracting Twitter #hashtags, but you can apply this technique to virtually any text source.
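As a rough illustration of the extraction step, here is a minimal Python sketch that counts hashtag frequencies, which is the input a tag cloud needs. It is not the method from the post itself; the function name hashtag_counts, the regular expression, and the sample tweets are all assumptions made up for this example.

```python
import re
from collections import Counter

# Assumption: a hashtag is "#" followed by letters, digits, or underscores.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_counts(texts):
    """Count hashtags across an iterable of strings, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


# Hypothetical sample data, standing in for real tweets.
sample_tweets = [
    "Learning #Spark with #BigData tools",
    "Another day, another #spark job",
    "No hashtags in this one",
]

# The most frequent tags are the ones a tag cloud would draw largest.
top_tags = hashtag_counts(sample_tweets).most_common(2)
```

Because the function only expects an iterable of strings, the same counting works for any text source, not just tweets.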
This post could also be called Reading .gz.tmp files with Spark. At Socialmetrix we have several pipelines writing logs to AWS S3. Sometimes Apache Flume fails in the last phase, renaming the final archive from .gz.tmp to .gz, so those files cannot be read through the SparkContext.textFile API. This post presents our workaround to process those files.
Recently I built an environment to help me teach Apache Spark. My initial thought was to use Docker, but I ran into some issues, especially on older machines, so to avoid more blockers I decided to build a Vagrant image and complement the package with Apache Zeppelin as the UI.
This Vagrant will build on Debian Jessie, with Oracle Java, Apache Spark 1.4.1 and Zeppelin (from the
Several tutorials assume you already have a data set. Often that is not the case, and you can’t take advantage of the tutorial because you don’t have data to play along with. To comply with social networks’ Terms and Conditions you can’t publish your data sets, but you can create your own! Follow through these few commands.
This post shows how to create heatmaps of conversations taking place on Twitter. It is a proof-of-concept technique for learning more about our current datasets, and this knowledge will later be applied to the product development cycle. My objective here is to share a simple way to create a quick visualization and be able to make an internal demo.
A few days ago I received an email from a student at Universidad Tecnológica Nacional asking me for advice about what kind of skills he needed to acquire to be hired as a Big Data Engineer. I felt it was something worth writing about, and hopefully it can generate a healthy debate and help more people.
The book “DATA + DESIGN | a simple introduction to preparing and visualizing information” is an excellent reference for creating visualizations for several types of data. It guides you through simple and complex data with very clear Dos and Don’ts. On top of all that, it is free.
Often we have to work with JSON data sets, but now and then data comes in CSV format. I received a great tip from @diegodellera, who told me about textql - Execute SQL against structured text like CSV or TSV.
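To make the idea of running SQL over CSV concrete, here is a small Python sketch that loads CSV text into an in-memory SQLite table and queries it. This is not textql and does not show its command-line usage; the helper query_csv, the table name data, and the sample rows are assumptions invented for illustration.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="data"):
    """Load CSV text (first row = header) into SQLite and run one query."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    conn = sqlite3.connect(":memory:")
    # Quote column names so headers with spaces or keywords still work.
    columns = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))

    # Values are inserted as text; cast in SQL when numbers are needed.
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), body
    )
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Hypothetical sample data.
sample_csv = "user,retweets\nalice,3\nbob,5\nalice,4\n"

totals = query_csv(
    sample_csv,
    "SELECT user, SUM(CAST(retweets AS INTEGER)) FROM data "
    "GROUP BY user ORDER BY user",
)
```

For TSV input, the same approach should work by passing delimiter="\t" to csv.reader.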
I took the TRACK B: Advanced Apache Spark Workshop, and I can say it was really great to learn more about Spark internals and its libraries. The Databricks team was awesome. All slides and training material are already online: Spark Summit 2014 Training.