Ever since I got to know News.me, it has surprised me with its simplicity and yet its power to recommend the best stories to read. I always thought it would be fun to try to build something similar, so I decided to create a PoC of Twitter’s top stories using Apache Spark.
DISCLAIMER: this is a PoC, mainly focused on learning Spark. This architecture doesn’t represent a production-level product, nor do I consider recommending stories for a single user a big data problem.
The solution is based on two stages:

1. Collect tweets from the stream, analyze them, and store those that contain a link, expanding the link to its final destination (removing URL shorteners and click counters).
2. Run a batch job to process the data from the previous stage and create a top-10 list.
Building & Running
I use Eclipse + Scala and Typesafe’s sbteclipse plugin to create an Eclipse project.
Copy src/main/resources/twitter4j.properties.template, add your credentials, and rename it to twitter4j.properties:

```
twitter4j.oauth.consumerKey=
twitter4j.oauth.consumerSecret=
twitter4j.oauth.accessToken=
twitter4j.oauth.accessTokenSecret=
```
Let’s build our project:
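The original commands aren’t shown here; assuming a standard sbt setup (with the sbteclipse and sbt-assembly plugins configured), building would look something like:

```shell
sbt eclipse    # generate the Eclipse project files (sbteclipse plugin)
sbt assembly   # compile and package a fat jar with all dependencies
```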
And run it:
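Again, the exact command isn’t shown; with a single main class in the project, sbt can launch it directly:

```shell
sbt run    # starts the streaming job; credentials are read from twitter4j.properties
```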
Edit following.txt, adding accounts that you find interesting!
If everything is working properly, every 5 minutes you are going to see a new folder at
./urls/999999999/, where the number represents the Unix timestamp, rounded down to the minute. For example:
urls/1404627900/ --> 7/5/2014 11:25:00 PM GMT-7
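The folder name can be derived by rounding the epoch time of each batch down to the batch interval; a minimal sketch (the helper name and rounding scheme are my assumptions, matching the 5-minute batches above):

```scala
// Round an epoch timestamp (in seconds) down to a 5-minute boundary,
// producing folder names like urls/1404627900/.
def bucketFor(epochSeconds: Long, intervalSeconds: Long = 300): Long =
  epochSeconds - (epochSeconds % intervalSeconds)
```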
You can check the activity using the Spark UI.
Create a StreamingContext with a 5-minute window, load the accounts, and create the Twitter stream:
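A sketch of that setup against the Spark 1.x streaming API (app name, master, and variable names are my assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import scala.io.Source

// StreamingContext that produces a batch every 5 minutes.
val conf = new SparkConf().setAppName("TopStories").setMaster("local[2]")
val ssc = new StreamingContext(conf, Minutes(5))

// Accounts we follow: one screen name per line in following.txt.
val following = Source.fromFile("following.txt").getLines().toSet

// Open the Twitter stream; OAuth credentials come from twitter4j.properties.
val stream = TwitterUtils.createStream(ssc, None)
```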
For each tweet that contains a URL, extract it; if there is more than one URL, extract only the first:
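A minimal sketch of that extraction, using a regex over the tweet text (the real code presumably uses twitter4j’s URL entities; the case class fields and helper names are my assumptions):

```scala
// A tweet's author together with the first URL found in its text.
case class TweetContent(author: String, url: String)

val urlPattern = """https?://\S+""".r

// Return the leftmost URL in the text, if any.
def firstUrl(text: String): Option[String] =
  urlPattern.findFirstIn(text)

// Keep only tweets that actually contain a link.
def toContent(author: String, text: String): Option[TweetContent] =
  firstUrl(text).map(TweetContent(author, _))
```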
Use Spark SQL to implicitly convert an RDD[TweetContent] into a SchemaRDD, so we can go through each SchemaRDD and call saveAsParquetFile to write it to disk:
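With Spark 1.x-era Spark SQL, that looks roughly like the following (tweetsWithUrls is an assumed DStream[TweetContent] built from the stream above):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(ssc.sparkContext)
// Enables the implicit conversion from an RDD of case classes to a SchemaRDD.
import sqlContext.createSchemaRDD

// For every 5-minute batch, save the tweets that contained URLs as a
// Parquet file named after the batch timestamp (rounded to seconds).
tweetsWithUrls.foreachRDD { (rdd, time) =>
  if (rdd.count() > 0)
    rdd.saveAsParquetFile(s"urls/${time.milliseconds / 1000}/")
}
```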
Leave this code running for a few hours so we have data for the next part, which involves calculating the best stories!