This post shows how to create heatmaps of conversations taking place on Twitter. It is a proof-of-concept technique to learn more about our current datasets; this knowledge will later be applied to the product development cycle. My objective here is to share a simple way to create a quick visualization and put together an internal demo.
TL;DR: The source code and an interactive version are available below.
The render will look like this:
This dataset contains tweets from the 2014 Brazilian elections.
Extract the points from tweets
Because coordinate information is not present on all tweets, we need to filter out those tweets that don't contain any geographic information.
Important: From Twitter docs: The inner coordinates array is formatted as geoJSON (longitude first, then latitude).
Note that we also converted the JSON to CSV so it will be easier to aggregate in a later step.
As you can see, jq makes the task really straightforward:
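The extraction step could be sketched like this. The input file name `tweets-elections.json` is illustrative, and I assume one tweet object per line in the classic Twitter API v1.1 format, where tweets without geo data carry a `null` `coordinates` field:

```shell
# Keep only tweets that have geo data, then emit "longitude,latitude"
# as a CSV line (Twitter's geoJSON puts longitude first).
jq -r 'select(.coordinates != null)
       | .coordinates.coordinates
       | @csv' tweets-elections.json > coordinates-elections.csv
```

The `@csv` filter takes the `[longitude, latitude]` array and renders it as a comma-separated line, and `-r` drops the surrounding JSON string quotes.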
Just to get some statistics about the presence of coordinates (it is an opt-in field on the Twitter client), I analyzed a dataset of 2M tweets: ~50,000 of them had coordinates, a total of 2.5%.
The content of coordinates-elections.csv looks like:
Now we have extracted ~50,000 coordinates, but we can't throw this amount of data at a front-end library and expect a sub-second response, nor is that necessary.
I ran a few arbitrary (and not very extensive) tests and found that 300 points are enough for a great visualization combined with a quick response time.
The simple technique I came up with to aggregate this data was reducing the precision of the coordinates, which won't affect our visualization, since we can still pinpoint the hot zones on the map.
We will reduce the precision to 11.1 km, which means keeping only one decimal place; this is possible because precision is related to the number of decimals you have in your coordinates (one degree of latitude is roughly 111 km, so 0.1 degree is about 11.1 km).
A more complex approach would be to map all points close to each other into the same bounding box and calculate their average; this would bring more precision but would not change the big picture much.
To help this task, I’ve created a couple of bash functions:
Include these functions in your bash shell and let's work.
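A sketch of what such a helper could look like (the function name and file names are illustrative, assuming `lon,lat` lines as produced above):

```shell
# Round each coordinate to 1 decimal place (~11.1 km resolution).
# Reads "lon,lat" lines on stdin, writes the reduced lines to stdout.
reduce_precision() {
  awk -F',' '{ printf "%.1f,%.1f\n", $1, $2 }'
}

reduce_precision < coordinates-elections.csv > coordinates-elections-reduced.csv
```

Using `printf "%.1f"` rounds rather than truncates, so points near a cell boundary fall into the nearest cell instead of always snapping in one direction.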
Now the coordinates in your file only have 1 decimal of precision, i.e., ~11.1 km:
Once we have the coordinates-elections-reduced.csv file, we can use common bash commands to get the top points:
First we sort the coordinates as numbers, group them together and count, and finally take the 300 most frequent points. This pipeline is analogous to a SQL query:
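One possible shape of that pipeline (file names as above; the count/coordinate TSV layout is an assumption):

```shell
# Roughly: SELECT point, COUNT(*) FROM coords
#          GROUP BY point ORDER BY COUNT(*) DESC LIMIT 300;
sort -n coordinates-elections-reduced.csv \
  | uniq -c  `# collapse adjacent duplicates, prefixing each with its count` \
  | sort -rn `# most frequent points first` \
  | head -n 300 \
  | awk '{ print $1 "\t" $2 }' > points.csv
```

The initial `sort` matters because `uniq` only collapses adjacent identical lines; the final `awk` rewrites `uniq -c`'s padded output as clean tab-separated `count<TAB>lon,lat` lines.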
The content of the points.csv file is:
Converting TSV to JSON
Finally, we just have to convert this TSV to JSON and transform all lines into a single array; this is done with the -s (slurp) option on jq.
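A minimal sketch of that conversion, assuming `points.csv` holds `count<TAB>lon,lat` lines (the object keys are illustrative):

```shell
# -R reads each raw line as a string; the filter splits it into fields
# and builds one JSON object per point. The second jq invocation uses
# -s (slurp) to wrap all objects into a single JSON array.
jq -R 'split("\t") as [$count, $coord]
       | ($coord | split(",")) as [$lon, $lat]
       | { count: ($count | tonumber),
           lon:   ($lon   | tonumber),
           lat:   ($lat   | tonumber) }' points.csv \
  | jq -s '.' > points.json
```

The resulting `points.json` is a single array, which is exactly the shape most front-end heatmap libraries expect to receive.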
Now that I have the final information it is just a matter of selecting a visualization library.
What can you tell me?
- Following these steps, have you been able to create this visualization?
- Have you learned something new about your data?
- Do you agree or disagree with this method? What would you do differently?
I would love to hear your story! Drop a comment below or talk to me on Twitter.