Mutable Ideas

Notes and ideas about Java, Scala, Big Data, NoSQL, Quality and Software Deploy

Twitter JSON Manipulation

Today I had to quickly find the most frequent hashtags in my smallish dataset. After some research I found an awesome shell tool for manipulating JSON: jq, a kind of grep+sed+awk for JSON.

With jq, everything else was simple: just pipe a few commands together:

$ cat tweets.json | \
  jq -r '.entities.hashtags[].text' | sort | uniq -c | \
  sort -nr | head -10


$ cat tweets.json |
  jq '.text' |               # select the text field of each tweet
  tr 'A-Z' 'a-z' |           # convert the text to lower case
  grep -oE '#[0-9a-z_]+' |   # extract each hashtag
  sort | uniq -c |           # count occurrences of each distinct hashtag
  sort -nr | head -10        # sort by descending frequency and keep the top 10

A couple of minutes later, the output was:

487 #bigdata
131 #java
 59 #analytics
 34 #truoptik
 33 #cloud
 24 #jobs
 16 #job
 15 #healthcare
 15 #hadoop
 15 #followfriday

Of course there are more scalable approaches, but for a small dataset it works just fine without any setup.
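
For repeated use, the jq-based pipeline could be wrapped in a small shell function. The following is only a sketch under a few assumptions: the name top_hashtags and its two arguments are made up for illustration, the input file is assumed to hold one tweet object after another (not a single JSON array), and jq is assumed to be version 1.5 or later so that ascii_downcase is available. It lower-cases the tags so that #BigData and #bigdata are counted together, and the `[]?` skips records that carry no hashtags field.

```shell
# Hypothetical helper: print the N most frequent hashtags in a file of tweets.
# Usage: top_hashtags FILE [N]   (N defaults to 10)
top_hashtags() {
  file="$1"
  n="${2:-10}"
  # Read hashtags from the structured entities field; "[]?" silently skips
  # records where .entities.hashtags is missing or null.
  jq -r '.entities.hashtags[]?.text | ascii_downcase' "$file" |
    sort | uniq -c |       # count occurrences of each distinct hashtag
    sort -nr |             # most frequent first
    head -n "$n"           # keep only the top N
}
```

Calling `top_hashtags tweets.json 10` should print one line per hashtag, count first, in the same shape as the output above (without the leading #, since the entities field stores the bare tag text).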