# Querying Json Datasets With Jq

Working with JSON datasets is really common task nowadays, almost any API will output information on this format, but is still complex to manipulate this format when compared with plain-text combined with common unix commands like cut, awk, sed, etc.

To reduce this gap jq was developed with exactly this paradigm in mind jq is like sed for JSON data. This post will walk through the details to: select fields (projection), flatten arrays, filter jsons based on a field value and convert JSON to CSV/TSV.

## Installing jq

I’m using homebrew to install jq, so you just need to use:

$brew install jq Also you can download binaries for your platform or you can also play online jqplay.org ## Sample Data As sample data, I will use a few tweets I get from my timeline, those jsons are obtained using the Twitter API statuses/lookup.json. The dataset I’m using for the sample statuses.jaon ## Pretty Print Opening a json file with cat or less makes really hard to understand its content, even more if the file is compacted, as most the API does. It would looks like this: Just by passing the entire file to jq, we can have a much better view of file’s content, display first 3 lines of json:$ head -3 statuses.json | jq '.'

Now it is very clear simple to identify the json structure and work with its fields.

## Select Fields (Projection)

Now we can select some fields from the json, I’m interested to have the Twitter User and the Number of Followers:

This output is not that useful, I prefer to create a new json with the information I need:

Using the jq switcher -c you can have one-liner json, that can be useful to input for another process:

## Flatten Arrays

Sometimes the content we’re looking for is inside an array, it is really easy to parse this. As an example, we will list all urls inside tweets:

Or the images attached:

When I tried to do the parse .entities.media[] field I got some errors, it happened because not all tweets contains the field media, to avoid the error message we need to filter out fields with null vales

## Filter json based on a field value

With the command select we’re able to remove the jsons that doesn’t contains .entities.media element: select(.media != null)

## Everything together

This example selects only tweets made by me (user.screen_name == "arjones") and project the count of ReTweets, Favorites for each tweet:[]

## Convert JSON to CSV/TSV

There is a very long discussion on github to support CSV/TSV output My preferred trick to make it work is to convert the fields to an array and use the | @csv command. This example outputs the same content as above but in CSV format:

Once you have a CSV you can check these posts that can create an exploratory pipeline: