introduction to jq for data engineering

Data Engineer’s Lunch #19: Introduction to jq for Data Engineering

In Data Engineer’s Lunch #19: Introduction to jq for Data Engineering, we introduce jq and discuss how we can use it for data engineering. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!

Introduction

In Data Engineer’s Lunch #19: Introduction to jq for Data Engineering, we introduce jq and discuss how we can use it for data engineering. jq is a command-line tool like sed for JSON data and can be used to slice, filter, map, and transform structured data. If you missed our talk on sed and/or awk, find them on youtube.com/anantcorp.

jq is written in C and has no runtime dependencies, so we know it is going to be fast. Some typical use cases for jq include parsing and manipulating JSON data and more specifically things like extracting a particular field of an object, converting a number to a string, or various other standard tasks.

jq works by taking an input and parsing it as a sequence of whitespace-separated JSON values, which are then passed through the provided filter(s) one at a time. The output of the filter(s) are written to STDOUT as a sequence of whitespace-separated JSON data. jq has pretty in-depth documentation and examples, so it is definitely worth it to check it out. Also in the documentation is a fairly large list of built-in operations and functions, which just goes to show how robust the tool is.

The basic syntax for a jq command is as follows:

jq -options 'program' input-file

Now, we will move onto the walkthrough portion. If you want to watch a live recording of the walkthrough, it is embedded below. We use this GitHub repository for the walkthrough, or you can follow along below.

Walkthrough

We recommend going through this walkthrough in Gitpod as Gitpod will have everything we need for this walkthrough. Hit the button below to get started!

If you prefer to do this locally, then you may need to download the latest version of jq if you do not have it on your local OS.

1. Basic Printing

1.1 – Print all

The simplest jq program is the expression ., which takes the input and outputs it as is.

1.1.1 – Print using input file cars.json.

jq '.' cars.json 

1.1.1 – Print using output of pipe.

cat cars.json | jq '.'

1.2 – Length of JSON .

jq 'length' cars.json 

2. Select Object

2.1 – Print first object.

jq '.[0]' cars.json

2.2 – Print 3rd and 5th objects.

jq '.[2], .[4]' cars.json

3. Print Keys and Values

The | operator combines two filters by feeding the output(s) of the one on the left into the input of the one on the right. It’s pretty much the same as the Unix shell’s pipe, if you’re used to that.

3.1 – Print Keys.

jq '.[] | keys' cars.json

3.2 – Select value of key of objects nested in array

jq '.[].id' cars.json

3.3 – Select values of multiple keys

3.3.1 – Top level

jq '.[] | .id, .location' cars.json

3.3.2 – Nested

jq '.[].car | .make, .model, .year' cars.json

4. Change values

jq '.[].location |= "Virginia Beach"' cars.json 

5. Select based on filters

5.1 – Based on make

jq '.[].car | select(.make == "Buick")' cars.json 

5.2 – Based on year

jq '.[].car | select(.year > 2000)' cars.json 

6. Delete Keys

6.1 – Delete Single Key

jq 'del(.[].id)' cars.json

6.2 – Delete multiple nested keys.

jq 'del(.[].car.year) | del(.[].car.make)' cars.json

7. Potential Real World Application

The next part of the walkthrough will focus on a scenario where someone could use jq for data engineering/wrangling in a “real world setting”. At the time of creating this repo, the 2021 MLB season has just started up. Nowadays, data is heavily ingrained into professional sports, but baseball especially with Sabermetrics (list of just offensive statistics). Moving forward, we will use jq to do some data engineering/wrangling and give our analysts some basic statistics to work with.

7.1 – Get roster of the Phillies

curl 'http://lookup-service-prod.mlb.com/json/named.roster_40.bam?team_id='143'' | jq '.roster_40.queryResults.row' > phillies.json

7.2 – Sort / Group By

We could sort or group by to do some filtering; however, we don’t use it moving forward. Here is what those options could look like though.

7.2.1 – Sort by position. sort_by(foo) compares two elements by comparing the result of foo on each element.

jq 'sort_by(.position_txt)' phillies.json

7.2.2 – Group by position. group_by(.foo) takes as input an array, groups the elements having the same .foo field into separate arrays, and produces all of these arrays as elements of a larger array, sorted by the value of the .foo field.

jq 'group_by(.position_txt)' phillies.json

7.3 – Filter out pitchers

-s: Read the entire input stream into a large array

jq '.[] | select(.position_txt == "P")' phillies.json | jq -s '.' > pitchers.json

7.4 – Create pitchers CSV

The script file toCSV.jq does the following:

  1. Take the array input containing all the different keys and set them as the columns
  2. For each object in the array input, map the columns to the corresponding values in the object and set them as the rows.
  3. Put the column names before the rows, as a header for the CSV, and pass the resulting row stream to the @csv filter.
  4. Use the -r flag to get the result as a raw string
jq -r -f toCSV.jq pitchers.json > pitchers.csv

7.5 – Get stats for the pitchers in the rotation who are either on our team or we are facing

7.5.1 – Aaron Nola (SP).

curl 'http://lookup-service-prod.mlb.com/json/named.sport_career_pitching.bam?league_list_id=%27mlb%27&game_type=%27R%27&player_id=%27605400%27' | jq '.sport_career_pitching.queryResults.row' > nola.json

7.5.2 – Zack Wheeler (SP2).

curl 'http://lookup-service-prod.mlb.com/json/named.sport_career_pitching.bam?league_list_id=%27mlb%27&game_type=%27R%27&player_id=%27554430%27' | jq '.sport_career_pitching.queryResults.row' > wheeler.json

7.5.3 – Zach Eflin (SP3).

curl 'http://lookup-service-prod.mlb.com/json/named.sport_career_pitching.bam?league_list_id=%27mlb%27&game_type=%27R%27&player_id=%27621107%27' | jq '.sport_career_pitching.queryResults.row' > eflin.json

7.6 – Add their names to their stats

7.6.1 – Aaron Nola (SP).

jq '.full_name = "Aaron Nola"' nola.json > nola.json.tmp && mv nola.json.tmp nola.json 

7.6.2 – Zack Wheeler (SP2).

jq '.full_name = "Zack Wheeler"' wheeler.json > wheeler.json.tmp && mv wheeler.json.tmp wheeler.json 

7.6.3 – Zach Eflin (SP3).

jq '.full_name = "Zach Eflin"' eflin.json > eflin.json.tmp && mv eflin.json.tmp eflin.json 

7.7 – Combine their data into one JSON array

jq -s '.' nola.json eflin.json wheeler.json > rotation.json

7.8 – Generate rotation CSV file

jq -r -f toCSV.jq rotation.json > rotation.csv

We can then hand this CSV back to our analytics team and they can then use it to potentially gameplan for the series with / against these pitchers. With that, we will wrap up our walkthrough on basic jq operations as well as a potential real-world scenario in which we can use a tool like jq to do some fast data engineering/wrangling.

Again, if you want a more in-depth discussion and live walkthrough, be sure to check out the embedded YouTube video below. If you missed last week’s Data Engineer’s Lunch #18: Luigi for Scheduling, be sure to check that out as well.

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!