introduction to sed for data engineering

Introduction to sed for Data Engineering

In this blog, we will introduce sed, a stream editor, and how we can use this tool for data engineering. We also include a walkthrough of a demo you can use to learn the basics. You can also check out this GitHub repository for the walkthrough instructions.

sed is a stream editor that is commonly used to take text input, perform operations on the input, and output modified text for fast data engineering. These inputs can be from a file or from an incoming data pipeline. sed also supports basic and extended regular expressions that allow you to match complex patterns. We will show examples of both in the walkthrough section. sed can also be used for extracting part of a file using pattern matching or substituting multiple occurrences of a string within a file which can make a data engineer’s life much easier. Again examples of this will be shown in the walkthrough below. Additionally, you can head straight to the GitHub repo for the walkthrough.

sed works by sequentially reading a file, line by line, into memory. It then performs all actions specified for the line and places the line back in memory to dump to the terminal with the requested changes made. After all actions have taken place to this one line, it reads the next line of the file and repeats the process until it is finished with the file. The default output is to display the contents of each line on the screen. The output can also be redirected to another file leaving the input file unchanged. Additionally, the output can be used to change the input file itself and that will be covered in the walkthrough below. The default is for sed to read the entire file and make changes to each line within; however, it can be restricted to specified lines as needed, which will be discussed in the walkthrough.

sed utilizes by this base syntax:

sed [options] '[command]' [filename]

And building off of this, we will go through this demo to show you how you can do some basic operations with sed for data engineering. In the walkthrough, we will cover how to read lines, delete lines, substitute, and write using sed. The walkthrough is broken into 2 segments, with the first being learning basics, and the second is applying it to a real-world situation.

First, you will need to go to this GitHub repo. You can open the repo directly in Gitpod by hitting the “Open in Gitpod” button, which we recommend so you do not have to download any files and/or software to your local computer. If you prefer to try this out locally, then you will need to clone the repo and download and install GNU sed 4.7, or the latest if you do not already have it.

We will begin with reading lines using sed. You can run any of these following commands as they will do the same thing, but they have different syntax. The flag -n is used to suppress output, and the p command is used to print specific lines.

In this version, we are using a piped input from the cat command and then printing only lines 5-8 of lyrics.txt.

cat lyrics.txt | sed -n '5,8p'

The next version does the same; however, we use the input file directly verses reading it with cat.

sed -n '5,8p' lyrics.txt

The third version also does the same thing; however, when using the + symbol, we are telling sed to start at line 5 and the operate on the next 3 lines as well.

sed -n '5,+3p' lyrics.txt

Next, we will take a look at deleting lines. Similar to reading lines, we can use various ways to do the same thing like so:

cat lyrics.txt | sed '5,8d'
sed '5,8d' lyrics.txt
sed '5,+3d' lyrics.txt

Additionally, we can also delete lines that contain specific patterns. This command will delete all lines that contain the pattern thunder.

sed '/thunder/ d' lyrics.txt 

We could additionally combine commands to run multiple commands at the same time. We could do this with 2 methods: -e and ;.

To combine commands with -e, we can do the following. This combination command will act on both the thunder and magic patterns and delete the lines that contain them.

sed -e '/thunder/ d' -e '/magic/ d' lyrics.txt

The other method is to use ; like below. The commands are separated by ; and the commands are separated by line breaks. Additionally, the same command can be written in one line, which is shown below as well.

sed '
/thunder/ d;
/magic/ d
' lyrics.txt
sed '/thunder/ d; /magic/ d' lyrics.txt

Now we will cover substitutions. Substitutions work for the first occurrence of the chosen sequence in each line unless specified with a global option. If we run the below command, notice that only the first thunder in each applicable line changes to lightning and not both.

sed 's/thunder/lightning/' lyrics.txt

To make it change every thunder available, we can append a g option.

sed 's/thunder/lightning/g' lyrics.txt

Say we also want to change the capitalized Thunder, then we can also add an i option with the g option.

sed 's/thunder/lightning/gi' lyrics.txt

But wait, that also changed the Thundercats into lightningcats and we didn’t want to do that. We can make changes to the command to account for this. We can change the pattern to be Thunder, to avoid changing Thundercats into lightningcats. Additionally, we can combine the first and second substitution commands to do it all at once.

sed '
s/thunder/lightning/g;
s/Thunder,/Lightning,/g
' lyrics.txt

If we want to display only the lines we changed, then we can run the following command:

sed -n '
s/thunder/lightning/gp;
s/Thunder,/Lightning,/gp
' lyrics.txt

There are 8 lines in the output because it shows the output per command per line, so every first line in the pair is for the s/thunder/lightning/gp command and the second line of the pair is for the s/Thunder,/Lightning,/gp command.

Now, we can move onto writing the outputs we’ve generated so far. Currently, we have only been seeing changes in the terminal without actually changing the input file / pipe nor creating a new file from the output. We will cover 2 methods of writing using sed: -i and >.

The -i method overwrites the input file with the output. However, we can append the -i with a .bak to create a backup file so we do not lose the original input file.

sed '
s/thunder/lightning/g;
s/Thunder,/Lightning,/g
' lyrics.txt > edited.txt
cat edited.txt

In the output of edited.txt, we see that we were able to write the changes we made using the substitution commands.

To show how we can create a backup, we will revert the changes we made above and append a .bak to the -i option.

sed -i.bak '
s/lightning/thunder/g;
s/Lightning,/Thunder,/g
' edited.txt

Then if we run cat edited.txt, then we notice that it is the same as the original lyrics.txt file. If we run cat edited.txt.bak, then we see that it is the original edited.txt file before we made the reversion.

Another method for running sed commands is using the -f option and a script file. In this example, we will take an input pipe of lyrics.txt and then use the script file to make further edits. The input pipe uses the = command, which prints the current input line number with a trailing newline. If you look at script.txt, we have 3 lines. In sed script files, we do not use apostrophes, and each line is a separate command. The first line, N, appends the current and next line to the pattern space. The second line, s/\n/. /, substitutes the new line break with period and space. And the third line, G, adds double spacing. So by combining the input pipe and the sed commands, we should get back a double-spaced, numbered list of the lyrics.

sed = lyrics.txt | sed -f script.txt

And that covers some basic operations we can use sed for in things such as data engineering. Now, we can move onto a potential real-world application in which we could use sed for data engineering.

In this “real-world-example”, say we get a CSV file from someone in our company or team and it has some issues. They noticed that the first field is missing a value for at least 1 row, and they are unsure of how many other rows are affected. How can we approach this problem with sed?

First things first, we could do a quick grep to see how many rows are affected. We can do this by using a ^ to signify the beginning of a line and use it with a comma to be more specific.

grep "^," spacecraft_journey_catalog.csv 

Great, we can see that at least 10 rows are affected and are missing values for field 1 of the CSV. Now, how can we use this information?

The first thing we can do is create a file with the specific records that are missing values for field one. We can do this in 2 ways: grep input pipe or using the CSV as the input file. No matter the method we choose, the sed substitution command itself is the same for both. We will need to substitute the beginning of the line and comma (^,) with something like Missing Summary, and output only those specific lines using the p and n options. Then we can write the output to a file called missing_items.csv and then run cat missing_items.csv to visualize the newly created file.

grep "^," spacecraft_journey_catalog.csv | sed -n 's/^,/Missing Summary,/p' > missing_items.csv
cat missing_items.csv 
sed -n 's/^,/Missing Summary,/p' spacecraft_journey_catalog.csv > missing_items.csv
cat missing_items.csv 

Now that we have a file of the “corrupted” data, we could potentially also create a new CSV file that substitutes the rows with Missing Summary as a stopgap. Again, we could do this using the grep pipe, but we will just the input CSV moving forward. In the sed command, we do the same thing we did to substitute the empty values, but we do not want to isolate them hence no p or n options. We then take the entire output and write it into a new CSV called updated_items.csv. We then run a grep to confirm that the new CSV has used Missing Summary as a stopgap for the rows missing values for field one.

sed 's/^,/Missing Summary,/' spacecraft_journey_catalog.csv > updated_items.csv
grep "Missing Summary" updated_items.csv 

Another thing we could do with this initial CSV is to create a new CSV that lacks the “corrupted” rows. We will utilize the deletion commands we introduced in the first part of the walkthrough. We use the same ^, pattern to act on; however, instead of substitution, we are doing deletion. If you remember, when doing it the way we are, we are telling sed to delete the entire line if it contains the pattern. We can then confirm that the deletions occurred by running a grep on the newly created CSV, which should return nothing.

sed '/^,/ d' spacecraft_journey_catalog.csv > removed_items.csv
grep "^," removed_items.csv

And that will wrap up our walkthrough on basic sed operations as well as a potential real-world scenario in which we can use a tool like sed to do some fast data engineering.

Resources

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!