Data wrangling with Shell Scripts

Zasti

May 1, 2021

At ZASTI our day to day work revolves around handling, cleansing and analysing large data files usually in .csv format.
Most of the times these .csv file are a few GBs in size. This is a two part problem, not only is the file really large to load in memory, a text file which is a couple of GBs usually means millions of rows of data. When you run aggregation functions its going to take a lot of time. Now there are a couple of options available to you.
The standard method is to load the file into a jupyter notebook using pandas. Assuming that you are using a standard laptop/desktop <16 GB of RAM with a Core i3 maybe a Core i5 processor, this is going to take sometime to finish.[/et_pb_text][/et_pb_column][/et_pb_row][et_pb_row _builder_version="4.4.3" width="60%" max_width="1080px" custom_margin="||||false|false"][et_pb_column type="4_4" _builder_version="4.4.3"][et_pb_text _builder_version="4.4.3" text_font_size="17px" text_line_height="1.8em"]Another option is to import the .csv file into a lightweight database like MySQL. Run some queries for cleansing and analysis. Export the file back to .csv You will have to learn query language which is very different from your usual programming languages. Also a lightweight RDBMS isn’t exactly very fast.
Is there a simpler and faster way to do some simple data wrangling?

Enter Bash command language

We here at ZASTI, use Linux as our primary operating system. The unix shell comes with a command language which is very versatile. It was built for text manipulation. It is so good that if you are using Windows 10, you can enable the Unix Subsystem for Windows provided by Microsoft. Here is a link on how to Install the Windows Subsystem for Linux.

Each of us use a flavour of linux that we are most comfortable with. Most linux systems by default comes with the bash shell and command language. In this article we are going show you how to use bash shell commands for some basic data wrangling. There are other shells as well like zsh, fish, etc. which you can try on your own.

Sample Data

Lets get some data samples from kaggle to work with. I am using the PUBG Match Deaths and Statistics.

Download into a working directory and unzip the files into a directory named pubg-match-deaths.

Let’s switch into the folder and list its contents by executing the following command.

cd pubg-match-deaths && ls -la

There are four files inside the folder. Two jpg files which has the maps of pubg matches which we don’t need and two other zip files, one for aggregate data and another one for the death stats. Let’s extract these two files into the current directory.

unzip aggregate.zip && unzip deaths.zip

If you notice now there are 10 files in total, 5 for the aggregate and 5 for death statistics. We have to merge these files before we can start working on them.

Before we merge, lets make sure that the files are similar and are formatted correctly. We can use the head command with the -n option to see the first 5 lines.

head -n 5 agg_match_stats_0.csv

The first line printed is the header row and following four lines are data rows. Let’s quickly check whether all files have same columns or not.

Lets list the columns in the first file

head -n 1 agg_match_stats_0.csv | tr ‘\,’ ‘\n’

The | or pipe operator allows us to pass the results of the first command as input to the second command on its right side. You can chain as many | operators as you want.

In the above example we have passed the output of the head command to the tr command. tr stands for translate, it allows us to find a string in the input and replace with another, in this case find all , and replace with \n which is the newline character. The result is a nice list of all the column names.

To get a count of the number of columns, we can use the word count wc command with the -l option to count the number of lines in the input.

head -n 1 agg_match_stats_0.csv | tr ‘\,’ ‘\n’ | wc -l

This tells us that there are 15 columns in the data file.

To check if all the files have column names and if the columns are in the same order.

head -n 1 agg*.csv
This will list the first row of the aggregate files one under the other. A quick glance tells us that all the aggregate files contain data headers, the same columns in the same order.

Merging the Files

Before we merge these files, lets get a count of the number of rows in each of these. This will help us to know if the files have merged correctly.

We can use the wc -l command again to count the number of lines in the files.

wc -l agg*.csv
This will list the first row of the aggregate files one under the other. A quick glance tells us that all the aggregate files contain data headers, the same columns in the same order.

This gives us the following counts


13849288 agg_match_stats_0.csv
13844276 agg_match_stats_1.csv
13841505 agg_match_stats_2.csv
13840681 agg_match_stats_3.csv
11993486 agg_match_stats_4.csv
67369236 total

Each of the files have one header row, the final file should have only one. So the merged file should have 67369236–3 (the extra header rows) in Total 67369236 rows.

For merging one file into another we can use the > operator to write the result of our commands to a file.

For example

cat agg_match_stats_0.csv >> aggregate_match_stats_final.csv

will create a new file aggregate_match_stats_final.csv and write all contents of agg_match_stats_0.csv in to the new file

but now to add the contents of the second file, we run

cat agg_match_stats_1.csv > aggregate_match_stats_final.csv

We notice that the content of the aggregate_match_stats_final.csv file is replaced with the content of agg_match_stats_1.csv but we want to append the content and not replace.

For this we should have used the >> operator in the commands for the second file and later. For example

cat agg_match_stats_1.csv >> aggregate_match_stats_final.csv

Lets remove the merged file and try and again

rm aggregate_match_stats_final.csv
cat agg*.csv >> aggregate_match_stats_final.csv

This will merge all 5 aggregate files into aggregate_match_stats_final.csv To quickly verify, lets check the number of lines in the merged file.

wc -l aggregate_match_stats_final.csv

This gives us the result

67369236 aggregate_match_stats_final.csv

The 3 extra header rows weren’t removed.

What we want to do is add the first row of the first file for which we will use the head command, and all contents of all 5 files except for the first row, for this we will use the tail command with the -n +2 option to skip the header row and -q the quiet option so that it doesn’t print meta information such as the file name, etc.

head -n 1 agg_match_stats_0.csv > aggregate_match_stats_final.csv && tail -n +2 -q agg_match_stats_*.csv >> aggregate_match_stats_final.csv

Again to quickly verify, we run

wc -l aggregate_match_stats_final.csv

This gives us the result

67369233 aggregate_match_stats_final.csv

This is the right count.

Now, let’s do the same for kill stats.

head -n 1 kill_match_stats_final_0.csv > kills_match_stats_final.csv && tail -n +2 -q kill_match_stats_final_*.csv >> kills_match_stats_final.csv

Finally we have the concatenated file for both aggregated data and kills stats. We can now do some basic Data analysis.

Data Analysis

While analysing data, we work with a single column at a time. For example we may want to find the max of a column. For this we will use the cut command with the -d delimiter option and -f fields option to extract a single column.

cut -d, -f 1 aggregate_match_stats_final.csv | sort -nr | head -1

Here we have extracted the first column -f 1, sorted it numerically in reverse order using -nr option in the sort command. This command gives the first element as the max of that column. In this case it happens to be a date column, and we get the result

2018–01–10T12:02:16+0000

If you want to find the minimum of the column, you will just run the same command without the -nr flags

cut -d, -f 1 aggregate_match_stats_final.csv | sort | head -1

Applying the commands we have learned so far in different combinations, we can do some quick analysis to find that for the column player_kills the smallest value is 0 which is quite expected and the largest value is 62. Someone managed 62 kills in a single game.

To find out who managed 62 kills in a single game, we would run

cut -d, -f11,12 aggregate_match_stats_final.csv | grep 62,

the grep command followed by the text 62, finds all the rows which have the exact matched string 62,

62,gogolnyg
62,KJ-Qun_326373092
62,l56624

There are 3 players who managed 62 kills in a single game.

I hope this article gave you an intro to using shell commands for data wrangling. If you are interested in mastering shell scripting or just learn a little bit more on using it for data wrangling. Here are some recommendations

Tools of the trade we have seen so far

>>— operator to concatenate files

head — to get the first n rows of a file

tail — to get the last n rows of a file

| — operator to chain commands

wc — word count, can count lines with -l

tr — translate command

sort — sort the data

cut — filter columns

grep — filter rows

Some advanced commands for further lookup

sed — stream editor for insertion, deletion, search and replace substrings

uniq — find unique values

paste — join 2 files (line by line)

split — split a file into pieces (less useful)

Further reading