Data wrangling with Shell Scripts
May 1, 2021
Imagine your .csv file is a few GBs in size. This is a two-part problem: not only is the file too large to load into memory, but a text file of a couple of GBs usually means millions of rows of data, so any aggregation function is going to take a long time to run. Now there are a couple of options available to you.
You could import the .csv file into a lightweight database like MySQL, run some queries for cleansing and analysis, and export the results back to .csv. But you will have to learn a query language, which is very different from your usual programming languages. Also, a lightweight RDBMS isn't exactly very fast.
Enter Bash command language
We here at ZASTI use Linux as our primary operating system. The Unix shell comes with a command language which is very versatile; it was built for text manipulation. It is so good that if you are using Windows 10, you can enable the Windows Subsystem for Linux provided by Microsoft. Here is a link on how to install the Windows Subsystem for Linux.
Let's get some data samples from Kaggle to work with. I am using the PUBG Match Deaths and Statistics dataset.
Download it into a working directory and unzip the files into a directory named pubg-match-deaths.
Let’s switch into the folder and list its contents by executing the following command.
cd pubg-match-deaths && ls -la
There are four files inside the folder: two jpg files which contain the maps of PUBG matches, which we don't need, and two zip files, one for the aggregate data and another for the death stats. Let's extract these two files into the current directory.
unzip aggregate.zip && unzip deaths.zip
If you look now, there are 10 csv files in total: 5 for the aggregate data and 5 for the death statistics. We have to merge these files before we can start working on them.
Before we merge, let's make sure that the files are similar and formatted correctly. We can use the head command with the -n option to see the first 5 lines.
head -n 5 agg_match_stats_0.csv
The first line printed is the header row and the following four lines are data rows. Let's quickly check whether all the files have the same columns or not.
Let's list the columns in the first file:
head -n 1 agg_match_stats_0.csv | tr ',' '\n'
The | or pipe operator allows us to pass the output of the command on its left as input to the command on its right. You can chain as many | operators as you want.
In the above example we have passed the output of the head command to the tr command. tr stands for translate: it finds characters in the input and replaces them with others, in this case finding every , and replacing it with \n, the newline character. The result is a nice list of all the column names.
To get a count of the number of columns, we can use the word count wc command with the -l option to count the number of lines in the input.
head -n 1 agg_match_stats_0.csv | tr ',' '\n' | wc -l
This tells us that there are 15 columns in the data file.
To check whether all the files have the same column names in the same order, we can print the first line of each file:
head -n 1 agg*.csv
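Listing the first line of each file works for a handful of files, but with more files it is easier to let the shell do the comparison. A minimal sketch, using two tiny sample files in a temporary directory rather than the real agg_match_stats_*.csv:

```shell
# Compare the header rows of many CSV files automatically.
# These sample files stand in for the real agg_match_stats_*.csv.
cd "$(mktemp -d)"
printf 'date,player_kills,player_name\n2017-11-26,2,alice\n' > agg_sample_0.csv
printf 'date,player_kills,player_name\n2017-11-27,5,bob\n'   > agg_sample_1.csv

# Print every file's first line, de-duplicate, and count distinct headers.
# A count of 1 means all files share exactly the same header row.
for f in agg_sample_*.csv; do head -n 1 "$f"; done | sort -u | wc -l
```

On the real data you would replace agg_sample_*.csv with agg_match_stats_*.csv.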
Merging the Files
Before we merge these files, let's get a count of the number of rows in each of them. This will help us verify that the files have merged correctly.
We can use the wc -l command again to count the number of lines in the files.
wc -l agg*.csv
This gives us the following counts:
Each of the files has one header row, but the final file should have only one. So the merged file should have 67369236 − 4 (the extra header rows) = 67369232 rows in total.
To write the result of our commands to a file, we can use the > operator.
cat agg_match_stats_0.csv > aggregate_match_stats_final.csv
This will create a new file aggregate_match_stats_final.csv and write all the contents of agg_match_stats_0.csv into it.
But now, to add the contents of the second file, suppose we run
cat agg_match_stats_1.csv > aggregate_match_stats_final.csv
We notice that the content of the aggregate_match_stats_final.csv file is replaced with the content of agg_match_stats_1.csv, but we want to append the content, not replace it. For this we should use the >> operator for the second file onwards. For example:
cat agg_match_stats_1.csv >> aggregate_match_stats_final.csv
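The difference between the two redirection operators is easy to confirm on a throwaway file; a quick sketch:

```shell
# > truncates the target file before writing; >> appends to it.
cd "$(mktemp -d)"
echo "first"  > out.txt     # creates out.txt with one line
echo "second" > out.txt     # > truncates: only "second" remains
cat out.txt                 # prints: second
echo "third" >> out.txt     # >> appends: out.txt now has two lines
wc -l < out.txt             # line count is now 2
```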
Let's remove the merged file and try again:
rm aggregate_match_stats_final.csv
cat agg_match_stats_*.csv > aggregate_match_stats_final.csv
Note that we use the pattern agg_match_stats_*.csv rather than agg*.csv here, because agg*.csv would also match our output file aggregate_match_stats_final.csv once it exists.
This will merge all 5 aggregate files into aggregate_match_stats_final.csv. To quickly verify, let's check the number of lines in the merged file.
wc -l aggregate_match_stats_final.csv
This gives us the result:
The 4 extra header rows weren't removed.
What we want is the first row (the header) of the first file, for which we will use the head command, followed by the contents of all 5 files except their first rows. For this we will use the tail command with the -n +2 option to start from line 2 (skipping the header row) and the -q (quiet) option so that it doesn't print meta information such as the file names.
head -n 1 agg_match_stats_0.csv > aggregate_match_stats_final.csv && tail -n +2 -q agg_match_stats_*.csv >> aggregate_match_stats_final.csv
Again to quickly verify, we run
wc -l aggregate_match_stats_final.csv
This gives us the result
This is the right count.
Now, let’s do the same for kill stats.
head -n 1 kill_match_stats_final_0.csv > kills_match_stats_final.csv && tail -n +2 -q kill_match_stats_final_*.csv >> kills_match_stats_final.csv
Finally, we have the concatenated files for both the aggregate data and the kill stats. We can now do some basic data analysis.
While analysing data, we often work with a single column at a time; for example, we may want to find the max of a column. For this we will use the cut command with the -d (delimiter) option and the -f (fields) option to extract a single column.
cut -d, -f 1 aggregate_match_stats_final.csv | sort -nr | head -1
Here we have extracted the first column with -f 1 and sorted it numerically in reverse order using the -nr option of the sort command, so head -1 gives us the max of that column. In this case it happens to be a date column, and we get the result
If you want to find the minimum of the column, you run the same command without the -r (reverse) flag. We also skip the header row with tail -n +2, otherwise the column name would sort ahead of the data:
tail -n +2 aggregate_match_stats_final.csv | cut -d, -f 1 | sort -n | head -1
Applying the commands we have learned so far in different combinations, we can do some quick analysis and find that for the player_kills column the smallest value is 0, which is quite expected, and the largest value is 62. Someone managed 62 kills in a single game!
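These min/max pipelines can be sketched end to end on a tiny sample file. In the real aggregate data player_kills appears to be column 11; in this sample it is column 2, and the header row is skipped first so the column name isn't sorted along with the values:

```shell
# Min and max of a numeric CSV column with tail, cut, sort and head.
cd "$(mktemp -d)"
printf 'date,player_kills\n2017-11-26,0\n2017-11-26,62\n2017-11-27,7\n' > sample.csv

# tail -n +2 drops the header; cut extracts column 2 (player_kills).
tail -n +2 sample.csv | cut -d, -f2 | sort -nr | head -1   # max: prints 62
tail -n +2 sample.csv | cut -d, -f2 | sort -n  | head -1   # min: prints 0
```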
To find out who managed 62 kills in a single game, we would run
cut -d, -f11,12 aggregate_match_stats_final.csv | grep '^62,'
Here cut extracts the player_kills and player_name columns, and the grep command with the pattern ^62, keeps only the rows that begin with exactly 62 followed by a comma (the ^ anchor ensures values like 162 don't match).
There are 3 players who managed 62 kills in a single game.
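Rather than counting the matching rows by eye, grep can count them for us with the -c option. A small sketch on made-up kills,name rows:

```shell
# grep -c prints the number of matching lines instead of the lines themselves.
cd "$(mktemp -d)"
printf '62,alice\n162,bob\n62,carol\n7,dave\n62,erin\n' > kills.txt

# ^ anchors the pattern to the start of the line, so 162 is not counted.
grep -c '^62,' kills.txt    # prints: 3
```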
I hope this article gave you an intro to using shell commands for data wrangling. If you are interested in mastering shell scripting, or just in learning a little bit more about using it for data wrangling, here are some recommendations.
Tools of the trade we have seen so far:
>> : operator to append output to a file (used here to concatenate files)
head : get the first n rows of a file
tail : get the last n rows of a file
| : operator to chain commands
wc : word count; can count lines with -l
tr : translate command
sort : sort the data
cut : filter columns
grep : filter rows
Some advanced commands for further lookup:
sed : stream editor for insertion, deletion, search and replace of substrings
uniq : find unique values
paste : join 2 files (line by line)
split : split a file into pieces (less useful)
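As a small taste of one of these, uniq -c (combined with sort) turns a column into a frequency table, which is handy for finding the most common value. A sketch on made-up match modes:

```shell
# Frequency count of values with sort and uniq -c.
cd "$(mktemp -d)"
printf 'squad\nsolo\nsquad\nduo\nsquad\n' > modes.txt

# uniq -c only collapses adjacent duplicates, so sort first;
# the trailing sort -nr puts the most frequent value on top.
sort modes.txt | uniq -c | sort -nr | head -1   # squad appears 3 times
```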