
Data wrangling with Shell Scripts
Zasti
May 1, 2021

Say you have been handed a dataset in .csv format, and the .csv files are a few GBs in size. This is a two-part problem: not only is the file too large to load into memory, but a text file of a couple of GBs usually means millions of rows of data, so any aggregation you run is going to take a lot of time. There are a couple of options available to you. One is to import the .csv file into a lightweight database like MySQL, run some queries for cleansing and analysis, and export the result back to .csv. But you will have to learn a query language, which is very different from your usual programming languages, and a lightweight RDBMS isn't exactly very fast.
Enter the Bash command language
We here at ZASTI use Linux as our primary operating system. The Unix shell comes with a command language which is very versatile; it was built for text manipulation. It is so good that if you are using Windows 10, you can enable the Windows Subsystem for Linux provided by Microsoft. Here is a link on how to install the Windows Subsystem for Linux.
Sample Data
Let's get some data samples from Kaggle to work with. I am using the PUBG Match Deaths and Statistics dataset.
Download it into a working directory and unzip the files into a directory named pubg-match-deaths.
Let’s switch into the folder and list its contents by executing the following command.
cd pubg-match-deaths && ls -la
There are four files inside the folder: two jpg files containing the maps of PUBG matches, which we don't need, and two zip files, one for the aggregate data and another for the death stats. Let's extract these two files into the current directory.
unzip aggregate.zip && unzip deaths.zip
If you notice, there are now 10 csv files in total, 5 for the aggregate data and 5 for the death statistics. We have to merge these files before we can start working on them.
Before we merge, let's make sure that the files are similar and are formatted correctly. We can use the head command with the -n option to see the first 5 lines.
head -n 5 agg_match_stats_0.csv
The first line printed is the header row and the following four lines are data rows. Let's quickly check whether all the files have the same columns.
Let's list the columns in the first file.
head -n 1 agg_match_stats_0.csv | tr ',' '\n'
The | or pipe operator allows us to pass the results of the first command as input to the second command on its right. You can chain as many | operators as you want.
In the above example we have passed the output of the head command to the tr command. tr stands for translate; it replaces characters in the input with other characters, in this case replacing every , with \n, the newline character. The result is a nice list of all the column names.
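Note that tr translates characters rather than whole strings, which also makes it handy for other quick clean-ups; for example (purely illustrative):
echo "player_name" | tr 'a-z' 'A-Z'   # upper-cases the input: PLAYER_NAME
echo "a,,b,,,c" | tr -s ','           # -s squeezes repeated commas down to one: a,b,c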
To get a count of the number of columns, we can use the word count command wc with the -l option to count the number of lines in the input.
head -n 1 agg_match_stats_0.csv | tr ',' '\n' | wc -l
This tells us that there are 15 columns in the data file.
To check whether all the files have the same column names in the same order, we can print the first line of each file.
head -n 1 agg*.csv
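Eyeballing five header lines works here, but if you would rather have the shell do the comparison, a small loop along these lines (just a sketch) checks every file's header against the first one:
expected=$(head -n 1 agg_match_stats_0.csv)
for f in agg_match_stats_*.csv; do
  [ "$(head -n 1 "$f")" = "$expected" ] || echo "header mismatch in $f"
done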
Merging the Files
Before we merge these files, let's get a count of the number of rows in each of them. This will help us know whether the files have merged correctly.
We can use the wc -l command again to count the number of lines in the files.
wc -l agg*.csv
This gives us the following counts
13849288 agg_match_stats_0.csv
13844276 agg_match_stats_1.csv
13841505 agg_match_stats_2.csv
13840681 agg_match_stats_3.csv
11993486 agg_match_stats_4.csv
67369236 total
Each of the files has one header row, but the final file should contain only one, so the merged file should end up with a little fewer than 67369236 rows once the extra header rows are dropped.
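If you want to double-check how many header rows are lurking in the files, grep can count exact full-line matches of the header string (just a quick sanity check):
header=$(head -n 1 agg_match_stats_0.csv)
grep -cxF "$header" agg_match_stats_*.csv   # -c counts matches per file, -x matches whole lines only, -F treats the pattern as a fixed string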
For merging one file into another, we can use the > operator to write the result of our commands to a file.
For example
cat agg_match_stats_0.csv > aggregate_match_stats_final.csv
will create a new file aggregate_match_stats_final.csv and write all the contents of agg_match_stats_0.csv into the new file.
But now, to add the contents of the second file, we run
cat agg_match_stats_1.csv > aggregate_match_stats_final.csv
We notice that the content of the aggregate_match_stats_final.csv file has been replaced with the content of agg_match_stats_1.csv, but we wanted to append the content, not replace it.
For this we should have used the >> operator in the commands for the second file onwards. For example
cat agg_match_stats_1.csv >> aggregate_match_stats_final.csv
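If the difference between > and >> is new to you, here is a tiny throwaway example (demo.txt is just a scratch file):
echo "first" > demo.txt    # > creates the file, or overwrites it if it already exists
echo "second" >> demo.txt  # >> appends to the existing file
cat demo.txt               # prints both lines
rm demo.txt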
Let's remove the merged file and try again.
rm aggregate_match_stats_final.csv
cat agg*.csv >> aggregate_match_stats_final.csv
This will merge all 5 aggregate files into aggregate_match_stats_final.csv.
To quickly verify, let's check the number of lines in the merged file.
wc -l aggregate_match_stats_final.csv
This gives us the result
67369236 aggregate_match_stats_final.csv
The extra header rows weren't removed.
What we want instead is the first row of the first file, for which we will use the head command, followed by the contents of all 5 files minus their first rows; for this we will use the tail command with the -n +2 option to skip the header row and the -q (quiet) option so that it doesn't print meta information such as the file names.
head -n 1 agg_match_stats_0.csv > aggregate_match_stats_final.csv && tail -n +2 -q agg_match_stats_*.csv >> aggregate_match_stats_final.csv
Again to quickly verify, we run
wc -l aggregate_match_stats_final.csv
This gives us the result
67369233 aggregate_match_stats_final.csv
This is the right count.
Now, let’s do the same for kill stats.
head -n 1 kill_match_stats_final_0.csv > kills_match_stats_final.csv && tail -n +2 -q kill_match_stats_final_*.csv >> kills_match_stats_final.csv
Finally we have the concatenated files for both the aggregate data and the kill stats. We can now do some basic data analysis.
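Since we have now performed the same header-plus-tails merge twice, it may be worth wrapping it in a small shell function. This is only a sketch; the name merge_csvs and its argument order are my own:
merge_csvs() {
  # Usage: merge_csvs output.csv input1.csv [input2.csv ...]
  local out="$1"; shift
  head -n 1 "$1" > "$out"       # header row from the first input file
  tail -n +2 -q "$@" >> "$out"  # data rows from every input file
}
merge_csvs aggregate_match_stats_final.csv agg_match_stats_*.csv
merge_csvs kills_match_stats_final.csv kill_match_stats_final_*.csv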
Data Analysis
While analysing data, we often work with a single column at a time. For example, we may want to find the max of a column. For this we will use the cut command with the -d (delimiter) option and the -f (fields) option to extract a single column.
cut -d, -f 1 aggregate_match_stats_final.csv | sort -nr | head -1
Here we have extracted the first column with -f 1 and sorted it numerically in reverse order using the -nr option of the sort command, so the first element printed is the max of that column. In this case it happens to be a date column, and we get the result
2018-01-10T12:02:16+0000
If you want to find the minimum of the column, just run the same command without the -nr flags.
cut -d, -f 1 aggregate_match_stats_final.csv | sort | head -1
Applying the commands we have learned so far in different combinations, we can do some quick analysis to find that for the player_kills column the smallest value is 0, which is quite expected, and the largest value is 62. Someone managed 62 kills in a single game.
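For reference, this is the kind of combination that gets us there, assuming player_kills is the 11th column (the grep step below relies on the same position); skipping the header row with tail -n +2 keeps the column name out of the sort.
tail -n +2 aggregate_match_stats_final.csv | cut -d, -f11 | sort -n | head -n 1   # smallest value: 0
tail -n +2 aggregate_match_stats_final.csv | cut -d, -f11 | sort -n | tail -n 1   # largest value: 62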
To find out who managed 62 kills in a single game, we would run
cut -d, -f11,12 aggregate_match_stats_final.csv | grep 62,
Here the grep command takes the pattern 62, and prints every row that contains it.
62,gogolnyg
62,KJ-Qun_326373092
62,l56624
There are 3 players who managed 62 kills in a single game.
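One small refinement: grep 62, would also match a 62, that happens to appear elsewhere in a row, so anchoring the pattern to the start of the line makes the intent explicit:
cut -d, -f11,12 aggregate_match_stats_final.csv | grep '^62,'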
I hope this article gave you an intro to using shell commands for data wrangling. If you are interested in mastering shell scripting, or just want to learn a little more about using it for data wrangling, here are some recommendations.
Tools of the trade we have seen so far
>> — operator to concatenate files
head — to get the first n rows of a file
tail — to get the last n rows of a file
| — operator to chain commands
wc — word count, can count lines with -l
tr — translate command
sort — sort the data
cut — filter columns
grep — filter rows
Some advanced commands for further lookup
sed — stream editor for insertion, deletion, search and replace of substrings
uniq — find unique values
paste — join 2 files (line by line)
split — split a file into pieces (less useful)
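As a quick taste of the advanced commands, sed can print or edit arbitrary slices of a file, and sort combined with uniq -c gives a frequency table of any column (the column number below is purely illustrative):
sed -n '1,3p' aggregate_match_stats_final.csv   # print only the first three lines
cut -d, -f3 aggregate_match_stats_final.csv | tail -n +2 | sort | uniq -c | sort -nr | head -n 5   # top 5 most frequent values in column 3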