Have you ever dealt with a big-scary CSV file that has many columns that you don’t want and many records that slow down the process for you to filter and get the desired information?
This tutorial is about using two command-line programs that can solve these problems; csvkit and xsv. We will compare the two at the end and see how performant each and when we can use one and not the other in terms of speed especially if we’re processing a large CSV file.
Let’s first download recent coronavirus data across the United States from COVID Tracking Project which is a volunteer organization dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the US. Btw, The data is published under a Creative Commons CC BY 4.0 license.
Let’s do this by downloading the CSV file manually or using curl:
$ curl -LO https://covidtracking.com/data/download/all-states-history.csv
**-LO **is a combination of **-L **and -O
Let’s first print what column names we have for this all-states.history.csvfile:
$ csvcut -n all-states-history.csv
1: date
2: state
3: dataQualityGrade
4: death
5: deathConfirmed
6: deathIncrease
7: deathProbable
8: hospitalized
9: hospitalizedCumulative
10: hospitalizedCurrently
11: hospitalizedIncrease
12: inIcuCumulative
13: inIcuCurrently
14: negative
15: negativeIncrease
16: negativeTestsAntibody
17: negativeTestsPeopleAntibody
18: negativeTestsViral
19: onVentilatorCumulative
20: onVentilatorCurrently
21: pending
22: positive
23: positiveCasesViral
24: positiveIncrease
25: positiveScore
26: positiveTestsAntibody
27: positiveTestsAntigen
28: positiveTestsPeopleAntibody
29: positiveTestsPeopleAntigen
30: positiveTestsViral
31: recovered
32: totalTestEncountersViral
33: totalTestEncountersViralIncrease
34: totalTestResults
35: totalTestResultsIncrease
36: totalTestsAntibody
37: totalTestsAntigen
38: totalTestsPeopleAntibody
39: totalTestsPeopleAntigen
40: totalTestsPeopleViral
41: totalTestsPeopleViralIncrease
42: totalTestsViral
43: totalTestsViralIncrease
As you can see, using **csvcut **with the option **-n **can list all the headers we have with their associated order which can help us select some specific columns that we’re interested in.
#command-line #covid19 #bash #data #data-science