Have you ever dealt with a big-scary CSV file that has many columns that you don’t want and many records that slow down the process for you to filter and get the desired information?

This tutorial is about using two command-line programs that can solve these problems; csvkit and xsv. We will compare the two at the end and see how performant each and when we can use one and not the other in terms of speed especially if we’re processing a large CSV file.

Downloading COVID data from covidtracking

Let’s first download recent coronavirus data across the United States from COVID Tracking Project which is a volunteer organization dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the US. Btw, The data is published under a Creative Commons CC BY 4.0 license.

Let’s do this by downloading the CSV file manually or using curl:

$ curl -LO https://covidtracking.com/data/download/all-states-history.csv

**-LO **is a combination of **-L **and -O

  • **-L **is used to make sure if the URL has changed to another location, curl will redo the request on the new redirection link
  • **-O **this option is used to create an output file of the same name of the requested file name which is **all-states-history.csv **here

Printing the CSV file headers

Let’s first print what column names we have for this all-states.history.csvfile:

$ csvcut -n all-states-history.csv 
  1: date
  2: state
  3: dataQualityGrade
  4: death
  5: deathConfirmed
  6: deathIncrease
  7: deathProbable
  8: hospitalized
  9: hospitalizedCumulative
 10: hospitalizedCurrently
 11: hospitalizedIncrease
 12: inIcuCumulative
 13: inIcuCurrently
 14: negative
 15: negativeIncrease
 16: negativeTestsAntibody
 17: negativeTestsPeopleAntibody
 18: negativeTestsViral
 19: onVentilatorCumulative
 20: onVentilatorCurrently
 21: pending
 22: positive
 23: positiveCasesViral
 24: positiveIncrease
 25: positiveScore
 26: positiveTestsAntibody
 27: positiveTestsAntigen
 28: positiveTestsPeopleAntibody
 29: positiveTestsPeopleAntigen
 30: positiveTestsViral
 31: recovered
 32: totalTestEncountersViral
 33: totalTestEncountersViralIncrease
 34: totalTestResults
 35: totalTestResultsIncrease
 36: totalTestsAntibody
 37: totalTestsAntigen
 38: totalTestsPeopleAntibody
 39: totalTestsPeopleAntigen
 40: totalTestsPeopleViral
 41: totalTestsPeopleViralIncrease
 42: totalTestsViral
 43: totalTestsViralIncrease

As you can see, using **csvcut **with the option **-n **can list all the headers we have with their associated order which can help us select some specific columns that we’re interested in.

#command-line #covid19 #bash #data #data-science

How to Clean CSV Data at the Command Line
5.80 GEEK