Do you know which is the most popular data scientist meme? “Collecting Data is my Cardio”. Yes, that’s so true! Contrary to what beginners in data science often expect, data is seldom served to you in a clean and tidy tabular format.

In the real world, data needs to be gathered from multiple sources such as flat files, zip files, databases, APIs, websites, etc. And it is not just the sources that vary: the data can come in different formats and file structures, such as .csv, .tsv and JSON, and be separated by different delimiters. How do you make sense of this mess?


To analyze your data and get accurate results, you should first be aware of the different formats and second know how to import them into Python. In this blog, I am going to touch upon a few basic file structures that you might be familiar with and mention the common Python libraries used for importing data.

Mastering these file formats is critical for your success in the field of data. So where does everyone begin? Oh yes, the omnipresent CSV file.

Reading CSV files in Python

The most common flat file structure for storing data is the Comma Separated Values (CSV) file. It contains tabular data in plain text, with values separated by a comma (,) delimiter. To identify a file format, you can usually look at the file extension. For example, a file named “datafile” saved in CSV format will appear as “datafile.csv”.

The Pandas library is commonly used for reading data from a CSV file in Python. Its read_csv() function loads the data and stores it in a DataFrame.
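Here is a minimal sketch of what that looks like. The column names and values are made up for illustration, and the CSV text is held in memory with io.StringIO so the snippet runs without a file on disk. With a real file you would pass its path, such as "datafile.csv", to read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file such as "datafile.csv".
csv_text = "name,age,city\nAsha,31,Pune\nBen,27,Leeds\n"

# read_csv() accepts a file path or any file-like object.
# By default the first line is treated as the header row.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names taken from the header row
print(df.head())            # first few rows of the DataFrame
```

Calling head() right after loading is a quick way to check that the columns were split the way you expected.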

Reading other flat files in Python

Flat files are data files that contain records in a tabular row-column format without any numbered structure. CSV is the most common flat file format, but there are other flat file formats that hold data separated by a user-specified delimiter such as tabs, spaces, colons, semicolons, etc.

The Tab Separated Values (TSV) file is the second most common flat file structure. The same read_csv() function from the Pandas library is used for reading these files; you just need to pass the right separator to its ‘sep’ parameter.
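As a rough sketch, here is the same made-up data stored once with tabs and once with semicolons. The sample values and variable names are assumptions for illustration, and io.StringIO again stands in for real files.

```python
import io

import pandas as pd

# Hypothetical sample data: identical records, different delimiters.
tsv_text = "name\tage\nAsha\t31\nBen\t27\n"
semicolon_text = "name;age\nAsha;31\nBen;27\n"

# Tell read_csv() which character separates the values via `sep`.
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
semicolon_df = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(tsv_df)
print(semicolon_df)
```

If the separator does not match the file, Pandas will usually read each line into a single column, so a DataFrame with only one column is a good hint that ‘sep’ needs changing.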


Gather Your Data: The Common ones!