Datasets in Python

There are useful Python packages that allow loading publicly available datasets with just a few lines of code. In this post, we will look at 5 packages that give instant access to a range of datasets. For each package, we will look at how to check out its list of available datasets and how to load an example dataset to a pandas dataframe.

0. Python setup

I assume the reader (👀 yes, you!) has access to and is familiar with Python including installing packages, defining functions and other basic tasks. If you are new to Python, this is a good place to get started.

I have used and tested the scripts in Python 3.7.1 in Jupyter Notebook. Let’s make sure you have the relevant packages

installed before we dive in:

◼️ ️_pydataset_: Dataset package,
◼️ ️_seaborn_: Data Visualisation package,
◼️ ️_sklearn: Machine Learning package,
◼️ ️_statsmodel: Statistical Model package and
◼️ ️_nltk:_ Natural Language Tool Kit package

For each package, we will inspect the shape, _head _and _tail _of an example dataset. To avoid repeating ourselves, let’s quickly make a function:

## Create a function to glimpse the data
def glimpse(df):
    print(f"{df.shape[0]} rows and {df.shape[1]} columns")
    display(df.head())
    display(df.tail())

Alright, we are ready to dive in! 🐳

1. PyDataset

The first package we are going look at is PyDataset. It’s easy to use and gives access to over 700 datasets. The package was inspired by ease of accessing datasets in R and aimed to bring that ease in Python. Let’s check out the list of datasets:

## Import package
from pydataset import data

## Check out datasets
data()

Image for post

#data-science #data #python #dataset

0. Python setup

1. PyDataset

towardsdatascience.com

Datasets in Python