Elton Bogan


Visualizing the Nothing

How to visualize the incompleteness of a dataset with Python

It’s hard to know what to do with null values in our data. Most of the time, it’s easier to drop them and carry on with what’s left.

But null values can carry meaning and are worth investigating. Taking some time to look at them closely can often bring a better understanding of how the data was collected and even reveal patterns in it.

Null values matrix

In this article, we’ll explore how to visualize all the NULLs in our datasets and get a look at what insights we can extract from doing so.

I’ll run the code in JupyterLab, and I’ll use Pandas, NumPy, Missingno, and Matplotlib for this example.

The dataset will be the California Jail Profile Survey, which contains monthly county-level data from 1995 to 2018.

import pandas as pd

f = 'data/california_jail_county_monthly_1995_2018.csv'
df = pd.read_csv(f)

After loading the dataset into Pandas, we can have a look at one of its convenient methods for dealing with nulls.

We can call .isnull() followed by .sum() to get the number of missing values in each column.
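As a minimal sketch of that call, here is the same idea on a tiny made-up DataFrame. The column names and values below are hypothetical stand-ins, not the real survey fields.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the jail survey data; names and values are made up.
sample = pd.DataFrame({
    "county": ["Alameda", "Alpine", "Amador", "Butte"],
    "avg_daily_population": [4100.0, np.nan, 85.0, np.nan],
    "felony_bookings": [310.0, 2.0, np.nan, 150.0],
})

# isnull() marks every missing cell as True;
# sum() then counts the True values column by column.
null_counts = sample.isnull().sum()
print(null_counts)
```

On the dataset loaded earlier, the equivalent would be df.isnull().sum().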


Null values count by column

That’s already useful, since it gives us an idea of which fields we can rely on. But there are better ways of visualizing this, so let’s try Missingno.

Missingno is a library for visualizing incompleteness in a dataset. It works on top of Matplotlib and Seaborn, and it’s effortless to use.

import missingno as msno

We’ll start with a simple bar chart. Instead of comparing that big list of numbers, we’ll compare rectangles by their size.

