Everybody loves a good data visualization. Still, a chart shouldn’t leave the interpretation to the viewer, as is the case with histograms. Today we’ll see how binning bias can mislead your analysis and how to prevent it with ECDF plots.

The article answers the following questions:

  • What’s wrong with histograms — and when should you avoid them
  • How to replace histograms with ECDFs — a more robust method for examining data distributions
  • How to use and interpret multiple ECDFs in a single chart — to compare distributions among different data segments

Without further ado, let’s get started!

What’s wrong with histograms?

In the words of Justin Bois from DataCamp: binning bias. I couldn’t agree more. Binning bias means that the same data can look differently distributed depending on the bin size chosen for the histogram. Don’t take my word for it; the example below speaks for itself.

To start, we’ll import a couple of libraries for data analysis and visualization, and load the Titanic dataset straight from the web:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
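Before plotting anything, the effect of bin size can be checked with plain numbers. The snippet below is a minimal sketch, not part of the original analysis: the helper histogram_counts and the ages list are made up for illustration (the ages are not taken from the Titanic data), and the binning is written in plain Python so the step a histogram performs is visible. It assumes equal-width bins between the smallest and largest value, with the largest value counted in the last bin.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin (its right edge is inclusive).
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical ages, made up for illustration only.
ages = [2, 4, 18, 19, 21, 22, 24, 25, 28, 30,
        31, 35, 36, 40, 45, 54, 58, 62, 71, 80]

# Same data, three different bin counts.
for n_bins in (3, 6, 12):
    print(n_bins, histogram_counts(ages, n_bins))
```

Every row of the output describes the same 20 values, yet the bar heights, and with them the apparent shape of the distribution, change with the number of bins. That is the binning bias in question.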


Step-By-Step Guide to ECDFs — A Robust Histogram Replacement