8 Top Books on Data Cleaning and Feature Engineering

Data preparation is the transformation of raw data into a form that is more appropriate for modeling.

It is a challenging topic to discuss as the data differs in form, type, and structure from project to project.

Nevertheless, there are common data preparation tasks across projects. It is a huge field of study and goes by many names, such as “data cleaning,” “data wrangling,” “data preprocessing,” “feature engineering,” and more. Some of these are distinct data preparation tasks, and some of the terms are used to describe the entire data preparation process.

Even though it is a challenging topic to discuss, there are a number of books on the topic.

In this post, you will discover the top books on data cleaning, data preparation, feature engineering, and related topics.

Let’s get started.


Overview

The focus here is on data preparation for tabular data, that is, data arranged in a table of rows and columns, as it would appear in an Excel spreadsheet.

Data preparation is an important topic for all data types, although specialty methods are required for each, such as image data in computer vision, text data in natural language processing, and sequence data in time series forecasting.

Data preparation is often a chapter in a machine learning textbook, although there are books dedicated to the topic. We will focus on these books.

I have gathered all the books I could find on the topic of data preparation, selected the ones I think are the best, and organized them into three groups:

Data Cleaning
Data Wrangling
Feature Engineering

I will try to give the flavor of each book, including the goal, the table of contents, and where to learn more about it.


Data Cleaning

Data cleaning refers to identifying and fixing errors in the data prior to modeling, such as outliers and missing values, among other problems.
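To make the idea concrete, here is a minimal sketch of two common cleaning steps on a single column: filling in missing values and flagging outliers. The data is made up for illustration, and the choice of median imputation and the 1.5 x IQR rule is an assumption on my part, not a method taken from any of the books below.

```python
from statistics import median, quantiles

# Hypothetical column of ages: None marks a missing value,
# and 410 is a likely data-entry error.
ages = [34, 29, None, 41, 38, 410, 27, None, 45, 31]

# Step 1: impute missing values with the median of the observed values.
# The median is used because it is less sensitive to outliers than the mean.
observed = [a for a in ages if a is not None]
fill_value = median(observed)
imputed = [fill_value if a is None else a for a in ages]

# Step 2: flag outliers using the interquartile range (IQR) rule.
q1, _, q3 = quantiles(imputed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in imputed if a < lower or a > upper]

print("fill value:", fill_value)
print("outliers:", outliers)
```

On this toy column, the sketch fills the two gaps with the median and flags only the value 410. Whether a flagged value should be removed, corrected, or kept is a judgment call that depends on the project.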

The top books on data cleaning include:

Let’s take a closer look at each in turn.

“Bad Data Handbook”

The book “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” was edited by Q. Ethan McCallum and was published in 2012.

Bad data is described not only as corrupt data but as any data that impairs the modeling process.

It’s tough to nail down a precise definition of “Bad Data.” Some people consider it a purely hands-on, technical phenomenon: missing values, malformed records, and cranky file formats. Sure, that’s part of the picture, but Bad Data is so much more. […] Bad Data is data that gets in the way.

— Page 1, “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work,” 2012.

It is a collection of essays by 19 machine learning practitioners and is full of useful nuggets on data preparation and management.


The complete table of contents for the book is listed below.

Chapter 01: Setting the Pace: What Is Bad Data?
Chapter 02: Is It Just Me, or Does This Data Smell Funny?
Chapter 03: Data Intended for Human Consumption, Not Machine Consumption
Chapter 04: Bad Data Lurking in Plain Text
Chapter 05: (Re)Organizing the Web’s Data
Chapter 06: Detecting Liars and the Confused in Contradictory Online Reviews
Chapter 07: Will the Bad Data Please Stand Up?
Chapter 08: Blood, Sweat, and Urine
Chapter 09: When Data and Reality Don’t Match
Chapter 10: Subtle Sources of Bias and Error
Chapter 11: Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
Chapter 12: When Databases Attack: A Guide for When to Stick to Files
Chapter 13: Crouching Table, Hidden Network
Chapter 14: Myths of Cloud Computing
Chapter 15: The Dark Side of Data Science
Chapter 16: How to Feed and Care for Your Machine-Learning Expert
Chapter 17: Data Traceability
Chapter 18: Social Media: Erasable Ink?
Chapter 19: Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough

I like this book a lot; it is full of valuable practical advice. I highly recommend it!

Learn More:

Bad Data Handbook, on Amazon.

“Best Practices in Data Cleaning”

The book “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data” was written by Jason Osborne and was published in 2012.

This is a more general textbook on data preparation, aimed at quantitative research in the social sciences rather than machine learning specifically. Nevertheless, it contains a lot of useful advice.

My goal in writing this book is to collect, in one place, a systematic overview of what I consider to be best practices in data cleaning—things I can demonstrate as making a difference in your data analyses. I seek to change the status quo, the current state of affairs in quantitative research in the social sciences (and beyond).

— Page 2, “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data,” 2012.
