Data preparation is the transformation of raw data into a form that is more appropriate for modeling.

It is a challenging topic to discuss as the data differs in form, type, and structure from project to project.

Nevertheless, there are common data preparation tasks across projects. It is a huge field of study and goes by many names, such as “data cleaning,” “data wrangling,” “data preprocessing,” “feature engineering,” and more. Some of these are distinct data preparation tasks, and some of the terms are used to describe the entire data preparation process.

Even so, there are a number of books dedicated to the topic.

In this post, you will discover the top books on data cleaning, data preparation, feature engineering, and related topics.

Let’s get started.

Overview

The focus here is on data preparation for tabular data, i.e. data in the form of a table with rows and columns, as it looks in an Excel spreadsheet.

Data preparation is an important topic for all data types, although specialty methods are required for each, such as image data in computer vision, text data in natural language processing, and sequence data in time series forecasting.

Data preparation is often a chapter in a machine learning textbook, although there are books dedicated to the topic. We will focus on these books.

I have gathered all the books I can find on the topic of data preparation, selected what I think are the best or better books, and organized them into three groups:

  1. Data Cleaning
  2. Data Wrangling
  3. Feature Engineering

I will try to give the flavor of each book, including the goal, the table of contents, and where to learn more about it.

Data Cleaning

Data cleaning refers to identifying and fixing errors in the data prior to modeling, such as outliers and missing values, among others.
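To make the idea concrete, here is a minimal sketch of two common cleaning steps: filling missing values and flagging outliers. It is not taken from any of the books below. The function names, the sample `ages` column, the choice of median imputation, and the 1.5 × IQR rule are all assumptions made for illustration, and other choices may suit your data better.

```python
from statistics import median, quantiles

def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, k=1.5):
    """Flag values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the column
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# Hypothetical column with one missing value and one suspicious entry.
ages = [23, 25, None, 24, 26, 22, 95, 25]
filled = impute_missing(ages)
flags = flag_outliers(filled)

for value, is_outlier in zip(filled, flags):
    print(value, "outlier" if is_outlier else "ok")
```

Flagging rather than deleting the suspicious value is a deliberate choice here: whether an extreme value is an error or a genuine observation usually needs a closer look at the data.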

The top books on data cleaning include:

Let’s take a closer look at each in turn.

“Bad Data Handbook”

The book “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” was edited by Q. Ethan McCallum and was published in 2012.

Bad data is described not only as corrupt data but also as any data that impairs the modeling process.

It’s tough to nail down a precise definition of “Bad Data.” Some people consider it a purely hands-on, technical phenomenon: missing values, malformed records, and cranky file formats. Sure, that’s part of the picture, but Bad Data is so much more. […] Bad Data is data that gets in the way.

— Page 1, “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work,” 2012.

It is a collection of essays by 19 machine learning practitioners and is full of useful nuggets on data preparation and management.

The complete table of contents for the book is listed below.

  • Chapter 01: Setting the Pace: What Is Bad Data?
  • Chapter 02: Is It Just Me, or Does This Data Smell Funny?
  • Chapter 03: Data Intended for Human Consumption, Not Machine Consumption
  • Chapter 04: Bad Data Lurking in Plain Text
  • Chapter 05: (Re)Organizing the Web’s Data
  • Chapter 06: Detecting Liars and the Confused in Contradictory Online Reviews
  • Chapter 07: Will the Bad Data Please Stand Up?
  • Chapter 08: Blood, Sweat, and Urine
  • Chapter 09: When Data and Reality Don’t Match
  • Chapter 10: Subtle Sources of Bias and Error
  • Chapter 11: Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
  • Chapter 12: When Databases Attack: A Guide for When to Stick to Files
  • Chapter 13: Crouching Table, Hidden Network
  • Chapter 14: Myths of Cloud Computing
  • Chapter 15: The Dark Side of Data Science
  • Chapter 16: How to Feed and Care for Your Machine-Learning Expert
  • Chapter 17: Data Traceability
  • Chapter 18: Social Media: Erasable Ink?
  • Chapter 19: Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough

I like this book a lot; it is full of valuable practical advice. I highly recommend it!

“Best Practices in Data Cleaning”

The book “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data” was written by Jason Osborne and was published in 2012.

This is a more general textbook on data preparation for quantitative research in the social sciences rather than for machine learning specifically. Nevertheless, it contains a ton of useful advice.

My goal in writing this book is to collect, in one place, a systematic overview of what I consider to be best practices in data cleaning—things I can demonstrate as making a difference in your data analyses. I seek to change the status quo, the current state of affairs in quantitative research in the social sciences (and beyond).

— Page 2, “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data,” 2012.

The complete table of contents for the book is listed below.

  • Chapter 01: Why Data Cleaning Is Important: Debunking the Myth of Robustness
  • Chapter 02: Power and Planning for Data Collection: Debunking the Myth of Adequate Power
  • Chapter 03: Being True to the Target Population: Debunking the Myth of Representativeness
  • Chapter 04: Using Large Data Sets With Probability Sampling Frameworks: Debunking the Myth of Equality
  • Chapter 05: Screening Your Data for Potential Problems: Debunking the Myth of Perfect Data
  • Chapter 06: Dealing With Missing or Incomplete Data: Debunking the Myth of Emptiness
  • Chapter 07: Extreme and Influential Data Points: Debunking the Myth of Equality
  • Chapter 08: Improving the Normality of Variables Through Box-Cox Transformation: Debunking the Myth of Distributional Irrelevance
  • Chapter 09: Does Reliability Matter? Debunking the Myth of Perfect Measurement
  • Chapter 10: Random Responding, Motivated Misresponding, and Response Sets: Debunking the Myth of the Motivated Participant
  • Chapter 11: Why Dichotomizing Continuous Variables Is Rarely a Good Practice: Debunking the Myth of Categorization
  • Chapter 12: The Special Challenge of Cleaning Repeated Measures Data: Lots of Pits in Which to Fall
  • Chapter 13: Now That the Myths Are Debunked…: Visions of Rational Quantitative Methodology for the 21st Century

I think this is a great reference guide for general data preparation techniques, with perhaps better coverage than most “machine learning” focused books, given its stronger statistical focus.
