Data preparation is the transformation of raw data into a form that is more appropriate for modeling.
It is a challenging topic to discuss as the data differs in form, type, and structure from project to project.
Nevertheless, there are common data preparation tasks across projects. Data preparation is a huge field of study and goes by many names, such as “data cleaning,” “data wrangling,” “data preprocessing,” “feature engineering,” and more. Some of these are distinct data preparation tasks, and some of the terms are used to describe the entire data preparation process.
Even though it is a challenging topic to discuss, there are a number of books on the topic.
In this post, you will discover the top books on data cleaning, data preparation, feature engineering, and related topics.
Let’s get started.
The focus here is on data preparation for tabular data, that is, data in the form of a table with rows and columns, as it would look in an Excel spreadsheet.
Data preparation is an important topic for all data types, although specialty methods are required for each, such as image data in computer vision, text data in natural language processing, and sequence data in time series forecasting.
Data preparation is often a chapter in a machine learning textbook, although there are books dedicated to the topic. We will focus on these books.
I have gathered all the books I can find on the topic of data preparation, selected what I think are the best or better books, and organized them into three groups; they are:
I will try to give the flavor of each book, including the goal, the table of contents, and where to learn more about it.
Data cleaning refers to identifying and fixing errors in the data prior to modeling, including, but not limited to, outliers, missing values, and much more.
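To make the definition concrete, here is a minimal sketch of two common cleaning steps: filling in missing values and flagging outliers. The sample readings, the choice of median imputation, and the interquartile range (IQR) rule are illustrative assumptions on my part, not methods taken from any of the books below.

```python
import statistics

# Hypothetical column of readings; None marks a missing value.
raw = [10.2, 9.8, None, 10.5, 55.0, 10.1, None, 9.9]

# Step 1: impute missing values with the median of the observed values.
observed = [x for x in raw if x is not None]
median = statistics.median(observed)
imputed = [median if x is None else x for x in raw]

# Step 2: flag outliers with the IQR rule (1.5 * IQR beyond the quartiles).
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in imputed if x < lower or x > upper]
cleaned = [x for x in imputed if lower <= x <= upper]

print("outliers:", outliers)
print("rows kept:", len(cleaned))
```

Whether a flagged value is a genuine error or a rare but valid observation is a judgment call that depends on the project, so in practice you would inspect the flagged rows before dropping them.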
The top books on data cleaning include:
Let’s take a closer look at each in turn.
The book “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” was edited by Q. Ethan McCallum and was published in 2012.
Bad data is described not only as corrupt data but any data that impairs the modeling process.
It’s tough to nail down a precise definition of “Bad Data.” Some people consider it a purely hands-on, technical phenomenon: missing values, malformed records, and cranky file formats. Sure, that’s part of the picture, but Bad Data is so much more. […] Bad Data is data that gets in the way.
— Page 1, “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work,” 2012.
It is a collection of essays by 19 machine learning practitioners and is full of useful nuggets on data preparation and management.
The complete table of contents for the book is listed below.
I like this book a lot; it is full of valuable practical advice. I highly recommend it!
Learn More:
The book “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data” was written by Jason Osborne and was published in 2012.
This is a more general textbook on data preparation for computational-based social sciences rather than machine learning specifically. Nevertheless, it contains a ton of useful advice.
My goal in writing this book is to collect, in one place, a systematic overview of what I consider to be best practices in data cleaning—things I can demonstrate as making a difference in your data analyses. I seek to change the status quo, the current state of affairs in quantitative research in the social sciences (and beyond).
— Page 2, “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data,” 2012.
The complete table of contents for the book is listed below.
I think this is a great reference guide for general data preparation techniques, perhaps with better coverage than most “machine learning” focused books, given its stronger statistical focus.