A Beginner’s Guide To Cleaning Data In R — Part 1

A Beginner’s Guide To Cleaning Data In R — Part 1

My journey into the vast world of data has been a fun and enthralling ride. I have been glued to my courses, waiting to finish one so I can proceed to the next.

My journey into the vast world of data has been a fun and enthralling ride. I have been glued to my courses, waiting to finish one so I can proceed to the next. After completing introductory courses, I made my way over to data cleaning. It is no secret that most of the effort in any data science project goes into cleaning the data set and tidying it up for analysis. Therefore, it is crucial to have substantial knowledge about this topic.

Firstly, to understand the need for clean data, we need to look at the workflow for a typical data science project. Data is first accessed, followed by manipulation and analysis of the data. Afterward, insights are extracted, and finally, visualized and reported.

Accessing data, followed by analysis, insight generation and visualisation.

Typical Project Workflow

Errors and mistakes in data, if present, could end up generating errors throughout the entire workflow. Ultimately, the insights generated that are used to make critical business decisions are incorrect, which may lead to monetary and business losses. Thus, if untidy data is not tackled and corrected in the first step, the compounding effect can be immense.

This guide will serve as a quick onboarding tool for data cleaning by compiling all the necessary functions and actions that should be taken. I will briefly describe three types of common data errors and then explain how these can be identified in data sets and corrected. I will also be introducing some powerful cleaning and manipulation libraries including dplyr, stringr, and assertive. These can be installed by simply writing the following code in RStudio:

install.packages("tidyverse")
install.packages("assertive")

1. Incorrect Data Type

When data is imported, a possibility exists that RStudio incorrectly interprets a data column type, or the data column was wrongly labeled during extraction. For example, a common error is when numeric data containing numbers are improperly identified and labeled as a character type.

a) Identification

Firstly, to identify incorrect data type errors, the glimpse function is used to check the data types of all columns. The glimpse function is part of the *dplyr *package which needs to be installed before glimpse can be used. Glimpse will return all the columns with their respective data types.

library(dplyr)
glimpse(dataset)

Another form of logical checks includes the is function. The is function can be used for each data type and will return with a logical output (true/false). I have only mentioned the common is functions, but it can be used for all data types. If a numeric column is an argument for the is.numeric function, the output will be true, while if a character column is an argument for the is.numeric function the output will be false.

is.numeric(column_name)
is.character(column_name)

b) Correction

After all the incorrect data type columns have been identified, they can simply be converted to the correct data type by using the as functions. For example, if a numeric data type has been incorrectly imported as a character data type, the as.numeric function will convert it to numeric data type.

as.numeric(column_name)

data-analysis data-scientist data data-cleaning r data analysis

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Data Cleaning in R for Data Science

A data scientist/analyst in the making needs to format and clean data before being able to perform any kind of exploratory data analysis.

5 Data Structures to Master in R if you want to be a Data Scientist

5 Data Structures to Master in R if you want to be a Data Scientist: Learn how to master the basic data types, and advanced data structures, such as factors, lists, and data frames.

Exploratory Data Analysis is a significant part of Data Science

Data science is omnipresent to advanced statistical and machine learning methods. For whatever length of time that there is data to analyse, the need to investigate is obvious.

Unpivot a column of delimited data with R

Afer explaining how to unpivot columns of delimeted data in Power Query and Python, today I’m extending those explanations to R.

Intro to Data Engineering for Data Scientists

Intro to Data Engineering for Data Scientists: An overview of data infrastructure which is frequently asked during interviews