Data Cleaning in R for Data Science

A data scientist or analyst in the making needs to format and clean data before performing any kind of exploratory data analysis, because raw data has numerous problems that need fixing.

So when we say we are cleaning data into a tidy data set for later analysis, we are actually doing the following (among many other things):

1. Removing duplicate values

2. Removing null values

3. Changing column names to readable, understandable, formatted names

4. Removing commas from numeric values (e.g., 1,000,657 becomes 1000657)

5. Converting data types into their appropriate types for analysis
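
The five steps above can be sketched in base R on a toy data frame. This is a minimal illustration, not part of the course project: the data frame `raw`, its column names and its values are all made up for the example.

```r
# Toy data with the problems listed above (all names and values are hypothetical).
raw <- data.frame(
  a = c("a", "b", "b", "c"),
  b = c("1,000,657", "2,500", "2,500", NA),
  stringsAsFactors = FALSE
)
names(raw) <- c("Item ID", "TOTAL.amt")  # deliberately messy column names

# 1. Remove duplicate rows.
clean <- unique(raw)

# 2. Remove rows that contain missing values (NA).
clean <- clean[complete.cases(clean), , drop = FALSE]

# 3. Change column names to readable, consistently formatted names.
names(clean) <- c("item_id", "total_amount")

# 4. Remove commas from the numeric values stored as text.
clean$total_amount <- gsub(",", "", clean$total_amount, fixed = TRUE)

# 5. Convert the column to its appropriate type for analysis.
clean$total_amount <- as.numeric(clean$total_amount)

str(clean)
```

The order matters a little here: the commas have to go before `as.numeric()` is called, otherwise the conversion produces `NA` with a warning.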

This article is based on a brief course project I recently completed in my Data Science Specialization. It focused on retrieving raw data, combining it into one dataset and getting it ready for later analysis (not covered in this article). The language used is R, in RStudio.

The Experiment:

The data comes from an experiment published in the UCI Machine Learning Repository, in which a group of 30 volunteers (aged 19–48 years) performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a Samsung Galaxy S II smartphone. The data collected from the embedded accelerometer and gyroscope was divided into training and test data. More information regarding the experiment can be found at this link.

http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Step 1: Retrieving Data from URL

The first step is to obtain the data. To avoid the headache of manually downloading many files, it is common to fetch them with a small code snippet. Since the data came as a zipped folder, I used the following commands to get started.

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip", destfile = "files", method = "curl", mode = "wb")

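The download.file() function takes the URL as its first argument and saves the file on your local machine under the name you assign to destfile. Setting mode = "wb" writes the file in binary mode, which matters for zip archives on Windows.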

unzip("files")

The unzip() function extracts the zipped folder into the working directory.
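
When a script like this is rerun often, it can help to skip the download if the archive is already on disk. The helper below is a minimal sketch of that idea; the function name `download_if_missing` is hypothetical and not part of the original project.

```r
# Hypothetical helper: download `url` to `destfile` only when the file is not
# there yet. Returns TRUE if a download happened, FALSE if the file was reused.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)
  }
  download.file(url, destfile = destfile, mode = "wb")
  TRUE
}

# Example call (not run here because it needs network access);
# `data_url` would hold the zip URL used above:
# download_if_missing(data_url, "files")
# unzip("files")
```

The check is deliberately simple: it only tests that a file with that name exists, not that the earlier download finished successfully.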

Step 2: Reading the files into R

features <- read.table("UCI HAR Dataset/features.txt", col.names = c("serial", "Functions"))

activities <- read.table("UCI HAR Dataset/activity_labels.txt", col.names = c("serial", "Activity"))
x_test <- read.table("UCI HAR Dataset/test/X_test.txt", col.names = features$Functions)
y_test <- read.table("UCI HAR Dataset/test/y_test.txt", col.names = "serial")
subject_test <- read.table("UCI HAR Dataset/test/subject_test.txt", col.names = "subject")
subject_train <- read.table("UCI HAR Dataset/train/subject_train.txt", col.names = "subject")
x_train <- read.table("UCI HAR Dataset/train/X_train.txt", col.names = features$Functions)
y_train <- read.table("UCI HAR Dataset/train/y_train.txt", col.names = "serial")

Note: It might be difficult at first to understand what the data means and which column names to use, but after a while it starts to make sense. For example, it is important to note that the values in the x_test and x_train files belong to the columns listed in features.txt (hence I've linked them up using features$Functions).
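
The link between features.txt and the X files can be seen on a tiny inline example. This is a sketch with made-up values: the object names `features_demo` and `x_demo` and the three feature rows are assumptions for illustration, and read.table() is fed through its text argument instead of a file.

```r
# Three feature names in the same "serial name" layout as features.txt.
features_txt <- "1 tBodyAcc-mean()-X
2 tBodyAcc-mean()-Y
3 tBodyAcc-std()-X"

features_demo <- read.table(
  text = features_txt,
  col.names = c("serial", "Functions"),
  stringsAsFactors = FALSE
)

# Two rows of made-up measurements, one value per feature.
x_txt <- "0.28 -0.02 -0.99
0.27 -0.01 -0.98"

# Each column of the measurements is named after the matching feature row.
x_demo <- read.table(text = x_txt, col.names = features_demo$Functions)

names(x_demo)
```

Because read.table() checks names by default, characters such as "-", "(" and ")" are replaced with dots, so "tBodyAcc-mean()-X" becomes "tBodyAcc.mean...X" in the resulting data frame.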

Making sense of the Data:

Once I could actually look at the files, I found a mess: one .txt file holding hundreds of column names, others holding the row values, and one holding the activity labels. After spending hours trying to understand how the data fit together, I was able to picture it roughly as follows: the subject, X and y files describe the same observations and sit side by side as columns, while the test and train sets are two separate blocks of rows with the same columns.

This clearly implies two things:

1) I had to merge the training and test sets by row binding them

2) I had to merge the different attributes of the subjects by column binding them.

This is where step 3 comes into play.

Step 3: Merging the tables intelligently

First, I used the rbind() function to stack the test and training rows into one large dataset.

binded_x <- rbind(x_test, x_train)

binded_y <- rbind(y_test, y_train)
subject <- rbind(subject_test, subject_train)

Next, I used the cbind() function to attach the columns as well.

raw_data_combined <- cbind(subject, binded_x, binded_y)
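
The same two-stage merge can be tried on toy data to see what each bind does to the dimensions. This is a minimal sketch: every `*_demo` object and value below is made up, and the `stopifnot()` check is an addition of mine, not a step from the original project.

```r
# Toy stand-ins for the test and train pieces (hypothetical values).
x_test_demo  <- data.frame(f1 = c(0.1, 0.2), f2 = c(1.1, 1.2))
x_train_demo <- data.frame(f1 = c(0.3, 0.4, 0.5), f2 = c(1.3, 1.4, 1.5))
y_test_demo  <- data.frame(serial = c(1, 2))
y_train_demo <- data.frame(serial = c(3, 1, 2))
subject_test_demo  <- data.frame(subject = c(2, 2))
subject_train_demo <- data.frame(subject = c(1, 1, 3))

# Row binding stacks test on top of train; column names must match.
x_all       <- rbind(x_test_demo, x_train_demo)
y_all       <- rbind(y_test_demo, y_train_demo)
subject_all <- rbind(subject_test_demo, subject_train_demo)

# Column binding only makes sense if all pieces have the same number of rows
# and were stacked in the same order (test first, then train).
stopifnot(nrow(x_all) == nrow(y_all), nrow(x_all) == nrow(subject_all))

combined_demo <- cbind(subject_all, x_all, y_all)
dim(combined_demo)
```

Keeping the same order (test, then train) in all three rbind() calls is what keeps each subject and activity label aligned with its row of measurements.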
