My journey into the vast world of data has been a fun and enthralling ride. I have been glued to my courses, waiting to finish one so I can proceed to the next. After completing introductory courses, I made my way over to data cleaning. It is no secret that most of the effort in any data science project goes into cleaning the data set and tidying it up for analysis. Therefore, it is crucial to have substantial knowledge about this topic.
Firstly, to understand the need for clean data, we need to look at the workflow for a typical data science project. Data is first accessed, followed by manipulation and analysis of the data. Afterward, insights are extracted, and finally, visualized and reported.
Typical Project Workflow
Errors in the data, if left in place, propagate through the entire workflow. The insights used to make critical business decisions then end up being incorrect, which may lead to monetary and business losses. If untidy data is not tackled and corrected in the first step, the compounding effect can be immense.
This guide will serve as a quick onboarding tool for data cleaning by compiling all the necessary functions and actions that should be taken. I will briefly describe three types of common data errors and then explain how these can be identified in data sets and corrected. I will also be introducing some powerful cleaning and manipulation libraries including dplyr, stringr, and assertive. These can be installed by simply writing the following code in RStudio:
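A minimal sketch of that installation step, assuming all three packages are available from your CRAN mirror (availability can change over time, so treat the `assertive` line in particular as an assumption):

```r
# Install the packages once per machine (requires an internet connection).
install.packages(c("dplyr", "stringr", "assertive"))

# Load them at the start of each R session.
library(dplyr)
library(stringr)
library(assertive)
```

`install.packages()` only needs to be run once, while the `library()` calls are needed in every new session.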
When data is imported, R may misinterpret a column's type, or the column may have been wrongly labeled during extraction. A common example is a column of numbers that is read in as a character type.
Firstly, to identify incorrect data type errors, the glimpse function is used to check the data types of all columns. The glimpse function is part of the *dplyr* package, which needs to be installed and loaded before glimpse can be used. Glimpse returns every column along with its data type and its first few values.
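As a rough illustration, here is `glimpse()` applied to a small made-up data frame. The `sales` data frame and its column names are hypothetical and only stand in for an imported data set:

```r
library(dplyr)

# Hypothetical data: revenue holds numbers but was imported as text.
sales <- data.frame(
  order_id = 1:3,
  revenue  = c("100.5", "200", "350.75"),
  stringsAsFactors = FALSE
)

# Prints one line per column with its type and first values.
glimpse(sales)
```

In the printed summary, `revenue` should be listed as `<chr>` rather than a numeric type, which is the signal that the column needs fixing.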
Another form of logical check is the family of is functions. There is an is function for most data types, such as is.numeric, is.character, is.logical, and is.factor, and each returns a logical output (TRUE/FALSE). If a numeric column is passed to is.numeric, the output is TRUE, while if a character column is passed to is.numeric, the output is FALSE.
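A short sketch of these checks using base R only. The `sales` data frame is again hypothetical:

```r
# Hypothetical data: revenue holds numbers but was imported as text.
sales <- data.frame(
  order_id = 1:3,
  revenue  = c("100.5", "200", "350.75"),
  stringsAsFactors = FALSE
)

id_is_numeric      <- is.numeric(sales$order_id)   # TRUE: integers count as numeric
revenue_is_numeric <- is.numeric(sales$revenue)    # FALSE: stored as character
revenue_is_char    <- is.character(sales$revenue)  # TRUE

print(c(id_is_numeric, revenue_is_numeric, revenue_is_char))

# Check every column at once by listing each column's class.
print(sapply(sales, class))
```

Checking all columns with `sapply()` is handy for wider data sets, where testing one column at a time becomes tedious.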
After all the columns with an incorrect data type have been identified, they can be converted to the correct type with the matching as function. For example, if a numeric column has been imported as a character type, the as.numeric function will convert it to a numeric type.
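The conversion can be sketched as follows, once more on a hypothetical `sales` data frame whose `revenue` column was read in as text:

```r
# Hypothetical data: revenue holds numbers but was imported as text.
sales <- data.frame(
  order_id = 1:3,
  revenue  = c("100.5", "200", "350.75"),
  stringsAsFactors = FALSE
)

# Convert the character column to numeric in place.
sales$revenue <- as.numeric(sales$revenue)

# Confirm the conversion worked and that arithmetic is now possible.
print(is.numeric(sales$revenue))
print(mean(sales$revenue))
```

Note that `as.numeric()` returns `NA` (with a warning) for any string it cannot parse as a number, so it is worth checking for new missing values after converting.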