A comprehensive guide to using built-in Pandas functions to clean data prior to analysis. Today we will be using Python and Pandas to explore a number of built-in functions that can be used to clean a dataset.
Over time companies produce and collect a massive amount of data, depending on the company this can come in many different forms such as user-generated content, job applicant data, blog posts, sensor data and payroll transactions. Due to the immense number of source systems that can generate data and the number of people that contribute to data generation we can never guarantee that the data we are receiving is a clean record. These records may be incomplete due to missing attributes, they may have an incorrect spelling for user-entered text fields or they may have an incorrect value such as a date of birth in the future.
As a data scientist, it's important that these data quality issues are recognised early during our exploration phase and cleansed prior to any analysis. By allowing uncleaned data through our analysis tools we run the risk of incorrectly representing companies or users data by delivering poor quality findings based on incorrect data. Today we will be using Python and Pandas to explore a number of built-in functions that can be used to clean a dataset.
Learn to group the data and summarize in several different ways, to use aggregate functions, data transformation, filter, map.
Master Applied Data Science with Python and get noticed by the top Hiring Companies with IgmGuru's Data Science with Python Certification Program. Enroll Now
Many a time, I have seen beginners in data science skip exploratory data analysis (EDA) and jump straight into building a hypothesis function or model. In my opinion, this should not be the case.
Python for Data Science, you will be working on an end-to-end case study to understand different stages in the data science life cycle. This will mostly deal with "data manipulation" with pandas and "data visualization" with seaborn. After this, an ML model will be built on the dataset to get predictions. You will learn about the basics of the sci-kit-learn library to implement the machine learning algorithm.
Python for Data Science, you will be working on an end-to-end case study to understand different stages in the data science life cycle. You will learn about the basics of the sci-kit-learn library to implement the machine learning algorithm.