As a data scientist, you can expect to spend a large share of your time, often cited as 60-70%, cleaning and preparing data. The process of cleaning, encoding and transforming raw data into a format that a machine learning model can understand is called data pre-processing. It is often long and cumbersome, and many developers consider it their least favourite part of a project. Tedious as it is, it is one of the most important steps in any project. To simplify the process and make it a bit more interesting, the Python ecosystem offers a package called pyjanitor, a Python tool for data cleaning.

This article gives an overview of what pyjanitor is and how it works, and demonstrates using the package to clean dirty data.

What is pyjanitor?

pyjanitor was originally developed in R as the janitor package and was later ported to Python for convenience. It is an API written on top of the popular Python library pandas. Data pre-processing can be thought of as a directed acyclic graph: the starting node is the raw data, and each later node applies one technique to the result of the previous one until usable data comes out at the end. pandas has long been a central part of the data science ecosystem, and pyjanitor builds on it through a concept called method chaining, in which every cleaning step is a method call that returns a DataFrame, so the steps can be strung together in the order they are applied.
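To make method chaining concrete, here is a minimal sketch that uses plain pandas only, so it runs without pyjanitor installed. The data, the column names and the helper snake_case_columns are hypothetical and exist only for illustration. With pyjanitor installed, `import janitor` registers extra DataFrame methods such as `clean_names()` and `remove_empty()` that slot into a chain in the same way.

```python
import pandas as pd

# Hypothetical raw data: messy column names, one fully empty row,
# and numbers stored as text.
raw = pd.DataFrame(
    {
        "First Name": ["Ada", None, "Grace"],
        "Annual Salary": ["1000", None, "2500"],
    }
)


def snake_case_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-case, underscore-separated column names."""
    return df.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))


# Each step returns a new DataFrame, so the whole pipeline reads top to
# bottom like the path through the graph from raw data to usable data.
clean = (
    raw
    .pipe(snake_case_columns)   # tidy the column names
    .dropna(how="all")          # drop rows where every value is missing
    .assign(annual_salary=lambda df: df["annual_salary"].astype(int))
    .reset_index(drop=True)     # renumber rows after dropping
)

print(clean)
```

Because no step modifies raw in place, the original data stays available for comparison, and any single step can be commented out or reordered without touching the others.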

Beginners Guide to Pyjanitor - A Python Tool for Data Cleaning