Pandas is a powerful and versatile Python data analysis library that speeds up the preprocessing steps of data science projects. It provides numerous functions and methods for cleaning, transforming, and analyzing data.

Although the built-in functions of Pandas are capable of performing efficient data analysis, custom-made tools and libraries add further value to Pandas. In this post, we will explore 4 tools that enhance the data analysis process with Pandas.


Missingno

Pandas provides functions to check the number of missing values in the dataset. The **Missingno** library takes it one step further and shows the distribution of missing values in the dataset through informative visualizations.

Using Missingno's plots, we can see where the missing values are located in each column and whether there is a correlation between the missing values of different columns. Before handling missing values, it is very important to explore them in the dataset. Thus, I consider **Missingno** a highly valuable asset in the data cleaning and preprocessing steps.

Let’s first try to explore a dataset about the movies on streaming platforms. The dataset is available here on Kaggle.

The dataset contains 16744 movies and 17 features that describe each movie. The Pandas **isna** function combined with sum() gives us the number of missing values in each column. But in some cases we need more than the count. Let’s explore the missing values with Missingno.
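As a quick illustration of the count-based check described above, here is a minimal sketch. The small DataFrame is hypothetical toy data standing in for the movies dataset (only the column names echo the real one), so the numbers it prints are not those of the Kaggle data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movies dataset
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Age": ["13+", np.nan, np.nan, "18+"],
    "Directors": ["X", np.nan, "Y", "Z"],
    "Runtime": [90.0, np.nan, 120.0, 95.0],
})

# Number of missing values in each column
missing_counts = df.isna().sum()

# Share of missing values in each column (between 0 and 1)
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

The counts tell us how much is missing in each column, but not which rows are affected or whether the gaps in different columns overlap.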

import missingno as msno
# render plots within the Jupyter notebook
%matplotlib inline

The first tool we will use is the missing value matrix.

msno.matrix(df)


White lines indicate missing values. The “Age” and “Rotten Tomatoes” columns are dominated by white lines. But there is an interesting trend in the other columns that have missing values: they mostly have missing values in the same rows. If a row has a missing value in the “Directors” column, it is likely to have missing values in the “Genres”, “Country”, “Language”, and “Runtime” columns as well. This is highly valuable information when handling missing values.
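The co-occurrence pattern read off the matrix can also be checked numerically. The sketch below uses plain Pandas on hypothetical toy data (not the Kaggle dataset) and correlates the missing-value indicators of each pair of columns. A value close to 1 means two columns tend to be missing in the same rows. Missingno's `msno.heatmap(df)` plots a similar nullity correlation, so treat this as a rough numeric companion to that plot rather than a replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names echo the movies dataset
df = pd.DataFrame({
    "Directors": ["X", np.nan, "Y", np.nan, "Z", "W"],
    "Genres": ["Drama", np.nan, "Comedy", np.nan, "Action", "Drama"],
    "Runtime": [90.0, np.nan, 120.0, np.nan, 95.0, np.nan],
    "Age": [np.nan, "13+", np.nan, "18+", "7+", "13+"],
})

# True where a value is missing
null_mask = df.isna()

# Keep only columns that have at least one missing value
null_mask = null_mask.loc[:, null_mask.any()]

# Correlation between the missing-value indicators of each column pair
nullity_corr = null_mask.astype(int).corr()
print(nullity_corr.round(2))

# Among rows with a missing director, the share that also lack a genre
share_genres_missing = df.loc[df["Directors"].isna(), "Genres"].isna().mean()
print(share_genres_missing)
```

In this toy data “Directors” and “Genres” are always missing together, while “Age” tends to be missing in different rows, which shows up as a negative correlation.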

