We know that we can replace NaN values with the mean or median using fillna(). What if the missing data is correlated with another categorical column?

What if the expected value for the NaN is itself categorical?
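As a minimal sketch of the categorical case, one option is to fill each gap with the most frequent value (the mode) of its group. The DataFrame, the column names team and grade, and the choice of the first mode on ties are all assumptions made for illustration, not something prescribed by the article.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'grade' is a categorical column with missing entries.
df = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'B', 'B'],
    'grade': ['x', 'x', np.nan, 'y', np.nan, 'y'],
})

# Most frequent grade per team, broadcast back to every row of that team.
# Series.mode() ignores NaN by default; .iloc[0] picks the first mode on ties.
group_mode = df.groupby('team')['grade'].transform(lambda s: s.mode().iloc[0])

# Fill only the missing entries with the group's mode.
df['grade'] = df['grade'].fillna(group_mode)
print(df)
```

Note that this sketch would raise an IndexError for a group whose values are all NaN, since such a group has no mode; a real dataset may need a fallback for that case.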

Below are some useful tips for handling NaN values.

You will most likely be doing this with pandas and NumPy.

import pandas as pd
import numpy as np

ngroup

# Sample data: 'value' has two missing entries, one in each team.
cl = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'class': ['I', 'I', 'I', 'I', 'I', 'I', 'I', 'II', 'II', 'II'],
    'value': [1, np.nan, 2, 2, 3, 1, 3, np.nan, 3, 1],
})


Suppose you have to fill missing values in liquor consumption rate data. If no other column is relevant to it, a plain fillna() is enough.

But if each person's age is also given, you may see a pattern between age and consumption rate, because liquor consumption is not at the same level for all people.

Another example: missing salary values could be related to age, job title, and/or education.

In the above example, let us assume that the columns team and class are related to value.
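Under that assumption, one way to use the relationship is to fill each missing value with the mean of its own (team, class) group instead of the overall mean. The sketch below rebuilds the sample data so it runs on its own; the helper column name value_filled is made up for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as above, repeated so the snippet is self-contained.
cl = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'class': ['I', 'I', 'I', 'I', 'I', 'I', 'I', 'II', 'II', 'II'],
    'value': [1, np.nan, 2, 2, 3, 1, 3, np.nan, 3, 1],
})

# Mean of 'value' within each (team, class) group, aligned to the original rows.
group_mean = cl.groupby(['team', 'class'])['value'].transform('mean')

# Fill only the missing entries; existing values are left untouched.
# 'value_filled' is a hypothetical column name used for this example.
cl['value_filled'] = cl['value'].fillna(group_mean)
print(cl)
```

Swapping 'mean' for 'median' gives the group-wise median instead; which statistic fits better depends on the data.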

Using ngroup, you can label each group with an integer index.
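A short sketch of what that looks like on the sample data, grouping by team and class. The column name group_id is an assumption chosen for this example.

```python
import numpy as np
import pandas as pd

# Same sample data as above, repeated so the snippet is self-contained.
cl = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'class': ['I', 'I', 'I', 'I', 'I', 'I', 'I', 'II', 'II', 'II'],
    'value': [1, np.nan, 2, 2, 3, 1, 3, np.nan, 3, 1],
})

# ngroup() numbers the groups 0, 1, 2, ... in sorted key order,
# and returns that number for every row.
# 'group_id' is a hypothetical column name used for this example.
cl['group_id'] = cl.groupby(['team', 'class']).ngroup()
print(cl)
```

Here the three groups (A, I), (B, I) and (B, II) should be numbered 0, 1 and 2, giving a single key that can stand in for the combination of team and class.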


Best way to Impute categorical data using Groupby 