Feature Engineering Part 1: Mean/Median Imputation

Mean or Median Imputation:

The mean or median should be calculated on the train set only and then used to replace NA in both the train and test sets. This keeps information from the test set out of the fitting step and helps avoid over-fitting.
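The train-only rule above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `fit_median` and `impute` and the toy `age` values are hypothetical, and NA is represented here as `float("nan")`.

```python
import math
from statistics import median

def fit_median(values):
    """Learn the median from the observed (non-missing) values only."""
    observed = [v for v in values if not math.isnan(v)]
    return median(observed)

def impute(values, fill_value):
    """Replace every NaN with a previously learned fill value."""
    return [fill_value if math.isnan(v) else v for v in values]

nan = float("nan")
train_age = [22.0, nan, 35.0, 58.0, 41.0]  # hypothetical train column
test_age = [nan, 30.0, nan]                # hypothetical test column

# The statistic is computed on the train set only...
train_median = fit_median(train_age)

# ...and the same value is reused for both sets; nothing is recomputed on test.
train_filled = impute(train_age, train_median)
test_filled = impute(test_age, train_median)

print(train_median)
print(train_filled)
print(test_filled)
```

Swapping `median` for `statistics.mean` gives mean imputation with the same fit-on-train, apply-to-both pattern.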

Definition of Mean / Median Imputation:

Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable with the mean or median of that variable's observed values.

Which variables can I impute with Mean / Median Imputation?

· The mean and median can only be calculated on numerical variables, so these methods are suitable for continuous and discrete numerical variables only.
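As a rough sketch of restricting imputation to numerical variables, the snippet below uses pandas (an assumption, since the post does not name a library). The DataFrame and its column names are hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],        # continuous numerical
    "n_children": [0.0, 2.0, np.nan, 1.0],    # discrete numerical
    "city": ["Pune", None, "Delhi", "Pune"],  # categorical: no mean or median
})

# Keep only the columns on which a mean or median is defined.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# One median per numerical column, computed from the observed values.
fill_values = df[numeric_cols].median()

# Fill NA column by column; the categorical column is left untouched.
df[numeric_cols] = df[numeric_cols].fillna(fill_values)

print(numeric_cols)
print(df)
```

The missing value in `city` is still missing afterwards, which is the point: a categorical variable needs a different imputation technique.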

Figure: Mean/Median Imputation

Assumptions:

1. Data is missing completely at random (MCAR)

2. The missing observations most likely look like the majority of the observations in the variable (that is, like the mean/median).

3. If data is missing completely at random, it is fair to assume that the missing values are close to the mean or the median of the distribution, as these represent the typical (central) observation.

Advantages:

Easy to implement.
Fast way of obtaining complete datasets.
Can be integrated into production (during model deployment).
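One way the production point could look in practice is sketched below with scikit-learn's `SimpleImputer`. The library choice is an assumption (the post does not name one), and the arrays are hypothetical toy data. The imputer learns the medians once on the train data and then applies them to new records at deployment time.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical train data: two numerical features with missing values.
X_train = np.array([
    [22.0, 1.0],
    [np.nan, 3.0],
    [35.0, np.nan],
    [58.0, 2.0],
])

# Hypothetical records arriving after deployment.
X_new = np.array([
    [np.nan, np.nan],
    [40.0, 5.0],
])

# Learn one median per column from the train data only.
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)
learned = imputer.statistics_

# At serving time, reuse the learned medians; nothing is refit on new data.
X_new_filled = imputer.transform(X_new)

print(learned)
print(X_new_filled)
```

Using `strategy="mean"` switches to mean imputation. Because the fitted imputer stores its statistics, it can be saved alongside the model and reused unchanged.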
