Data-driven customer targeting and product bundling are critical for businesses that want to stay relevant in the face of intense competition. Consumers are now spoilt for choice and prefer personalized product offerings. With the fourth industrial revolution driving immense growth in artificial intelligence and big data technologies, there has never been a better time to leverage segmentation models for this kind of analysis. But before we dive deep into these models, we should understand what kind of data they need. That is the focus of this blog: we will go through all the steps necessary to transform our raw dataset into the format we need for training and testing our segmentation algorithms.

The Data

For this exercise, we will be working with clickstream data from an online store offering clothing for pregnant women. The data span April 2008 to August 2008 and include variables such as product category, location of the photo on the webpage, country of origin of the IP address, and product price in US dollars. I chose this dataset because clickstream data is becoming a very important source of fine-grained information about customer behaviour. It also presents typical challenges such as high dimensionality, the need for feature engineering, the presence of categorical variables, and fields on different scales.
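
Before any of the preparation steps below, it helps to load the raw file and take a first look. The snippet below is a minimal sketch; the file name and the semicolon separator are assumptions about how the export is stored locally, not something fixed by the dataset itself.

```python
import pandas as pd

# Load the raw clickstream export for a first look.
# File name and separator are assumptions; adjust to your local copy.
df = pd.read_csv("e-shop clothing 2008.csv", sep=";")

print(df.shape)       # number of click records (rows) and variables (columns)
print(df.dtypes)      # which fields are numeric vs. categorical
print(df.head())      # a sample of raw click records
print(df.describe())  # summary statistics for numeric fields such as price
```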

We will try to prepare the data for product segmentation by performing the following steps:

  1. Exploratory Data Analysis (EDA)
  2. Feature Engineering
  3. One Hot Encoding
  4. Standardisation
  5. PCA
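
To make the last three steps concrete, here is a minimal scikit-learn sketch of how one-hot encoding, standardisation and PCA might be chained into a single preprocessing pipeline. The column names and the number of principal components are placeholders assumed for illustration; the actual choices follow from the EDA and feature engineering steps.

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column lists; replace with the fields chosen during EDA
# and feature engineering.
categorical_cols = ["country", "page1_main_category", "colour"]
numeric_cols = ["price", "session_clicks"]

preprocess = ColumnTransformer(
    transformers=[
        # One hot encode the categorical variables
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Standardise numeric variables to zero mean and unit variance
        ("scale", StandardScaler(), numeric_cols),
    ],
    sparse_threshold=0.0,  # return a dense matrix so PCA can consume it
)

pipeline = Pipeline(
    steps=[
        ("preprocess", preprocess),
        # Reduce the high-dimensional encoded matrix to a few components
        ("pca", PCA(n_components=10)),
    ]
)

# df is the dataframe loaded earlier; the output feeds the clustering step
features = pipeline.fit_transform(df)
print(features.shape)
```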

