Customer segmentation separates a company’s customers into groups based on shared traits (age, gender, income, etc.), and an effective analysis can help a company make better marketing decisions. In this post, I will share a customer segmentation analysis based on data provided by Arvato. Arvato is a supply chain solutions company, and the dataset I used for my analysis contains customer demographics data for one of its client companies. I also built a model to predict the response rate of mail-out advertisements.

Below I will walk you through the details of my analysis using both unsupervised and supervised learning methods. Two datasets, customer data from a client company of Arvato and general demographics data from Germany, were analyzed to answer the following questions:

1. Who are the loyal customers of the client company, and with a change in marketing strategy to expand customer demographics, who are the potential customers to target?

2. When the client company sends out a mail-out offer, can we predict the response rate?

Part I. Customer Segmentation

Who are the loyal customers of the client company, and with a change in marketing strategy to expand customer demographics, who are the potential customers to target?

Cluster segmentation helps to map the demographics of the client company’s existing customers to the general population in Germany. Here I apply unsupervised learning to identify what segment of the general population represents the loyal customers of the client company, and what segment represents a pool of potential new customers that they might target.
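As a rough illustration of how clusters can be mapped between the two populations, the sketch below compares each cluster's share among customers with its share in the general population. It assumes cluster labels have already been assigned to both datasets by some clustering model. The function names and the toy labels are hypothetical and are not taken from the analysis itself.

```python
from collections import Counter


def cluster_shares(labels):
    """Return the fraction of records that fall into each cluster."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: n / total for cluster, n in counts.items()}


def representation_ratio(customer_labels, population_labels):
    """Customer share divided by general-population share, per cluster.

    A ratio above 1 means the cluster is over-represented among customers;
    a ratio below 1 means it is under-represented.
    """
    cust = cluster_shares(customer_labels)
    pop = cluster_shares(population_labels)
    return {c: cust.get(c, 0.0) / share for c, share in sorted(pop.items())}


# Toy labels (made up for illustration): three clusters, 0, 1 and 2.
population_labels = [0] * 50 + [1] * 30 + [2] * 20
customer_labels = [0] * 10 + [1] * 30 + [2] * 10

ratios = representation_ratio(customer_labels, population_labels)
for cluster, ratio in ratios.items():
    print(f"cluster {cluster}: customer share / population share = {ratio:.2f}")
```

In this toy data, cluster 1 is over-represented among customers, so people in the general population who fall into it resemble the existing customer base. How to read the under-represented clusters depends on the marketing strategy.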

Data Exploration

A data scientist spends 80% of the time cleaning data.

During the course of this project, 90% of my time was spent on data exploration and preprocessing. The two datasets I used for customer segmentation contain a large amount of raw data:

  • Azdias: Demographics data for the general population of Germany (891,211 rows x 366 columns).
  • Customers: Demographics data for the client company (191,652 rows x 369 columns).

In the general population data, 273 columns contain missing values. I decided to drop columns with a missing-value rate above 30% (shown in red), such as ALTER_KIND1 and EXTSEL992, since a large share of missing values distorts summary statistics and hurts model building. For the remaining columns (shown in blue), I imputed the missing data with the most frequent value.
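The two preprocessing steps described above can be sketched in pandas as follows. This is a minimal illustration and not the exact code used in the analysis. The helper name `drop_and_impute`, the toy data, and the columns `feature_a` and `feature_b` are hypothetical; only ALTER_KIND1 and the 30% threshold come from the text.

```python
import numpy as np
import pandas as pd


def drop_and_impute(df, max_missing_rate=0.30):
    """Drop columns whose missing rate exceeds the threshold,
    then fill remaining gaps with each column's most frequent value."""
    # Fraction of missing entries per column.
    missing_rate = df.isna().mean()
    keep = missing_rate[missing_rate <= max_missing_rate].index
    kept = df[keep].copy()
    # mode() returns a DataFrame; its first row holds one most frequent
    # value per column (ties are resolved by sort order).
    modes = kept.mode(dropna=True).iloc[0]
    return kept.fillna(modes), missing_rate


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "ALTER_KIND1": [np.nan, np.nan, np.nan, 10.0, np.nan],  # 80% missing
    "feature_a": [1.0, 2.0, 2.0, np.nan, 2.0],              # 20% missing
    "feature_b": ["A", "B", None, "B", "B"],                # 20% missing
})

cleaned, rates = drop_and_impute(toy)
print(rates)
print(cleaned)
```

Computing the missing rate once and reusing it also makes it easy to plot which columns fall above or below the threshold before anything is dropped.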

