Have you ever used a clustering method before? What was the most difficult part for you? Usually, I do clustering with these steps: scaling the input features, dimensionality reduction, and choosing one clustering algorithm that could perform well on the data. These steps are pretty standard, right? But the real problem lies ahead: understanding the clustering result.
Understanding or interpreting the clustering result usually takes time. We do some statistical analysis and visualisations to compare the clusters. If we change the dimensionality reduction or clustering method, the clusters will change and we need to redo the analysis. Interpreting the clustering result becomes the bottleneck that hinders us from quickly iterating on the whole process.
**My initial interpretation of the clustering result is as simple as calling a function `cluster_report(features, clustering_result)`.** In the following section, I will give an example of clustering and the result of `cluster_report`. If you want to skip the example, you can scroll to the bottom of this article to get the code and Google Colab notebook.
Let’s use Scikit-learn’s wine dataset as our example. This dataset has 13 numeric features and a label which indicates the type of wine. Below is a sample of the data.
| label | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.74 | 1.67 | 2.25 | 16.4 | 118.0 | 2.6 | 2.9 | 0.21 | 1.62 | 5.85 | 0.92 | 3.2 | 1060.0 |
| 2 | 12.79 | 2.67 | 2.48 | 22.0 | 112.0 | 1.48 | 1.36 | 0.24 | 1.26 | 10.8 | 0.48 | 1.47 | 480.0 |
| 1 | 12.37 | 1.13 | 2.16 | 19.0 | 87.0 | 3.5 | 3.1 | 0.19 | 1.87 | 4.45 | 1.22 | 2.87 | 420.0 |
| 0 | 13.56 | 1.73 | 2.46 | 20.5 | 116.0 | 2.96 | 2.78 | 0.2 | 2.45 | 6.25 | 0.98 | 3.03 | 1120.0 |
| 1 | 13.05 | 5.8 | 2.13 | 21.5 | 86.0 | 2.62 | 2.65 | 0.3 | 2.01 | 2.6 | 0.73 | 3.1 | 380.0 |
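If you want to follow along, a minimal way to load this dataset looks like the sketch below (the variable names `X` and `y` are my own, not from the article's notebook):

```python
from sklearn.datasets import load_wine

# Load the wine dataset as pandas objects:
# X holds the 13 numeric features, y holds the wine type label (0, 1, or 2)
X, y = load_wine(return_X_y=True, as_frame=True)
print(X.head())
```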
First, we need to standardise the data to prevent the clustering from being dominated by features with a larger scale. In this case, we use zero mean and unit variance standardisation. After that, we use PCA (Principal Component Analysis) to reduce the dimensions from 13 features to 2 principal components.
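A minimal sketch of these two steps, assuming the `X` DataFrame from the loading snippet above:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Zero mean, unit variance scaling of the 13 numeric features
X_scaled = StandardScaler().fit_transform(X)

# Reduce the 13 scaled features to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
```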
We use KMeans clustering for this example because most of us know about it. To determine the number of clusters for KMeans clustering, we use the elbow method and get k=3 as the optimal number.
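The elbow method can be sketched as below: fit KMeans for a range of k values and look for the "elbow" in the inertia curve (the k range and `X_pca` input are my assumptions, not shown in the article):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Inertia (within-cluster sum of squares) for a range of k values
ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_pca)
    inertias.append(km.inertia_)

# The "elbow" of this curve suggests the number of clusters
plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters (k)")
plt.ylabel("inertia")
plt.show()
```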
Using KMeans with k=3 on the two principal components, we get the clustering result below. The left scatter plot shows the original labels. The right scatter plot shows the clustering result.
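A sketch of how such a comparison could be produced, again assuming the `X_pca` and `y` variables from the earlier snippets:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster the two principal components with k=3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_pca)

# Side-by-side scatter plots: original labels vs. cluster assignments
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
ax1.set_title("Original label")
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
ax2.set_title("KMeans clusters (k=3)")
plt.show()
```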
After having the clustering result, we need to interpret the clusters. The easiest way to describe clusters is by using a set of rules. We could automatically generate the rules by training a decision tree model on the original features, with the clustering result as the label. I wrote a `cluster_report` function that wraps the decision tree training and the rule extraction from the tree. **You could simply call `cluster_report` to describe the clusters.** Easy, right?
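The full `cluster_report` implementation is linked at the bottom of the article; the sketch below only illustrates the idea (a decision tree fitted on the original features with the cluster labels as the target, then rules read off the tree). The default parameter values and the mapping of `pruning_level` to scikit-learn's cost-complexity pruning are my assumptions, and the real function formats the rules with the class proportions shown later, which this sketch does not do:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

def cluster_report(features, clustering_result, min_samples_leaf=50, pruning_level=0.01):
    # Fit a decision tree that predicts the cluster label from the original features
    tree = DecisionTreeClassifier(
        min_samples_leaf=min_samples_leaf,
        ccp_alpha=pruning_level,  # assumption: pruning_level maps to cost-complexity pruning
        random_state=42,
    )
    tree.fit(features, clustering_result)

    # Extract human-readable rules from the fitted tree
    rules = export_text(tree, feature_names=list(features.columns))
    print(rules)
```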
There are two parameters that we can adjust: `min_samples_leaf` and `pruning_level`. These parameters control the decision tree's complexity. To get more general rules, we could increase the value of `min_samples_leaf` or `pruning_level`. Conversely, if we want more detailed rules, we could decrease the value of `min_samples_leaf` or `pruning_level`.
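For example, using the sketch above with the original features `X` and the `clusters` array from the KMeans step (the specific parameter values here are only illustrative):

```python
# Broader, more general rules: larger leaves and stronger pruning
cluster_report(X, clusters, min_samples_leaf=50, pruning_level=0.05)

# More detailed rules: smaller leaves and lighter pruning
cluster_report(X, clusters, min_samples_leaf=5, pruning_level=0.005)
```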
The number in brackets shows the proportion of `class_name` satisfying the rule. For example, **[0.880]** `(proline > 755.0)` means that of all instances satisfying the `(proline > 755.0)` rule, 88% of them are in cluster 1.
#clustering #data-science #data-analysis #programming