We can use K-means and Principal Component Analysis (PCA) to cluster images in the Fashion MNIST dataset. We will also visually analyze the results using matplotlib and plotly.
Here we will train a K-means clustering model on the Fashion MNIST (f-MNIST) data so that it clusters the images with reasonable accuracy and produces clusters whose logic we can understand and interpret. We will then visually analyze the clustering results using matplotlib and plotly, compare them against the actual labels (y), and draw a rough conclusion about how K-means clustering performs on an image dataset. The final code is available at the link at the end. The words "features" and "components" are used interchangeably in this article.
Clustering is an unsupervised machine learning technique: it recognizes patterns without labels and groups the data according to its features. In our case, we will see whether a clustering algorithm (K-means) can find patterns among the images of apparel in f-MNIST without the labels (y).
A GIF illustrating how K-means works. Each red dot is a centroid and each color represents a different cluster. Every frame is an iteration in which the centroids are relocated.
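The iteration shown in the animation (assign every point to its nearest centroid, then relocate each centroid) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used for the article: the function name `kmeans_sketch`, its parameters, and the two-blob toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Hypothetical minimal K-means (Lloyd's algorithm) for illustration."""
    rng = np.random.default_rng(seed)
    # Use k distinct data points, chosen at random, as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centroid: shape (n, k).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Relocate each centroid to the mean of its points (keep it if its cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    # Final assignment against the final centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centroids

# Toy data (an assumption for this sketch): two well-separated 2-D blobs.
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0.0, 0.5, size=(50, 2)),
    data_rng.normal(10.0, 0.5, size=(50, 2)),
])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)
```

On real image data each row of `X` would be a flattened image rather than a 2-D point, but the loop is the same.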
K-means clustering works by placing as many centroids as the number of clusters requested. Each data point is assigned to the cluster whose centroid is nearest to it, and each centroid is then moved to the mean of the points assigned to it. These two steps repeat until the assignments stop changing. The algorithm aims to minimize the sum of squared Euclidean distances between each observation and the centroid of the cluster to which it belongs.
Principal Component Analysis, or PCA, is a method of reducing the dimensions of a given dataset while still retaining most of its variance. Wikipedia defines it as follows: “PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.”
PCA visualisation. The best principal component (the moving black line) is the one for which the total length of the red lines is at its minimum. It is then used in place of the horizontal and vertical components.
Basically, PCA reduces the dimensions of the dataset while conserving most of the information. For example, a dataset with 500 features might be reduced to 200 features, with the exact number depending on the amount of variance we choose to retain. The higher the variance retained, the more information is conserved, but the more dimensions the result will have. Fewer dimensions mean less time to train and test the model. In some cases, models trained on PCA-reduced data perform better than models trained on the original dataset.
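To make the trade-off between retained variance and number of dimensions concrete, here is a rough NumPy sketch of PCA that keeps just enough components to reach a target fraction of the variance. The helper `pca_retain_variance` and the synthetic 50-feature data are assumptions made for this illustration. It is not the article's final code, and a library implementation would normally be used in practice.

```python
import numpy as np

def pca_retain_variance(X, variance=0.95):
    """Hypothetical helper: project X onto the fewest principal
    components whose cumulative explained variance reaches `variance`."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions; singular values come sorted descending.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = S**2 / np.sum(S**2)
    cumulative = np.cumsum(explained_ratio)
    # First index at which the cumulative ratio reaches the target, capped for safety.
    n_components = min(int(np.searchsorted(cumulative, variance)) + 1, len(S))
    X_reduced = X_centered @ Vt[:n_components].T
    return X_reduced, float(cumulative[n_components - 1])

# Synthetic stand-in data (an assumption): 300 samples, 50 correlated features
# generated from 5 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.5 * rng.normal(size=(300, 50))

results = []
for target in (0.80, 0.90, 0.99):
    X_reduced, retained = pca_retain_variance(X, variance=target)
    results.append((target, X_reduced.shape[1], retained))
    print(f"target={target:.2f} components={X_reduced.shape[1]} retained={retained:.3f}")
```

Raising the target should never reduce the number of components kept, which is the trade-off described above: more information conserved, more dimensions left to process.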