Back to Machine Learning Basics - Clustering

In this article, we explore clustering algorithms, implement them from scratch with Python, and learn how to use the Scikit-Learn implementations.

In the previous couple of articles, we explored some basic machine learning algorithms. Thus far we have covered some simple regression algorithms and classification algorithms, and we started on algorithms that can be used for both types of problems, like SVM and Decision Trees. We used technologies like TensorFlow, PyTorch and Scikit-Learn for the implementation and application of these algorithms. Apart from that, we used optimization techniques such as Gradient Descent.

Up to this point, we have explored algorithms that use supervised learning. In this type of learning, the training set contains both inputs and desired outputs, so we always had input and expected output data to train our machine learning models. This way the algorithm can check whether its calculated output matches the desired output and take appropriate action based on the difference.

However, in real life we often don't have both input and output data; we only have input data. This means that the algorithm itself needs to figure out the connections between input samples. For that, we use unsupervised learning. In unsupervised learning, the training set contains only inputs. Just as we solve regression and classification problems with supervised learning, with unsupervised learning we solve clustering problems. This technique attempts to identify similar inputs and put them into categories, i.e. it clusters the data. Generally speaking, the goal is to detect hidden patterns among the data and group the samples accordingly. This means that samples which share some properties will fall into one group, called a cluster.
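To make the idea of grouping unlabeled inputs concrete, here is a minimal sketch in plain Python. It is a naive distance-threshold grouping written only for illustration, not one of the algorithms covered in this article. The sample points, the `group_by_distance` helper and the `threshold` value are all assumptions made up for this example.

```python
import math

# Unlabeled inputs only: there is no "expected output" column to learn from.
# These points are made-up illustration data.
samples = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]


def group_by_distance(points, threshold):
    """Naive grouping: add a point to the first group that already
    contains a member within `threshold` of it (hypothetical helper)."""
    groups = []
    for point in points:
        for group in groups:
            if any(math.dist(point, member) <= threshold for member in group):
                group.append(point)
                break
        else:
            # No existing group is close enough, so this point starts a new one.
            groups.append([point])
    return groups


clusters = group_by_distance(samples, threshold=1.0)
for index, cluster in enumerate(clusters):
    print(f"cluster {index}: {cluster}")
```

With this data the points near (1, 1) end up in one group and the points near (5, 5) in another, even though no labels were ever provided. The result of such a greedy pass depends on the order of the points and on the chosen threshold, which is part of why dedicated clustering algorithms exist.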

There are many clustering algorithms out there, and in this article we cover three of them: K-Means Clustering, Agglomerative Clustering and DBSCAN. As one can imagine, since the dataset is completely unlabeled, deciding which algorithm is optimal for the chosen dataset is much more complicated than in the supervised case, where predictions can be checked against known outputs. Usually, the performance of each algorithm depends on the unknown properties of the probability distribution the dataset was drawn from.
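As a quick preview, the three algorithms named above are all available in Scikit-Learn's `sklearn.cluster` module. The sketch below runs them on a tiny made-up dataset. The data and the parameter values (`n_clusters=2`, `eps=1.0`, `min_samples=2`) are assumptions chosen so that this toy example separates cleanly. They are not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

# Hypothetical toy data: two tight, well-separated groups of 2-D points.
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [1.1, 1.2],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1], [8.1, 8.2],
])

# Parameter values are assumptions picked for this toy data.
models = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=2),
}

# fit_predict learns the clusters and returns one cluster label per sample.
labels = {name: model.fit_predict(X) for name, model in models.items()}
for name, assigned in labels.items():
    print(f"{name}: {assigned}")
```

On data this simple all three agree on the grouping, although the numeric label given to each cluster is arbitrary and may differ between algorithms. Note also that K-Means and Agglomerative Clustering are told the number of clusters up front, while DBSCAN infers it from the density parameters.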


