1595558700

K-means and Kohonen SOM are two of the most widely applied data clustering algorithms.

Although K-means is a simple vector quantization method and Kohonen SOM is a neural network model, they’re remarkably similar.

In this post, I’ll try to explain, in as plain a language as I can, how each of these unsupervised models works.

K-means clustering was introduced to us back in the late 1960s. The goal of the algorithm is to find and group similar data objects into a number (K) of clusters.

By ‘similar’ we mean data points that are both close to each other (in the Euclidean sense) and close to the same cluster center.

The centroids in these clusters *move* after each iteration during training: for each cluster, the algorithm calculates the average (mean) of all its data points, and that mean becomes the new centroid.

K (the number of clusters) is a tunable hyperparameter. This means it’s not learned and we must set it manually.

*This is how K-means is trained:*

- We give the algorithm a set of data points and a number K of clusters as input.
- It places K centroids in random places and computes the distances between each data point and each centroid. We can use Euclidean, Manhattan, Cosine, or some other distance measure - the choice will depend on our specific dataset and objective.
- The algorithm assigns each data point to the cluster whose centroid is nearest to it.
- The algorithm recomputes centroid positions. It takes all input vectors in a cluster and averages them out to figure out the new position.
- The algorithm keeps looping through the assignment and update steps until convergence. Typically, we finish the training phase when the centroids stop moving and data points stop changing cluster assignments, or we can simply *tell* the algorithm how many iterations to run.
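The training steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the loop (2-D points, Euclidean distance, random data points as initial centroids), not a production implementation; all function and variable names are my own.

```python
import random

def kmeans(points, k, iters=100):
    """A minimal K-means sketch following the steps above.

    points: a list of (x, y) tuples; k: the number of clusters."""
    # Place K centroids "at random" (here: k random data points).
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for px, py in points:
            nearest = min(range(k),
                          key=lambda i: (px - centroids[i][0]) ** 2
                                      + (py - centroids[i][1]) ** 2)
            clusters[nearest].append((px, py))
        # Update step: recompute each centroid as the mean of its cluster's
        # points (an empty cluster keeps its old centroid).
        new_centroids = [
            (sum(px for px, _ in c) / len(c), sum(py for _, py in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Convergence: stop once the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

With k = 1, for example, the single centroid converges to the mean of all the points, exactly as the update step describes.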

K-means is easier to implement and faster than most other clustering algorithms, but it has some major flaws. Here are a few of them:

- With K-means, the end result of clustering will largely depend on the initial centroid placement.
- The algorithm is extremely sensitive to outliers and lacks scalability.

_Outliers can really mess up the K-means algorithm._

- We need to specify K, and it’s not always obvious what a good value for K is (although there are a few techniques that can help figure out the optimal number of clusters, such as the elbow method or the silhouette method).
- K-means only works on numerical variables (obviously, we can’t compute the mean of categorical variables such as ‘bicycle’, ‘car’, ‘horse’, etc.).
- It performs poorly on high-dimensional data.

The Kohonen SOM is an unsupervised neural network commonly used for high-dimensional data clustering.

Although it’s a neural network model, its architecture, unlike that of most deep neural nets, is fairly straightforward. It only has three layers.

- **Input layer** — inputs in an n-dimensional space.
- **Weight layer** — adjustable weight vectors that belong to the network’s processing units.
- **Kohonen layer** — the computational layer, consisting of processing units organized in a 2D lattice-like structure (or a 1D string-like structure).

SOMs’ distinct property is that they can map high-dimensional input vectors onto spaces with fewer dimensions and preserve datasets’ original topology while doing so.

1. We initialize the weight vector values randomly.

2. For each input pattern, each neuron computes the value of a discriminant function, which is typically the squared Euclidean distance between the neuron’s weight vector and the input vector. The unit whose weight vector is closest to the input is declared the winning node (the best matching unit).
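The post stops at the best-matching-unit step, so here is a minimal pure-Python sketch of steps 1 and 2 for a 1-D string of nodes (the names `squared_distance` and `find_bmu` are my own):

```python
import random

def squared_distance(w, x):
    """Squared Euclidean distance: the discriminant function from step 2."""
    return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

def find_bmu(weights, x):
    """Return the index of the best matching unit: the node whose
    weight vector is closest to the input vector x."""
    return min(range(len(weights)), key=lambda i: squared_distance(weights[i], x))

# Toy example: a 1-D string of 5 nodes over 2-D inputs.
random.seed(0)
weights = [[random.random(), random.random()] for _ in range(5)]  # step 1
bmu = find_bmu(weights, [0.5, 0.5])                               # step 2
```

In a full SOM, training would continue by pulling the BMU’s weight vector (and those of its grid neighbors) toward the input; that update step is beyond what this post covers.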

#machine learning #k-means #som #algorithms

1621443060

This article provides an overview of core data science algorithms used in statistical data analysis, specifically k-means and k-medoids clustering.

Clustering is one of the major techniques used for statistical data analysis.

As the term suggests, “clustering” is the process of gathering similar objects into groups, i.e., partitioning a dataset into subsets according to a defined distance measure.

*K-means* clustering is touted as a foundational algorithm every data scientist ought to have in their toolbox. The popularity of the algorithm in the data science industry is due to its extraordinary features:

- Simplicity
- Speed
- Efficiency

#big data #big data analytics #k-means clustering #big data algorithms #k-means #data science algorithms

1624333080

K-means is one of the simplest unsupervised machine learning algorithms that solve the well-known data clustering problem. Clustering is one of the most common data analysis tasks, used to get an intuition about the structure of the data. It is defined as finding subgroups in the data such that data points in the same cluster are very similar, while data points in different clusters are very different. We are trying to find homogeneous subgroups within the data, where each group’s data points are similar according to a similarity metric such as Euclidean distance or correlation-based distance.

The algorithm can cluster based on features or on samples: we can try to find subcategories of samples based on their attributes, or subcategories of features based on the samples. The practical applications of such a procedure are many: clustering underlies the recommender systems at Amazon and Netflix; given a medical image of a group of cells, a clustering algorithm can help identify the centers of the cells; looking at the GPS data of a user’s mobile device, their most frequently visited locations within a certain radius can be revealed; and for any set of unlabeled observations, clustering helps establish whether the data has some structure that might indicate it is separable.

K-means is a clustering algorithm whose primary goal is to group similar elements or data points into clusters.

K in k-means represents the number of clusters.

A cluster refers to a collection of data points aggregated together because of certain similarities.

K-means clustering is an iterative algorithm that starts with *k* random points used as mean values to define clusters. Data points belong to the cluster represented by the mean value to which they are closest. These mean-value coordinates are called the *centroid*.

Iteratively, the mean value of each cluster’s data points is computed, and the new mean values are used to restart the process until the means stop changing. The disadvantage of K-means is that it is a local search procedure and can miss global patterns.

The *k* initial centroids can be selected randomly. Another approach is to compute the mean of the entire dataset and add *k* random offsets to it to make *k* initial points. Yet another method is to determine the principal component of the data and divide it into *k* equal partitions; the mean of each partition can be used as an initial centroid.
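The first two initialization strategies above are easy to sketch in plain Python (the third, PCA-based one needs linear algebra and is omitted). This is an illustrative sketch; the function names are my own.

```python
import random

def init_random_selection(points, k):
    """Strategy 1: pick k distinct data points at random as initial centroids."""
    return random.sample(points, k)

def init_mean_plus_offsets(points, k, scale=1.0):
    """Strategy 2: compute the mean of the entire dataset and add k random
    offsets to it, giving k initial points scattered around the data's center."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return [[m + random.uniform(-scale, scale) for m in mean] for _ in range(k)]
```

Random selection guarantees the initial centroids lie inside the data, while the mean-plus-offsets approach clusters them near the data’s center; either way, K-means then refines them iteratively.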

#data-science #algorithms #clustering #k-means #machine-learning