1601271540

Clustering is an unsupervised learning technique. In unsupervised learning, the data is not labelled and there is no defined dependent variable. This type of learning is usually done to identify patterns in the data and/or to group similar data.

This post provides a detailed explanation of the types of clustering techniques, along with a code walk-through.

#k-means-clustering #hierarchical-clustering #clustering-algorithm #machine-learning

1617625260

Kubernetes is a hugely prevalent tool in 2021, and organizations are increasingly running their applications on multiple Kubernetes clusters. These multi-cluster architectures often combine multiple cloud providers, data centers, regions, and zones. Deploying an application or service on clusters with such diverse resources is therefore a complicated endeavor. This is the challenge that federation is intended to help overcome. The fundamental use case of federation is to scale applications across multiple clusters with ease. It removes the need to perform the deployment step more than once: you perform one deployment, and the application is deployed on every cluster listed in the federation list.

Essentially, Kubernetes cluster federation is a mechanism that provides a single way to distribute applications and services to multiple clusters. One of the most important things to note is that federation is not about cluster management; it is about application management.

Cluster federation is a way of federating your existing clusters as one single curated cluster. So, if you are leveraging Kubernetes clusters in different zones in different countries, you can treat all of them as a single cluster.

In cluster federation, we operate a host cluster and multiple member clusters. The host cluster holds all the configuration that is passed on to the member clusters. Member clusters are the clusters that share the workloads. It is possible to have the host cluster also share the workload and act as a member cluster, but organizations tend to keep host clusters separate for simplicity. On the host cluster, it is important to install the cluster registry and the federated API. With the cluster registry, the host has all the information it needs to connect to the member clusters. With the federated API, the controllers running on the host cluster make sure the federated resources are reconciled. In a nutshell, the host cluster acts as a control plane and pushes configuration to the member clusters.
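The host/member relationship described above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions: `HostCluster`, `MemberCluster`, `join`, `federate`, `reconcile`, and the example cluster names and spec are all hypothetical, and none of them correspond to a real Kubernetes or federation API.

```python
from dataclasses import dataclass, field


@dataclass
class MemberCluster:
    """Hypothetical stand-in for a member cluster that runs workloads."""
    name: str
    deployments: dict = field(default_factory=dict)


@dataclass
class HostCluster:
    """Hypothetical stand-in for the host cluster (the control plane)."""
    registry: dict = field(default_factory=dict)   # plays the role of the cluster registry
    federated: dict = field(default_factory=dict)  # federated resources: name -> (spec, placement)

    def join(self, member):
        # Register a member so the host knows how to reach it.
        self.registry[member.name] = member

    def federate(self, name, spec, placement):
        # Declare one resource and the set of member clusters it should run on.
        self.federated[name] = (spec, set(placement))

    def reconcile(self):
        # Push each federated resource to the members listed in its placement
        # and remove it from members that are not listed.
        for name, (spec, placement) in self.federated.items():
            for cluster_name, member in self.registry.items():
                if cluster_name in placement:
                    member.deployments[name] = dict(spec)
                else:
                    member.deployments.pop(name, None)


host = HostCluster()
for cluster_name in ("us-east", "eu-west", "ap-south"):
    host.join(MemberCluster(cluster_name))

# One declaration on the host, propagated to two of the three members.
host.federate("web", {"image": "nginx", "replicas": 3}, placement={"us-east", "eu-west"})
host.reconcile()
```

The point of the sketch is the single `federate` call: the application is declared once on the host, and the reconcile step is what distributes it to the selected members.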

#kubernetes #cluster #cluster management #federation #federation techniques #cluster communication

1600190040

SciPy is an open-source Python library for mathematical and scientific computing. It contains many sub-packages that further extend its functionality, which makes it an important package for data interpretation. One thing we can do with it is separate a data set into clusters, using a single cluster or multiple clusters. We first generate the data set and then perform clustering on it. Let us learn more about clustering in SciPy.

K-means is a method we can employ to determine clusters and their centers. We can use this process on the raw data set. A cluster is well defined when the points inside it are closer to each other than they are to points outside it. Given an initial set of k centers, the k-means method operates in two steps:

- We assign each data point to a cluster center, such that the point is closer to that center than to any other center.
- We then calculate the mean of the data points assigned to each center. The mean value becomes the new cluster center.

The process iterates until the center values stop changing, at which point we fix them as the final cluster centers. SciPy provides an implementation of this process in its `scipy.cluster.vq` module.
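The two steps above can be tried out with `scipy.cluster.vq`. The following is a minimal sketch, assuming a synthetic data set of two well-separated blobs and a hand-picked initial guess for the centers; the blob locations, sizes, and the `initial_centers` values are illustrative assumptions, not part of the original post.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Generate the data set: two blobs of 100 two-dimensional points each (assumed example data).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(100, 2))
data = np.vstack([blob_a, blob_b])

# Initial set of k = 2 centers (a deliberately rough guess).
initial_centers = np.array([[1.0, 1.0], [4.0, 4.0]])

# kmeans alternates the assign/update steps, starting from the given centers.
centers, distortion = kmeans(data, initial_centers)

# vq assigns every point to its nearest final center.
labels, distances = vq(data, centers)

print(centers)
print(np.bincount(labels))
```

Passing an array of initial centers instead of an integer k makes the run deterministic, which is convenient for experimenting with how the starting guess affects the result.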

#numpy tutorials #clustering in scipy #k-means clustering in scipy #scipy clusters #numpy

1596674880

Understanding how to evaluate clusters

**Clustering** is defined as finding natural groups in the data. But this definition is inherently subjective.

**What are natural groups?**

If we look at the picture below, can we figure out the natural groups of the flowers? Is it by shape or by color? It may even be by the size or species of the flower. Hence, _the notion of a natural group changes based on which characteristics we focus on._

Fig 1: Flowers (Source: Unsplash)

Let’s take another example, where we have some points or observations in a 2D plane, i.e. we have only two attributes.

Fig 2: Original Data and clustering with different number of clusters

The above figure has three subfigures. The first shows the original data; the second and third show clustering with two and four clusters respectively (observations belonging to the same cluster are marked with the same color).

Fortunately, with two attributes we can still visualize the clusters and try to gauge their quality. With more features, however, we can no longer visualize the data. Hence, we need some measure that lets us compare two or more sets of clusters, or two or more clustering algorithms on the same data. *Unfortunately, unlike classification, where we can compare algorithms using accuracy, or regression, where we can use mean squared error, there is no such clear-cut measure for clustering.*

**Clustering Tendency:**

What if the data has no clustering tendency at all? Even if the data is random, k-means will still generate k clusters when we apply it. So how do we measure whether the data has a clustering tendency? For this, we use the Hopkins statistic.

**Hopkins Statistic (H)**

In this scheme, as many artificially generated random points are added as there are original data points in the dataset. For each of the original points, the distance to its nearest neighbor is calculated, denoted by **w**. The same exercise is repeated for the artificially generated points, where the distance to the nearest neighbor is denoted by **u**.

A value of H near 0.5 indicates that the data has no clustering tendency, because the nearest-neighbor distances w and u are then about equal.
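The scheme above can be sketched in code. This is a minimal sketch that assumes the common form of the statistic, H = sum(u) / (sum(u) + sum(w)), where w are the nearest-neighbor distances of the original points and u those of the artificial points. It also assumes that the artificial points are drawn uniformly from the bounding box of the data and that their nearest neighbor is searched among the original points. The function name `hopkins_statistic` and the example data sets are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree


def hopkins_statistic(data, rng):
    """Hopkins statistic, assuming H = sum(u) / (sum(u) + sum(w))."""
    n, d = data.shape
    tree = cKDTree(data)

    # w: distance from each original point to its nearest *other* original point.
    # k=2 because the closest hit of a point queried against its own tree is itself.
    w = tree.query(data, k=2)[0][:, 1]

    # As many artificial points as original points, uniform over the bounding box.
    low, high = data.min(axis=0), data.max(axis=0)
    artificial = rng.uniform(low, high, size=(n, d))

    # u: distance from each artificial point to its nearest original point.
    u = tree.query(artificial, k=1)[0]

    return u.sum() / (u.sum() + w.sum())


rng = np.random.default_rng(0)

# Data with no structure: uniform noise.
uniform_data = rng.uniform(0, 1, size=(500, 2))

# Data with clear structure: two tight blobs.
clustered_data = np.vstack([
    rng.normal(0, 0.1, (250, 2)),
    rng.normal(5, 0.1, (250, 2)),
])

h_uniform = hopkins_statistic(uniform_data, rng)
h_clustered = hopkins_statistic(clustered_data, rng)
print(h_uniform, h_clustered)
```

Under this form, uniform noise should give a value close to 0.5, while strongly clustered data should push H toward 1, because artificial points mostly land far from the tight groups.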

**Cluster Evaluation Measures:**

**Sum of Squared Error (SSE):**

The most widely used clustering evaluation measure is the sum of squared error (SSE), which is given by the equations below.

SSE Equations (Image Source: Authors)

The calculation proceeds in the following steps:

- First, we find the centroid of each cluster by taking the average of all the observations in that cluster.
- Then we find how much the points in that cluster deviate from the centroid and sum these deviations.
- Then we sum this deviation, or error, over the individual clusters.
- SSE should be as low as possible.
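The steps above can be sketched as a short function. This is a minimal sketch, assuming the deviation is the squared Euclidean distance to the centroid, as the name of the measure suggests; the function name `sum_of_squared_error` and the four example points are illustrative.

```python
import numpy as np


def sum_of_squared_error(points, labels):
    """Sum, over clusters, of squared distances from each point to its cluster centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for cluster_id in np.unique(labels):
        members = points[labels == cluster_id]
        centroid = members.mean(axis=0)                  # step 1: centroid of the cluster
        total += ((members - centroid) ** 2).sum()       # steps 2 and 3: deviations, summed
    return total


points = [[0, 0], [2, 0], [10, 0], [12, 0]]

# Grouping neighbors together: centroids (1, 0) and (11, 0).
sse_good = sum_of_squared_error(points, [0, 0, 1, 1])   # 4.0

# Grouping distant points together: centroids (5, 0) and (7, 0).
sse_bad = sum_of_squared_error(points, [0, 1, 0, 1])    # 100.0

print(sse_good, sse_bad)
```

The grouping that keeps nearby points together has the much lower SSE, which is why a lower value is preferred when comparing two clusterings with the same number of clusters.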

I always understand intuitions better with an example, so let’s do just that.

#clustering #data-science #cluster #data analysis

1601110320

K-means is one of the most widely used unsupervised clustering methods.

The **K-means** algorithm clusters the data at hand by trying to separate samples into **K** groups of equal variance, minimizing a criterion known as the **inertia** or within-cluster sum of squares.

The k-means algorithm divides a set of **N** samples (stored in a data matrix **X**) into **K** disjoint clusters **C**, each described by the mean ***μj*** of the samples in the cluster. The means are commonly called the cluster **“centroids”**.
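To make the notation concrete, here is a minimal from-scratch sketch in NumPy. It assumes the standard Lloyd iteration (assign each sample to its nearest centroid, then recompute each centroid as the mean of its samples) and computes the inertia as the sum of squared distances from each sample to its centroid. The function name `lloyd_kmeans`, its parameters, and the example data are illustrative assumptions, not taken from the original post.

```python
import numpy as np


def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Cluster the N rows of X into K groups; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Start from K distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=K, replace=False)]

    for _ in range(n_iter):
        # Assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Recompute each centroid as the mean of its samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment and inertia with respect to the returned centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, inertia


# Example data matrix X: N = 200 samples in two well-separated groups.
data_rng = np.random.default_rng(0)
X = np.vstack([
    data_rng.normal(0.0, 0.3, (100, 2)),
    data_rng.normal(5.0, 0.3, (100, 2)),
])

centroids, labels, inertia = lloyd_kmeans(X, K=2)
print(centroids)
print(inertia)
```

Here each row of `centroids` plays the role of one mean μj, and `labels` records which of the K disjoint clusters each sample belongs to.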

The **K-means** algorithm falls into the family of **unsupervised** **machine** **learning** algorithms/methods. For this family of models, the researcher needs a dataset of observations, **without** also needing the **labels**/**classes** of those observations. Unsupervised learning studies how systems can infer a function that describes a hidden structure from unlabeled data.

Now let’s discover the **mathematical foundations** of the algorithm.

#artificial-intelligence #clustering #data-science #cluster-analysis #machine-learning