1566022921

Originally published byAntonis Maronikolakisathttps://www.geeksforgeeks.org

We are given a data set of items, with certain features, and values for these features (like a vector). The task is to categorize those items into groups. To achieve this, we will use the kMeans algorithm; an unsupervised learning algorithm.

(It will help if you think of items as points in an n-dimensional space). The algorithm will categorize the items into k groups of similarity. To calculate that similarity, we will use the euclidean distance as measurement.

The algorithm works as follows:

- First we initialize k points, called means, randomly.
- We categorize each item to its closest mean and we update the mean’s coordinates, which are the averages of the items categorized in that mean so far.
- We repeat the process for a given number of iterations and at the end, we have our clusters.

The “points” mentioned above are called means, because they hold the mean values of the items categorized in it. To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature *x* the items have values in [0,3], we will initialize the means with values for *x* at [0,3]).

The above algorithm in pseudocode:

Initialize k means with random valuesFor a given number of iterations:

Iterate through items:

Find the mean closest to the item

Assign item to mean

Update mean

We receive input as a text file (‘data.txt’). Each line represents an item, and it contains numerical values (one for each feature) split by commas. You can find a sample data set here.

We will read the data from the file, saving it into a list. Each element of the list is another list containing the item values for the features. We do this with the following function:

def ReadData(fileName):`# Read the file, splitting by lines f = open(fileName, 'r'); lines = f.read().splitlines(); f.close(); items = []; for i in range(1, len(lines)): line = lines[i].split(','); itemFeatures = []; for j in range(len(line)-1): v = float(line[j]); # Convert feature value to float itemFeatures.append(v); # Add feature value to dict items.append(itemFeatures); shuffle(items); return items; `

We want to initialize each mean’s values in the range of the feature values of the items. For that, we need to find the min and max for each feature. We accomplish that with the following function:

def FindColMinMax(items):

n = len(items[0]);

minima = [sys.maxint for i in range(n)];

maxima = [-sys.maxint -1 for i in range(n)];`for item in items: for f in range(len(item)): if (item[f] < minima[f]): minima[f] = item[f]; if (item[f] > maxima[f]): maxima[f] = item[f]; `

return minima,maxima;

The variables *minima, maxima* are lists containing the min and max values of the items respectively. We initialize each mean’s feature values randomly between the corresponding minimum and maximum in those above two lists:

def InitializeMeans(items, k, cMin, cMax):`# Initialize means to random numbers between # the min and max of each column/feature f = len(items[0]); # number of features means = [[0 for i in range(f)] for j in range(k)]; for mean in means: for i in range(len(mean)): # Set value to a random float # (adding +-1 to avoid a wide placement of a mean) mean[i] = uniform(cMin[i]+1, cMax[i]-1); return means; `

We will be using the euclidean distance as a metric of similarity for our data set (note: depending on your items, you can use another similarity metric).

def EuclideanDistance(x, y):

S = 0; # The sum of the squared differences of the elements

for i in range(len(x)):

S += math.pow(x[i]-y[i], 2);`return math.sqrt(S); #The square root of the sum `

To update a mean, we need to find the average value for its feature, for all the items in the mean/cluster. We can do this by adding all the values and then dividing by the number of items, or we can use a more elegant solution. We will calculate the new average without having to re-add all the values, by doing the following:

m = (m*(n-1)+x)/n

where *m* is the mean value for a feature, *n* is the number of items in the cluster and *x* is the feature value for the added item. We do the above for each feature to get the new mean.

def UpdateMean(n,mean,item):

for i in range(len(mean)):

m = mean[i];

m = (m*(n-1)+item[i])/float(n);

mean[i] = round(m, 3);`return mean;`

Now we need to write a function to classify an item to a group/cluster. For the given item, we will find its similarity to each mean, and we will classify the item to the closest one.

def Classify(means,item):`# Classify item to the mean with minimum distance minimum = sys.maxint; index = -1; for i in range(len(means)): # Find distance from item to mean dis = EuclideanDistance(item, means[i]); if (dis < minimum): minimum = dis; index = i; return index; `

To actually find the means, we will loop through all the items, classify them to their nearest cluster and update the cluster’s mean. We will repeat the process for some fixed number of iterations. If between two iterations no item changes classification, we stop the process as the algorithm has found the optimal solution.

The below function takes as input *k* (the number of desired clusters), the items and the number of maximum iterations, and returns the means and the clusters. The classification of an item is stored in the array *belongsTo* and the number of items in a cluster is stored in *clusterSizes*.

def CalculateMeans(k,items,maxIterations=100000):`# Find the minima and maxima for columns cMin, cMax = FindColMinMax(items); # Initialize means at random points means = InitializeMeans(items,k,cMin,cMax); # Initialize clusters, the array to hold # the number of items in a class clusterSizes= [0 for i in range(len(means))]; # An array to hold the cluster an item is in belongsTo = [0 for i in range(len(items))]; # Calculate means for e in range(maxIterations): # If no change of cluster occurs, halt noChange = True; for i in range(len(items)): item = items[i]; # Classify item into a cluster and update the # corresponding means. index = Classify(means,item); clusterSizes[index] += 1; cSize = clusterSizes[index]; means[index] = UpdateMean(cSize,means[index],item); # Item changed cluster if(index != belongsTo[i]): noChange = False; belongsTo[i] = index; # Nothing changed, return if (noChange): break; return means;`

Finally we want to find the clusters, given the means. We will iterate through all the items and we will classify each item to its closest cluster.

def FindClusters(means,items):

clusters = [[] for i in range(len(means))]; # Init clusters`for item in items: # Classify item into a cluster index = Classify(means,item); # Add item to cluster clusters[index].append(item); return clusters; `

The other popularly used similarity measures are:-

1. **Cosine distance:** It determines the cosine of the angle between the point vectors of the two points in the n dimensional space

2. ** Manhattan distance:** It computes the sum of the absolute differences between the co-ordinates of the two data points.

3. ** Minkowski distance:** It is also known as the generalised distance metric. It can be used for both ordinal and quantitative variables

You can find the entire code on my GitHub, along with a sample data set and a plotting function.

**Thanks for reading** ❤

If you liked this post, share it with all of your programming buddies!

Follow us on **Facebook** | **Twitter**

☞ Machine Learning A-Z™: Hands-On Python & R In Data Science

☞ Python for Data Science and Machine Learning Bootcamp

☞ Machine Learning, Data Science and Deep Learning with Python

☞ Deep Learning A-Z™: Hands-On Artificial Neural Networks

☞ Artificial Intelligence A-Z™: Learn How To Build An AI

☞ A Complete Machine Learning Project Walk-Through in Python

☞ Machine Learning: how to go from Zero to Hero

☞ Top 18 Machine Learning Platforms For Developers

☞ 10 Amazing Articles On Python Programming And Machine Learning

☞ 100+ Basic Machine Learning Interview Questions and Answers

#machine-learning #python #data-science

1600190040

SciPy is the most efficient open-source library in python. The main purpose is to compute mathematical and scientific problems. There are many sub-packages in SciPy which further increases its functionality. This is a very important package for data interpretation. We can segregate clusters from the data set. We can perform clustering using a single or multi-cluster. Initially, we generate the data set. Then we perform clustering on the data set. Let us learn more SciPy Clusters.

It is a method that can employ to determine clusters and their center. We can use this process on the raw data set. We can define a cluster when the points inside the cluster have the minimum distance when we compare it to points outside the cluster. The k-means method operates in two steps, given an initial set of k-centers,

- We define the cluster data points for the given cluster center. The points are such that they are closer to the cluster center than any other center.
- We then calculate the mean for all the data points. The mean value then becomes the new cluster center.

The process iterates until the center value becomes constant. We then fix and assign the center value. The implementation of this process is very accurate using the SciPy library.

#numpy tutorials #clustering in scipy #k-means clustering in scipy #scipy clusters #numpy

1621443060

This article provides an overview of core data science algorithms used in statistical data analysis, specifically k-means and k-medoids clustering.

Clustering is one of the major techniques used for statistical data analysis.

As the term suggests, “clustering” is defined as the process of gathering similar objects into different groups or distribution of datasets into subsets with a defined distance measure.

*K-means* clustering is touted as a foundational algorithm every data scientist ought to have in their toolbox. The popularity of the algorithm in the data science industry is due to its extraordinary features:

- Simplicity
- Speed
- Efficiency

#big data #big data analytics #k-means clustering #big data algorithms #k-means #data science algorithms

1596381480

Clustering is an unsupervised learning technique which is used to make clusters of objects i.e. it is a technique to group objects of similar kind in a group. In clustering, we first partition the set of data into groups based on the similarity and then assign the labels to those groups. Also, it helps us to find out various useful features that can help in distinguishing between different groups.

Most common categories of clustering are:-

- Partitioning Method
- Hierarchical Method
- Density-based Method
- Grid-based Method
- Model-based Method

Partitioning method classifies the group of n objects into groups based on the features and similarity of data.

The general problem would be like that we will have ‘n’ objects and we need to construct ‘k’ partitions among the data objects where each partition represents a cluster and will contain at least one object. Also, there is an additional condition that says each object can belong to only one group.

The partitioning method starts by creating an initial random partitioning. Then it iterates to improve the partitioning by moving the objects from one partition to another.

**k-Means** clustering follows the partitioning approach to classify the data.

The hierarchical method performs a hierarchical decomposition of the given set of data objects. It starts by considering every data point as a separate cluster and then iteratively identifies two clusters which can be closest together and then merge these two clusters into one. We continue this until all the clusters are merged together into a single big cluster. A diagram called **Dendrogram **is used torepresent this hierarchy.

There are two approaches depending on how we create the hierarchy −

- Agglomerative Approach
- Divisive Approach

**Agglomerative Approach**

Agglomerative approach is a type of hierarchical method which uses bottom-up strategy. We start with each object considering as a separate cluster and keeps on merging the objects that are close to one another. It keep on doing so until all of the groups are merged into one or until the termination condition holds.

#k-means-clustering #machine-learning #clustering #python #code

1601196420

Clustering comes under the data mining topic and there is a lot of research going on in this field and there exist many clustering algorithms.

The following are the main types of clustering algorithms.

*K-Means**Hierarchical clustering**DBSCAN*

Following are some of the applications of clustering

- Customer Segmentation: This is one of the most important use-cases of clustering in the sales and marketing domain. Here the aim is to group people or customers based on some similarities so that they can come up with different action items for the people in different groups. One example could be, amazon giving different offers to different people based on their buying patterns.
- Image Segmentation: Clustering is used in image segmentation where similar image pixels are grouped together. Pixels of different objects in the image are grouped together.

#machine-learning #k-means-clustering #clustering #k-means

1595334123

I consider myself an active StackOverflow user, despite my activity tends to vary depending on my daily workload. I enjoy answering questions with angular tag and I always try to create some working example to prove correctness of my answers.

To create angular demo I usually use either plunker or stackblitz or even jsfiddle. I like all of them but when I run into some errors I want to have a little bit more usable tool to undestand what’s going on.

Many people who ask questions on stackoverflow don’t want to isolate the problem and prepare minimal reproduction so they usually post all code to their questions on SO. They also tend to be not accurate and make a lot of mistakes in template syntax. To not waste a lot of time investigating where the error comes from I tried to create a tool that will help me to quickly find what causes the problem.

```
Angular demo runner
Online angular editor for building demo.
ng-run.com
<>
```

Let me show what I mean…

There are template parser errors that can be easy catched by stackblitz

It gives me some information but I want the error to be highlighted

#mean stack #angular 6 passport authentication #authentication in mean stack #full stack authentication #mean stack example application #mean stack login and registration angular 8 #mean stack login and registration angular 9 #mean stack tutorial #mean stack tutorial 2019 #passport.js