k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

Originally published by Antonis Maronikolakis at https://www.geeksforgeeks.org

We are given a data set of items, with certain features, and values for these features (like a vector). The task is to categorize those items into groups. To achieve this, we will use the k-means algorithm, an unsupervised learning algorithm.

Overview

It will help if you think of items as points in an n-dimensional space. The algorithm will categorize the items into k groups of similarity. To calculate that similarity, we will use the Euclidean distance as a measurement.

The algorithm works as follows:

- First we initialize k points, called means, randomly.
- We categorize each item to its closest mean and we update the mean’s coordinates, which are the averages of the items categorized in that mean so far.
- We repeat the process for a given number of iterations and at the end, we have our clusters.

The “points” mentioned above are called means because they hold the mean values of the items categorized in them. To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature *x* the items have values in [0,3], we will initialize the means with values for *x* in [0,3]).
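Both initialization options can be sketched in a few lines. This is a minimal illustration rather than the article's own code: the helper names `init_from_items` and `init_from_bounds` and the toy data are assumptions made for the example.

```python
import random

def init_from_items(items, k):
    # Option 1: pick k distinct items from the data set as the starting means
    return [list(item) for item in random.sample(items, k)]

def init_from_bounds(items, k):
    # Option 2: draw each coordinate uniformly between that feature's min and max
    n_features = len(items[0])
    lows = [min(item[f] for item in items) for f in range(n_features)]
    highs = [max(item[f] for item in items) for f in range(n_features)]
    return [[random.uniform(lows[f], highs[f]) for f in range(n_features)]
            for _ in range(k)]

# Toy data (hypothetical): feature x lies in [0, 3], feature y in [10, 20]
items = [[0.0, 10.0], [1.5, 12.0], [3.0, 20.0], [2.0, 15.0]]
means_a = init_from_items(items, 2)
means_b = init_from_bounds(items, 2)
```

Starting from actual items guarantees that every mean begins in a populated region, while sampling within the bounds can place a mean far from any item.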

The algorithm above in pseudocode:

```
Initialize k means with random values

For a given number of iterations:
    Iterate through items:
        Find the mean closest to the item
        Assign item to mean
        Update mean
```

Read Data

We receive input as a text file (‘data.txt’). Each line represents an item, and it contains numerical values (one for each feature) split by commas. You can find a sample data set here.
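As an illustration of this layout, the sketch below parses a few comma-separated lines from an in-memory string. The sample values are invented for the example (they are not the article's data set), and treating the first line as a header and the last column as a label is an assumption inferred from the way the reading function in this section skips both.

```python
import io

# Hypothetical contents of 'data.txt': a header line, then one item per line.
# The last column is assumed to be a label, not a feature.
sample = io.StringIO(
    "length,width,label\n"
    "5.1,3.5,a\n"
    "6.2,2.9,b\n"
)

lines = sample.read().splitlines()
items = []
for line in lines[1:]:  # skip the header line
    values = line.split(',')
    items.append([float(v) for v in values[:-1]])  # drop the label column

print(items)  # [[5.1, 3.5], [6.2, 2.9]]
```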

We will read the data from the file, saving it into a list. Each element of the list is another list containing the item values for the features. We do this with the following function:

```python
from random import shuffle

def ReadData(fileName):
    # Read the file, splitting by lines
    with open(fileName, 'r') as f:
        lines = f.read().splitlines()

    items = []

    # Start at index 1: the first line of the file is skipped
    for i in range(1, len(lines)):
        line = lines[i].split(',')
        itemFeatures = []

        # The last column of each line is skipped
        for j in range(len(line) - 1):
            v = float(line[j])      # Convert feature value to float
            itemFeatures.append(v)  # Add feature value to the list

        items.append(itemFeatures)

    shuffle(items)

    return items
```

Initialize Means

We want to initialize each mean’s values in the range of the feature values of the items. For that, we need to find the min and max for each feature. We accomplish that with the following function:

```python
def FindColMinMax(items):
    n = len(items[0])
    # Start from +/- infinity so that any real value replaces them
    minima = [float('inf') for i in range(n)]
    maxima = [float('-inf') for i in range(n)]

    for item in items:
        for f in range(len(item)):
            if item[f] < minima[f]:
                minima[f] = item[f]

            if item[f] > maxima[f]:
                maxima[f] = item[f]

    return minima, maxima
```

The variables *minima* and *maxima* are lists containing the minimum and maximum value of each feature, respectively. We initialize each mean’s feature values randomly between the corresponding minimum and maximum in those two lists:

```python
from random import uniform

def InitializeMeans(items, k, cMin, cMax):
    # Initialize means to random numbers between
    # the min and max of each column/feature
    f = len(items[0])  # number of features
    means = [[0 for i in range(f)] for j in range(k)]

    for mean in means:
        for i in range(len(mean)):
            # Set value to a random float
            # (adding +-1 to avoid a wide placement of a mean)
            mean[i] = uniform(cMin[i] + 1, cMax[i] - 1)

    return means
```

Euclidean Distance

We will be using the Euclidean distance as a metric of similarity for our data set (note: depending on your items, you can use another similarity metric).
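As a quick worked check of the metric: for the points (0, 0) and (3, 4) the squared differences are 9 and 16, so the distance is the square root of 25, which is 5. The snippet below computes this by hand and compares it with `math.dist` from the standard library (available from Python 3.8). The function name `euclidean` is just a label for this sketch.

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(math.dist(p, q))  # 5.0
```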

```python
import math

def EuclideanDistance(x, y):
    S = 0  # The sum of the squared differences of the elements
    for i in range(len(x)):
        S += math.pow(x[i] - y[i], 2)

    return math.sqrt(S)  # The square root of the sum
```

Update Means

To update a mean, we need to find the average value of each feature over all the items in the mean/cluster. We can do this by adding all the values and then dividing by the number of items, or we can use a more elegant solution. We will calculate the new average without having to re-add all the values, by doing the following:

m = (m*(n-1)+x)/n

where *m* is the mean value for a feature, *n* is the number of items in the cluster and *x* is the feature value for the added item. We do the above for each feature to get the new mean.
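The formula follows from the fact that if *m* is the mean of the first n-1 values, their sum is m*(n-1); adding *x* and dividing by *n* gives the mean of all *n* values. The short check below confirms that the incremental update agrees with averaging from scratch. The function name `running_mean` and the sample values are assumptions made for this sketch.

```python
def running_mean(m, n, x):
    # m: mean of the first n-1 values, x: the n-th value
    return (m * (n - 1) + x) / n

values = [2.0, 4.0, 9.0]
m = 0.0
for n, x in enumerate(values, start=1):
    m = running_mean(m, n, x)

print(m)                          # 5.0
print(sum(values) / len(values))  # 5.0
```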

```python
def UpdateMean(n, mean, item):
    for i in range(len(mean)):
        m = mean[i]
        m = (m * (n - 1) + item[i]) / float(n)
        mean[i] = round(m, 3)

    return mean
```

Classify Items

Now we need to write a function to classify an item to a group/cluster. For the given item, we will find its similarity to each mean, and we will classify the item to the closest one.

```python
def Classify(means, item):
    # Classify item to the mean with minimum distance
    minimum = float('inf')
    index = -1

    for i in range(len(means)):
        # Find distance from item to mean
        dis = EuclideanDistance(item, means[i])

        if dis < minimum:
            minimum = dis
            index = i

    return index
```

Find Means

To actually find the means, we will loop through all the items, classify them to their nearest cluster and update the cluster’s mean. We will repeat the process for some fixed number of iterations. If between two iterations no item changes classification, we stop the process, as the assignments have stabilized. Note that k-means in general settles on a local optimum, which is not guaranteed to be the best possible clustering.

The function below takes as input *k* (the number of desired clusters), the items and the maximum number of iterations, and returns the means. The classification of an item is stored in the array *belongsTo* and the number of items in a cluster is stored in *clusterSizes*.

```python
def CalculateMeans(k, items, maxIterations=100000):
    # Find the minima and maxima for columns
    cMin, cMax = FindColMinMax(items)

    # Initialize means at random points
    means = InitializeMeans(items, k, cMin, cMax)

    # Initialize clusters, the array to hold
    # the number of items in a class.
    # Note: the counts are not reset between passes, so each mean is a
    # running average over every assignment made to it so far.
    clusterSizes = [0 for i in range(len(means))]

    # An array to hold the cluster an item is in
    belongsTo = [0 for i in range(len(items))]

    # Calculate means
    for e in range(maxIterations):
        # If no change of cluster occurs, halt
        noChange = True
        for i in range(len(items)):
            item = items[i]

            # Classify item into a cluster and update the
            # corresponding means.
            index = Classify(means, item)

            clusterSizes[index] += 1
            cSize = clusterSizes[index]
            means[index] = UpdateMean(cSize, means[index], item)

            # Item changed cluster
            if index != belongsTo[i]:
                noChange = False

            belongsTo[i] = index

        # Nothing changed, return
        if noChange:
            break

    return means
```

Find Clusters

Finally we want to find the clusters, given the means. We will iterate through all the items and we will classify each item to its closest cluster.

```python
def FindClusters(means, items):
    clusters = [[] for i in range(len(means))]  # Init clusters

    for item in items:
        # Classify item into a cluster
        index = Classify(means, item)

        # Add item to cluster
        clusters[index].append(item)

    return clusters
```
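For comparison, here is a compact, self-contained sketch of the batch form of k-means (often called Lloyd's algorithm), in which every mean is recomputed from scratch from its assigned items after each full pass. This is a different update scheme from the running-average update in `CalculateMeans` above, and it is not the article's code. The function name `kmeans_batch`, the seeding from random items, the fixed `seed` parameter and the toy points are all assumptions made for the example. It relies on `math.dist`, which requires Python 3.8 or later.

```python
import math
import random

def kmeans_batch(items, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Start from k distinct items chosen at random
    means = [list(p) for p in rng.sample(items, k)]
    assignment = [-1] * len(items)

    for _ in range(max_iterations):
        # Assignment step: index of the nearest mean for every item
        new_assignment = [
            min(range(k), key=lambda c: math.dist(item, means[c]))
            for item in items
        ]
        if new_assignment == assignment:
            break  # no item changed cluster
        assignment = new_assignment

        # Update step: recompute each mean from its current members
        for c in range(k):
            members = [items[i] for i, a in enumerate(assignment) if a == c]
            if members:  # leave an empty cluster's mean where it is
                means[c] = [sum(col) / len(members) for col in zip(*members)]

    return means, assignment

# Toy data (hypothetical): two well-separated groups of points
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
means, assignment = kmeans_batch(points, k=2)
```

Because the means are rebuilt from the current members on every pass, an item that leaves a cluster stops influencing that cluster's mean, which is the main practical difference from the incremental update.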

Other commonly used similarity measures are:

1. **Cosine distance:** It is derived from the cosine of the angle between the vectors of the two points in the n-dimensional space.

2. **Manhattan distance:** It computes the sum of the absolute differences between the coordinates of the two data points.

3. **Minkowski distance:** It is also known as the generalised distance metric. It can be used for both ordinal and quantitative variables.
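The three measures can be sketched in plain Python as follows. The function names are chosen for this example. Cosine distance is taken here as one minus the cosine similarity, which is one common convention, and the Minkowski order `p` is a parameter (p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance).

```python
import math

def cosine_distance(x, y):
    # 1 - cos(angle between x and y); assumes neither vector is all zeros
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def manhattan_distance(x, y):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski_distance(x, y, p):
    # Generalised distance of order p
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = [1.0, 0.0], [0.0, 1.0]
print(cosine_distance(x, y))        # 1.0 (the vectors are orthogonal)
print(manhattan_distance(x, y))     # 2.0
print(minkowski_distance(x, y, 1))  # 2.0
```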

You can find the entire code on my GitHub, along with a sample data set and a plotting function.
