K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both regression and classification. Its operation can be compared to the following analogy:

Tell me who your neighbors are, and I will tell you who you are.

To make a prediction, the KNN algorithm doesn't fit a predictive model to a training dataset the way logistic or linear regression does. KNN doesn't need to build a model at all: it uses the dataset directly to produce a result, with no training phase. That's why it's usually categorized as a lazy learning method.
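To see what this means in practice, here is a minimal sketch using scikit-learn's KNeighborsClassifier (the toy data is made up for illustration). Note that fit() merely stores the training set; all the work happens at prediction time.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset: two features per observation, binary labels (illustrative)
X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y_train = [0, 0, 0, 1, 1, 1]

# "Training" a KNN only stores the dataset; no model parameters are learned
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# The actual work (distance computation, voting) happens at prediction time
print(knn.predict([[5, 5]]))  # [1]
```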

KNN is used in many fields:

  • Computer vision: given an input image, KNN is one of many techniques used to find the images in a collection that most closely resemble it.
  • Marketing: imagine you have purchased a baby stroller on Amazon; the next day, the e-commerce company may email you suggesting a set of pacifiers. Amazon can target you with similar products, or with products bought by other customers who share your buying habits.
  • Content recommendation: probably the best-known example is Spotify and its legendary recommendation engine. KNN is used in many content recommendation engines; even though more powerful systems are now available, it remains relevant to this day.

How does KNN make a prediction?

To make a prediction, the KNN algorithm uses the entire dataset. Given a new observation that isn't part of the dataset and whose output value we want to predict, the algorithm looks for the K instances of the dataset closest to that observation.

Then, for these K neighbors, the algorithm uses their output values to calculate the output variable y of the observation we want to predict.

In other words:

  • If KNN is used for a regression problem, the mean (or median) of the y values of the K closest observations is used as the prediction.
  • If KNN is used for a classification problem, the mode (the value that appears most often) of the y values of the K closest observations is used as the prediction (a short sketch of both cases follows this list).
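As a concrete illustration, suppose we have already found the outputs of the K = 5 nearest neighbors (the values below are made up); the two cases differ only in the statistic applied:

```python
from statistics import mean, mode

# Hypothetical outputs of the K = 5 nearest neighbors
regression_outputs = [3.1, 2.9, 3.4, 3.0, 3.2]
classification_outputs = ["cat", "dog", "cat", "cat", "dog"]

# Regression: average the neighbors' numeric outputs
print(mean(regression_outputs))      # 3.12
# Classification: take the most frequent label among the neighbors
print(mode(classification_outputs))  # cat
```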

Understanding the algorithm

Input data:

  • A dataset D.
  • A distance function d (for example, the Euclidean distance; see the sketch after this list).
  • An integer K.
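A minimal sketch of such a Euclidean distance function, assuming observations are plain numeric feature vectors:

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```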

Steps:

For a new observation X whose output variable y we want to predict:

  1. Calculate the distance between observation X and every other observation in the dataset D, using the distance function d.
  2. Retain the K observations of the dataset D closest to X.
  3. Take the y values of the K retained observations:
     1. If it's a regression problem, calculate the mean (or median) of the retained y values.
     2. If it's a classification problem, calculate the mode of the retained y values.
  4. Return the value calculated in step 3 as the value predicted by KNN for observation X.

We can summarize the functioning of KNN in the following Python sketch (a minimal, illustrative implementation of the steps above; the dataset layout and the use of Euclidean distance are assumptions for the example, not a fixed interface):
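```python
import math

def knn_predict(D, d, K, X, task="classification"):
    """Predict the output y of a new observation X from the K nearest
    neighbors in D, a list of (features, y) pairs, using distance d."""
    # Step 1: calculate the distance from X to every observation in D
    distances = [(d(features, X), y) for features, y in D]
    # Step 2: retain the K observations closest to X
    k_nearest = sorted(distances, key=lambda pair: pair[0])[:K]
    # Step 3: take the y values of the K retained observations...
    k_outputs = [y for _, y in k_nearest]
    if task == "regression":
        # ...3.1: mean for regression
        prediction = sum(k_outputs) / K
    else:
        # ...3.2: mode for classification
        prediction = max(set(k_outputs), key=k_outputs.count)
    # Step 4: return the value calculated in step 3
    return prediction

# Example usage with Euclidean distance on a made-up dataset
euclidean = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
D = [([1, 2], 0), ([2, 3], 0), ([6, 5], 1), ([7, 7], 1)]
print(knn_predict(D, euclidean, K=3, X=[5, 5]))  # 1
```

In practice you would reach for scikit-learn's KNeighborsClassifier or KNeighborsRegressor rather than rolling your own, but this sketch makes each step of the algorithm explicit.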
