Data visualization plays a crucial role in real-world machine learning applications. In many cases, visualizing data makes it much easier to understand, interpret, and classify it, and there are techniques that help us visualize data by reducing the dimensions of the dataset.

In my previous article, I gave an overview of Principal Component Analysis (PCA) and explained how to implement it. PCA is a basic technique for reducing dimensions and plotting data. It has some limitations, the major one being that it does not group similar classes together; it is simply a linear transformation of the points that makes the data easier for humans to interpret. t-SNE is designed to overcome this limitation: it can group similar objects together even when the structure of the data is non-linear.

This article is categorized into the following sections:

  1. What is t-SNE?
  2. Need/Advantages of t-SNE
  3. Drawbacks of t-SNE
  4. Applications of t-SNE — when to use and when not to use?
  5. Implementation of t-SNE to MNIST dataset using Python
  6. Conclusion

What is t-SNE?

t-SNE is a dimensionality-reduction technique that tries to **maintain the local structure** of the data points.

Let’s understand the concept from the name (t-Distributed Stochastic Neighbor Embedding). Imagine all the data points plotted in a high, d-dimensional space, where each data point is surrounded by other data points of the same class, and likewise for every class. If we take any data point (x), the surrounding data points (y, z, etc.) are called its neighborhood. The neighborhood of a data point (x) is determined by how **geometrically close** the other data points (y or z) are to it, i.e. by calculating the distance between the two points. So, basically, the neighborhood of x contains the points that are closest to x. The technique only tries to preserve the distances within the neighborhood.
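As a rough illustration of this neighborhood idea, the sketch below (on made-up random data, using scikit-learn’s `NearestNeighbors`) finds the points that are geometrically closest to a given point x by Euclidean distance:

```python
# A rough sketch of the neighborhood idea on made-up random data:
# the neighborhood of a point x is simply the set of points that are
# geometrically closest to it, measured here by Euclidean distance.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # 100 points in a 50-dimensional space

nn = NearestNeighbors(n_neighbors=6).fit(X)
dist, idx = nn.kneighbors(X[:1])      # neighbors of the first point, x
# The first "neighbor" is x itself (distance 0), since x is in the fitted data.
print(idx[0])                         # indices of the closest points to x
print(dist[0])                        # their Euclidean distances from x
```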

What is embedding? The data points plotted in d dimensions are embedded in 2D in such a way that the neighborhood of every data point is maintained as closely as possible to what it was in d dimensions. Basically, for every point in the high-dimensional space there is a corresponding point in the low-dimensional space, linked by t-SNE’s neighborhood concept.
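A minimal sketch of what that embedding looks like in practice, assuming scikit-learn’s `TSNE` and synthetic data: every row of the high-dimensional `X` gets a corresponding 2-D row in `X_2d`.

```python
# A minimal sketch of embedding with scikit-learn's TSNE on synthetic data:
# each point in the 50-dimensional X gets a corresponding 2-D point in X_2d,
# chosen so that neighborhoods are preserved as well as possible.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                    # 200 points in 50 dimensions

X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(X.shape, "->", X_2d.shape)                  # (200, 50) -> (200, 2)
```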

Using the Gaussian distribution, t-SNE builds a probability distribution that defines the relationships between the points in high-dimensional space.
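As a rough NumPy sketch of that idea (using one fixed bandwidth `sigma` for every point, whereas t-SNE actually tunes a separate sigma per point through the perplexity parameter), the Gaussian-based conditional probability of picking x_j as a neighbor of x_i could be computed like this:

```python
# A rough NumPy sketch of the Gaussian similarities in high-dimensional space.
# Real t-SNE tunes a separate sigma for every point via the perplexity
# parameter; a single fixed sigma is assumed here for simplicity.
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    # Squared Euclidean distances between every pair of points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel on those distances
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbor
    # Normalize each row so the probabilities p(j|i) sum to 1 over j
    return P / P.sum(axis=1, keepdims=True)

X = np.random.default_rng(0).normal(size=(5, 3))
print(conditional_probabilities(X).round(3))
```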

It is stochastic because its output changes on every run; that is, it is not deterministic.
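A small sketch of this behaviour, assuming scikit-learn’s `TSNE` on synthetic data: two runs without a fixed seed generally produce different embeddings, while fixing `random_state` makes the result reproducible.

```python
# A small sketch of the stochastic behaviour with scikit-learn's TSNE:
# two runs without a fixed seed generally produce different embeddings,
# while fixing random_state makes the result reproducible.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(100, 20))

a = TSNE(n_components=2, perplexity=20, init="random").fit_transform(X)
b = TSNE(n_components=2, perplexity=20, init="random").fit_transform(X)
print(np.allclose(a, b))    # usually False: each run starts from a different layout

c = TSNE(n_components=2, perplexity=20, init="random", random_state=42).fit_transform(X)
d = TSNE(n_components=2, perplexity=20, init="random", random_state=42).fit_transform(X)
print(np.allclose(c, d))    # True: same seed, same embedding
```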

