Principal Component Analysis (PCA) is arguably one of the most difficult topics for beginners in machine learning to grasp. Here, I will try my best to explain intuitively what it is and how the algorithm does what it does. This post assumes only very basic knowledge of linear algebra, such as matrix multiplication and vectors.
PCA is a dimensionality-reduction technique used to compress large datasets with hundreds or thousands of features into smaller datasets with fewer features, while retaining as much information about the original dataset as possible.
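To make this concrete, here is a minimal sketch using scikit-learn's `PCA` class on a small synthetic dataset (the data here is randomly generated purely for illustration, not from the post's house example):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features

pca = PCA(n_components=2)              # keep the 2 strongest directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # now (100, 2): 5 features became 2
print(pca.explained_variance_ratio_)   # fraction of information each keeps
```

The `explained_variance_ratio_` attribute tells you how much of the dataset's variance each retained component accounts for, which is the usual way to judge how much information survived the compression.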
A perfect example would be:
Notice that the original dataset had five features, which could be reduced to two. These two features _generalize_ the five on the left.
To picture what's happening, let's reuse the previous example. A 2-dimensional plot showing the correlation between a house's size and its number of rooms can be compressed into a single size feature, as shown below:
If we project the houses on the black line, we would get something like this:
So we need to minimize that projection error (the total length of the blue lines) in order to retain maximum information.
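The idea of projecting points onto a line and measuring the error can be sketched in a few lines of NumPy. The points below are made-up 2D data standing in for the house example; the line that minimizes the projection error is the first principal component, which turns out to be the top eigenvector of the data's covariance matrix:

```python
import numpy as np

# Hypothetical 2D points (e.g. size vs. number of rooms), centered first
points = np.array([[1.0, 1.2], [2.0, 1.9], [3.0, 3.1], [4.0, 4.2]])
points = points - points.mean(axis=0)

def projection_error(points, direction):
    """Sum of squared distances from each point to the line through
    the origin along `direction` (the 'blue lines' in the figure)."""
    d = direction / np.linalg.norm(direction)
    positions = points @ d                    # scalar position on the line
    residuals = points - np.outer(positions, d)
    return np.sum(residuals ** 2)

# The best line is the eigenvector of the covariance matrix with the
# largest eigenvalue -- it minimizes the projection error.
cov = points.T @ points
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: eigenvalues ascending
best = eigvecs[:, -1]                         # top eigenvector

print(projection_error(points, best))                  # smallest possible
print(projection_error(points, np.array([1.0, 0.0])))  # any other line is worse
```

Minimizing the projection error and maximizing the variance of the projected points are the same problem, which is why the eigenvector of the largest eigenvalue solves both.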
I will explain some concepts intuitively in order for you to understand the algorithm better.
The mean of a dataset is its point of equilibrium. Imagine a rod on which balls are placed at various distances x from a wall:
Summing the balls' distances from the wall and dividing by the number of balls gives the point of equilibrium: the spot where a pivot would balance the rod.
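This equilibrium property is easy to check numerically. With some made-up ball positions, the signed distances from the mean cancel out exactly, which is what "balancing the rod" means:

```python
import numpy as np

# Hypothetical ball positions: distances x from the wall
x = np.array([1.0, 2.0, 4.0, 7.0])
mean = x.sum() / len(x)        # 3.5, the pivot point

# At the equilibrium point the signed distances sum to zero:
print((x - mean).sum())        # 0.0
```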
#principal-component #dimensionality-reduction #machine-learning #data-science #python #data-analysis