As you may already know, data visualization is an extremely useful data analysis technique that helps us identify patterns, trends, and groups. Whether or not you are a visual learner, data visualization can serve you well in your future projects. In this article, we will see the limitations of higher-dimensional data sets and build the intuition behind Principal Component Analysis (PCA) as a solution, without going too deep into mathematical theory.

Data can be very diverse and come in all sorts of shapes and sizes, and we may sometimes feel intimidated when we come face-to-face with a data set we have never seen before, especially when its structure is a bit scary. In school, we may have had the opportunity to analyze data that was cleanly formatted with very few variables, maybe two or three. Then, at our first sight of the real world, we come face-to-face with a data set with 20 variables (now that’s scary)! So, I will explain the problem that arises with high-dimensional data and how PCA can address it.

Conceptual Motivation

Consider a data set with N observations and a single variable. The question is:

Are we able to visualize a single variable?

To visualize the 1-dimensional case, we can use a number-line plot, which places a single point for each observation at its value of the variable along the number line.

Plotting 1-Dimension (5 samples are plotted for simplicity)

Great! We are able to visualize the data when there is just a single variable. Through this 1-dimensional plot, we are able to see that some observations tend to be on the lower end and others tend to be on the upper end (the power of data visualization… am I right?).

However, this was just for the 1-dimensional case. Let’s ask the question again:

Are we able to visualize two variables together?

Plotting 2-Dimension (5 samples are plotted for simplicity)

Looks great! We were able to plot the data for both the 1-dimensional and 2-dimensional cases. Let us ask the question once more:

Are we able to visualize the data for three variables?

Plotting 3-Dimension (5 samples are plotted for simplicity)

So what’s the problem?

Let’s now try visualizing 4 variables at once (Wait… we can’t do that). Directly plotting 4-dimensional and higher-dimensional spaces is actually not possible (bummer). So when it comes to high-dimensional data sets, we can see there are two problems:

We are not able to visualize data in more than 3 dimensions
As the number of dimensions increases, other complications arise¹

How can PCA be used to solve this issue?

With this problem in mind, we can begin to see the use of PCA: we can take those 20 columns and reduce them down to 2 or 3 variables. This can be done if we can somehow create 2 or 3 new variables that summarize the variability of the original data set well (i.e., summarize the data well). These new variables are what we call principal components. We say that PCA is a method used to reduce the dimensions of the data set, i.e., a dimension reduction technique.

For a high-level characterization, each principal component is a weighted combination of the 20 original variables, with a different set of weights for each component. This means that the 20 variables have varying importance for each component. In other words, we can identify which variables are most important for a given principal component.
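To make the idea of weights a little more concrete, here is a minimal sketch in Python with NumPy. The data set is synthetic and made up purely for illustration (200 observations, 20 variables), and computing the components through a singular value decomposition of the centered data is one common approach, not necessarily the one any particular tool uses. The names `weights_pc1`, `scores_pc1`, and `most_important` are hypothetical.

```python
import numpy as np

# Synthetic stand-in for a 20-variable data set (an assumption for illustration).
rng = np.random.default_rng(0)
n_obs, n_vars = 200, 20
latent = rng.normal(size=(n_obs, 2))    # two hidden factors drive the data
mixing = rng.normal(size=(2, n_vars))   # how strongly each factor feeds each variable
X = latent @ mixing + 0.1 * rng.normal(size=(n_obs, n_vars))

# Center each variable, then decompose; each row of Vt is one set of weights.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

weights_pc1 = Vt[0]                     # one weight per original variable
scores_pc1 = X_centered @ weights_pc1   # value of the first component per observation

# Variables with the largest absolute weight matter most for this component.
most_important = np.argsort(-np.abs(weights_pc1))[:3]
print("Variables with the largest weights on the first component:", most_important)
```

Each observation’s value on the first component is simply its 20 centered values multiplied by the 20 weights and added up, which is why a variable with a weight near zero barely influences that component.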

Ideally, we can find a small number of principal components that explain most of the total variability of the original data.
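One way to check how much of the total variability each component explains is sketched below. Again, the data is synthetic and was deliberately built from two hidden factors, so the first two components are expected to dominate; real data will rarely be this tidy. The names `explained_ratio` and `cumulative` are hypothetical.

```python
import numpy as np

# Synthetic 20-variable data built from two hidden factors (an assumption).
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 20))

# Center the data and compute only the singular values.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# The variance along each component is proportional to its squared singular value.
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

print("Share of variability per component (first 3):", np.round(explained_ratio[:3], 3))
print("Cumulative share after 2 components:", round(float(cumulative[1]), 3))
```

If the cumulative share after 2 or 3 components is high, a 2-D or 3-D plot of those components loses little information; if it is low, the plot should be read with more caution.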

Therefore, once we identify the 2 or 3 principal components that summarize the original data set well, we can take those components and visualize the data.
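As a rough sketch of that last step, the snippet below reduces a synthetic 20-variable data set to two columns that could be handed to any scatter plot tool. The two made-up groups, the variable names, and the SVD-based computation are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data: two groups of 50 observations, 20 variables each (an assumption).
rng = np.random.default_rng(2)
group = np.repeat([0, 1], 50)
centers = rng.normal(scale=3.0, size=(2, 20))
X = centers[group] + rng.normal(size=(100, 20))

# Center the data and find the weight vectors of the principal components.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep only the first two components: every observation becomes an (x, y) pair.
coords_2d = X_centered @ Vt[:2].T
print(coords_2d.shape)  # (100, 2)
```

The two columns of `coords_2d` play the role of the x and y axes in the 2-dimensional plot from earlier, except that each axis is now a principal component rather than an original variable.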

towardsdatascience.com

Principle Component Analysis for the Non-Stem Major