In my previous article, we took a graphical approach to understanding how Principal Component Analysis works and how it can be used for data compression. If you are new to this concept, I strongly suggest you read that article before proceeding. I have provided the link below:

Principal Component Analysis — Visualized (towardsdatascience.com)

In this article, we will learn how PCA can be used to compress a real-life dataset. We will be working with **Labelled Faces in the Wild (LFW)**, a large-scale dataset consisting of 13,233 grayscale human-face images, each of dimension 64×64. This means that the data for each face is 4096-dimensional (there are 64×64 = 4096 unique pixel values to store for each face). Using PCA, we will reduce this requirement to just a few hundred dimensions!

Introduction

Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets by exploiting the fact that their images have a lot in common. For instance, in a dataset of face photographs, every photograph contains facial features like eyes, a nose, and a mouth. Instead of encoding this information pixel by pixel, we could build a template for each type of feature and then combine these templates to generate any face in the dataset. In this approach, each template is still 64×64 = 4096-dimensional, but since the templates (basis functions) are reused to generate every face in the dataset, the number of templates required is small. PCA does exactly this. Let's see how!
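To make the template idea concrete, here is a toy sketch. All arrays in it are random stand-ins (so `mean_face`, `templates`, and `weights` are hypothetical, not learned from real data); the point is only the shape of the computation, a face expressed as a mean face plus a weighted sum of k templates:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, k = 64 * 64, 200                      # 4096 pixels, 200 templates

mean_face = rng.standard_normal(n_pixels)       # stand-in for the average face
templates = rng.standard_normal((k, n_pixels))  # k templates, each 4096-dim
weights = rng.standard_normal(k)                # per-face mixing coefficients

# a face is approximated as the mean face plus a weighted sum of templates
face = mean_face + weights @ templates          # shape: (4096,)
print(face.shape)
```

Storing a face then only requires its k weights (plus the shared templates), rather than all 4096 pixel values.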

Notebook

You can view the Colab notebook here: PCA LFW (colab.research.google.com)

Dataset

Let’s visualize some images from the dataset. You can see that each image has a complete face, and the facial features like eyes, nose, and lips are clearly visible in each image. Now that we have our dataset ready, let’s compress it.

[Figure: sample face images from the LFW dataset]
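As a rough sketch, the images can be loaded and displayed with scikit-learn's `fetch_lfw_people`. Note this is an assumption about the data source: scikit-learn's default crop size differs from the 64×64 images used in this article, so treat the exact shapes as illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people

lfw = fetch_lfw_people()       # downloads the LFW dataset on first call
images = lfw.images            # shape: (n_samples, height, width)
X = lfw.data                   # flattened: (n_samples, height * width)

# display the first 10 faces in a 2x5 grid
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, img in zip(axes.ravel(), images[:10]):
    ax.imshow(img, cmap="gray")
    ax.axis("off")
plt.show()
```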

Compression

PCA is a 4-step process. Starting with a dataset containing n dimensions (requiring n axes to be represented):

  • Step 1: Find a new set of basis functions (n axes) where some axes contribute to most of the variance in the dataset while others contribute very little.
  • Step 2: Arrange these axes in decreasing order of variance contribution.
  • Step 3: Pick the top k axes to keep and drop the remaining n-k axes.
  • Step 4: Project the dataset onto these k axes.

These steps are explained in detail in my previous article. After these 4 steps, the dataset will be compressed from n dimensions to just k dimensions (k < n), as shown in the sketch below.
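Here is a minimal sketch of the whole pipeline using scikit-learn's `PCA`, which performs steps 1 through 4 internally. It assumes `X` is the flattened data matrix from the loading sketch above; k = 200 is an illustrative choice, not a value prescribed by the article:

```python
from sklearn.decomposition import PCA

k = 200                                     # illustrative choice of k
pca = PCA(n_components=k)                   # steps 1-3: find the axes, sort
                                            # by variance, keep the top k
X_compressed = pca.fit_transform(X)         # step 4: project onto the k axes

print(X_compressed.shape)                   # (n_samples, 200)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained

# approximate reconstruction back to the original pixel space
X_restored = pca.inverse_transform(X_compressed)
```

The rows of `pca.components_` are the learned templates (basis functions), and each row of `X_compressed` holds the k weights that mix them to reconstruct one face.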

