1597884060

## Principal Component Analysis - now explained in your own terms

The caption in the online magazine “WIRED” caught my eye one night a few months ago. When I focused my eyes on it, it read: “Can everything be explained to everyone in terms they can understand? In 5 Levels, an expert scientist explains a high-level subject in five different layers of complexity - first to a child, then a teenager, then an undergrad majoring in the same subject, a grad student, and finally, a colleague”.

Curious, I clicked on the link and started learning about exciting new concepts I could finally grasp at my own level of understanding. Music, Biology, Physics, Medicine - all seemed clear that night. Needless to say, I couldn’t stop watching the series and went to bed very, very late.

Screenshot of WIRED website, showing the collection of “5-levels” concepts (image: screenshot)

I actually started writing this article while working on a more technical piece. From a few paragraphs in the text, it grew in size until I felt my original article could no longer hold its weight. Could I explain the key concepts to peers and to co-workers, as well as to children and people without mathematical orientation? How far along will readers follow the explanations?

Let’s find out :)

1) Child

Sometimes, when we learn new things, we are told lots of facts and might be shown a drawing or a table with numbers. Seeing a lot of numbers and tables can be confusing, so it would be really helpful if we could reach the same conclusions using fewer of these facts, tables, and numbers.

Principal Component Analysis (or PCA for short) is what we call an _algorithm_: a set of instructions to follow. If we represent all our facts and tables using numbers, following these instructions will allow us to represent them using fewer numbers.

#dimensionality-reduction #mathematics #machine-learning #data-science #linear-algebra #data analysis


1597740600

## Principal Component Analysis - Visualized

If you have ever taken an online course on Machine Learning, you must have come across Principal Component Analysis for dimensionality reduction or, in simple terms, for compression of data. Guess what, I had taken such courses too, but I never really understood the graphical significance of PCA because all I saw was matrices and equations. It took me quite a lot of time to understand this concept from various sources, so I decided to compile it all in one place.

In this article, we will take a visual (graphical) approach to understanding PCA and how it can be used to compress data. Basic knowledge of linear algebra and matrices is assumed. If you are new to these concepts, just follow along; I have tried my best to keep this as simple as possible.

## Introduction

These days, datasets containing a large number of dimensions are increasingly common and are often difficult to interpret. One example is a database of face photographs of, let’s say, 1,000,000 people. If each face photograph has a dimension of 100x100 pixels, then the data of each face is 10,000-dimensional (there are 100x100 = 10,000 unique values to be stored for each face). Now, if 1 byte is required to store the information of each pixel, then 10,000 bytes are required to store 1 face. Since there are 1,000,000 faces in the database, 10,000 x 1,000,000 bytes = 10 GB will be needed to store the dataset.

Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, exploiting the fact that the images in these datasets have something in common. For instance, in a dataset consisting of face photographs, each photograph will have facial features like eyes, a nose, and a mouth. Instead of encoding this information pixel by pixel, we could make a template of each type of these features and then just combine these templates to generate any face in the dataset. In this approach, each template will still be 100x100 = 10,000-dimensional, but since we will be reusing these templates (basis functions) to generate each face in the dataset, the number of templates required will be very small. PCA does exactly this.
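The template idea can be sketched numerically. Below is a minimal NumPy illustration (my own toy example with small array sizes, not the 100x100 faces from the text): data that secretly lies near a 5-dimensional subspace is compressed to 5 numbers per sample via PCA and reconstructed almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": 200 samples of 50-dimensional data that actually
# lies near a 5-dimensional subspace (5 underlying "templates").
templates = rng.normal(size=(5, 50))          # basis vectors
weights = rng.normal(size=(200, 5))           # per-sample mixing weights
X = weights @ templates + 0.01 * rng.normal(size=(200, 50))

# PCA via SVD of the mean-centered data.
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

k = 5                                         # keep 5 principal components
codes = (X - mean) @ Vt[:k].T                 # compressed codes (200 x 5)
X_hat = codes @ Vt[:k] + mean                 # reconstruction from the codes

# Reconstruction error is tiny because the data really is ~5-dimensional:
# 5 numbers per sample stand in for the original 50.
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(rel_err < 0.05)   # True
```

The rows of `Vt[:k]` play the role of the "templates" above, and `codes` holds the per-face mixing weights.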

#principal-component #python #data-science #machine-learning #data-compression #data analysis

1604008800

## Static Code Analysis: What Is It? How to Use It?

Static code analysis refers to the technique of approximating the runtime behavior of a program. In other words, it is the process of predicting the output of a program without actually executing it.

Lately, however, the term “Static Code Analysis” is more commonly used to refer to one of the applications of this technique rather than the technique itself: program comprehension, i.e. understanding a program and detecting issues in it (anything from syntax errors to type mismatches, performance hogs, likely bugs, security loopholes, etc.). This is the usage we’ll be referring to throughout this post.

“The refinement of techniques for the prompt discovery of error serves as well as any other as a hallmark of what we mean by science.”

• J. Robert Oppenheimer

### Outline

We cover a lot of ground in this post. The aim is to build an understanding of static code analysis and to equip you with the basic theory and the right tools so that you can write analyzers of your own.

We start our journey by laying down the essential parts of the pipeline that a compiler follows to understand what a piece of code does. We learn where to tap into this pipeline to plug in our analyzers and extract meaningful information. In the latter half, we get our feet wet and write four such static analyzers, completely from scratch, in Python.

Note that although the ideas here are discussed in light of Python, static code analyzers across all programming languages are carved out along similar lines. We chose Python because of its easy-to-use `ast` module and the wide adoption of the language itself.

### How does it all work?

Before a computer can finally “understand” and execute a piece of code, it goes through a series of complicated transformations:

As you can see in the diagram (go ahead, zoom it!), the static analyzers feed on the output of these stages. To be able to better understand the static analysis techniques, let’s look at each of these steps in some more detail:

### Scanning

The first thing that a compiler does when trying to understand a piece of code is to break it down into smaller chunks, also known as tokens. Tokens are akin to what words are in a language.

A token might consist of a single character, like `(`; a literal (like the integer `7` or the string `'Bob'`); or a reserved keyword of the language (e.g., `def` in Python). Characters which do not contribute to the semantics of a program, like trailing whitespace and comments, are often discarded by the scanner.

Python provides the `tokenize` module in its standard library to let you play around with tokens:

```python
import io
import tokenize

code = b"color = input('Enter your favourite color: ')"

for token in tokenize.tokenize(io.BytesIO(code).readline):
    print(token)
```

This prints:

```
TokenInfo(type=62 (ENCODING),  string='utf-8')
TokenInfo(type=1  (NAME),      string='color')
TokenInfo(type=54 (OP),        string='=')
TokenInfo(type=1  (NAME),      string='input')
TokenInfo(type=54 (OP),        string='(')
TokenInfo(type=3  (STRING),    string="'Enter your favourite color: '")
TokenInfo(type=54 (OP),        string=')')
TokenInfo(type=4  (NEWLINE),   string='')
TokenInfo(type=0  (ENDMARKER), string='')
```

(Note that for the sake of readability, I’ve omitted a few columns from the result above — metadata like starting index, ending index, a copy of the line on which a token occurs, etc.)
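One step further along the pipeline sits the parse tree, exposed by Python’s `ast` module. As a small taste of the analyzers we will build, here is a toy checker of my own (not one of the four from this post) that walks the tree and flags calls to the builtin `eval`:

```python
import ast

code = """
x = eval(input())
y = int(input())
"""

# Parse the source into an abstract syntax tree.
tree = ast.parse(code)

# Walk every node and record the line numbers of `eval(...)` calls.
findings = []
for node in ast.walk(tree):
    if (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"):
        findings.append(node.lineno)

print(findings)   # [2]
```

Unlike the flat token stream, the tree tells us that `eval` here is being *called*, not merely mentioned, which is exactly the kind of structural fact analyzers rely on.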

#code quality #code review #static analysis #static code analysis #code analysis #static analysis tools #code review tips #static code analyzer #static code analysis tool #static analyzer

1593843780

## Simply Explained — Principal Component Analysis

In this blog we will be looking at one of the most important dimensionality reduction techniques: Principal Component Analysis.

Using PCA, we can find the correlation between variables, such as whether summer affects the sale of ice-cream, and by how much. In PCA we will be generating a covariance matrix to check the correlation, but let’s start from scratch.

As we said earlier, PCA is a dimensionality reduction technique, so let’s first take a look at how dimensions are reduced.

## But why do we need to reduce dimensions?

PCA tries to tame a curse of many ML projects: OVERFITTING. Overfitting occurs when a model is tuned too closely to the training data, i.e. the model perfectly fits all the points of the training dataset and generalizes poorly to new data. To reduce this overfitting, we generate principal components.

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. So, the idea is 10-dimensional data gives you 10 principal components, but PCA tries to put maximum possible information in the first component, then maximum remaining information in the second and so on.
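We can check that “squeezing” numerically. The sketch below (my own toy example, echoing the ice-cream illustration above rather than any data from this post) builds two strongly correlated variables and measures how much of the total variance each principal component carries:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two strongly correlated variables, e.g. temperature and ice-cream sales.
temp = rng.normal(25, 5, size=500)
sales = 3 * temp + rng.normal(0, 2, size=500)
X = np.column_stack([temp, sales])

# Eigen-decomposition of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Fraction of total variance explained by each principal component
# (eigh returns eigenvalues in ascending order, so reverse them).
ratios = eigvals[::-1] / eigvals.sum()
print(ratios[0] > 0.95)   # True: PC1 carries almost all the variance
```

Two input variables give two principal components, but because the variables are correlated, the first component alone explains nearly all the information.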

Here, in the image below, I have plotted the best-fit (overfitted) model together with the data points, given two attributes X and Y. We will generate the principal components by viewing the model from different directions.

Source: Author

PC1 — first principal component (generated from view 1)

PC2 — second principal component (generated from view 2)

As you can see in the above image, we reduced the 2-dimensional model to 1 dimension by generating its principal components from the different views. Note that the number of principal components generated can be at most the number of original attributes. Also remember that the components must satisfy the orthogonality property, i.e. each component should be independent of the others.

The main idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables that are correlated with each other, either heavily or lightly, while retaining the variation present in the dataset to the maximum possible extent.
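To make the 2-D-to-1-D reduction and the orthogonality property concrete, here is a minimal NumPy sketch (my own illustration, not the author’s code) with two attributes lying near a line, as in the figure above:

```python
import numpy as np

rng = np.random.default_rng(2)

# 2-D points with attributes X and Y, scattered near a line.
x = rng.normal(0, 3, size=300)
y = 2 * x + rng.normal(0, 0.5, size=300)
data = np.column_stack([x, y])
data = data - data.mean(axis=0)        # center the data

# Principal components = eigenvectors of the covariance matrix.
cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]   # direction of maximum variance
pc2 = eigvecs[:, np.argmin(eigvals)]

# The two components are orthogonal (independent directions).
print(abs(pc1 @ pc2) < 1e-9)   # True

# Reduce 2-D -> 1-D: project every point onto PC1.
reduced = data @ pc1                   # one number per point
print(reduced.shape)   # (300,)
```

After the projection, each point is described by a single coordinate along PC1, yet most of the spread in the original data is preserved.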

#python #principal-component #ai #machine-learning #data-science