In this post, we are going to re-play the classic Multi-Layer Perceptron. Most importantly, we will play the solo called backpropagation, which is, indeed, one of the machine-learning standards.

## Introduction

Writing a custom implementation of a popular algorithm can be compared to playing a musical standard. For as long as the code reflects upon the equations, the functionality remains unchanged. It is, indeed, just like playing from notes. However, it lets you master your tools and practice your ability to hear and think.

In this post, we are going to re-play the classic Multi-Layer Perceptron. Most importantly, we will play the solo called **backpropagation**, which is, indeed, one of the machine-learning standards.

As usual, we are going to show how the *math translates into code*. In other words, we will take the notes (equations) and play them using bare-bone *numpy*.

FYI: Feel free to check another “implemented from scratch” article on Hidden Markov Models here.

## Overture — A Dense Layer

## Data

Let our (most generic) data be described as pairs of question-answer examples:_ (***_x**, y)*, where **X* **∈ ℝ^*m× n* is as a matrix of *feature vectors*, **y*_ ∈ ℝᵐ *is known a matrix of _labels* and *i _refers to an index of a particular data example. Here, by _n _we understand the number of features and is the number of examples, so _i ∈ {1, 2, …, m}.* Also, we assume that *y ∈ {0, 1} _thus posing a _binary classification* problem for us. Here, it is important to mention that the approach won’t be much different if _y _was a multi-class or a continuous variable (regression).

To generate the mock for the data, we use the *sklearn’s* make_classification function.

```
import numpy as np
from sklearn.datasets import make_classification
np.random.seed(42)
X, y = make_classification(n_samples=10, n_features=4,
n_classes=2, n_clusters_per_class=1)
y_true = y.reshape(-1, 1)
```

Note that we do not split the data into the training and test datasets, as our goal would be to construct the network. Therefore, if the model *overfits* it would be a perfect sign!

At this stage, we adopt the convention that `axis=0`

shall refer to the examples *i*, while `axis=1`

will be reserved for the features *j ∈ {1, 2, …, n}*. Naturally, for binary classification dim **_y _***= (m, 1)*.

machine-learning
mathematics
keras
data-science
python