If you’ve taken an introductory Machine Learning class, you’ve certainly come across the issue of overfitting and been introduced to the concepts of regularization and norms. I often see these discussed purely in terms of the formulas, so I figured I’d try to give better insight into why exactly minimizing the norm induces regularization, and how L1 and L2 differ from each other, using some visual examples.

Prerequisite knowledge

  • Linear regression
  • Gradient descent
  • Some understanding of overfitting and regularization

Topics covered

  • Why does minimizing the norm induce regularization?
  • What’s the difference between the L1 norm and the L2 norm?

Recap of regularization

Using the example of linear regression, our loss is given by the Mean Squared Error (MSE):

$$\mathrm{MSE}(w) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2, \qquad \hat{y}_i = w^\top x_i$$

and our goal is to minimize this loss:

$$\hat{w} = \operatorname*{arg\,min}_{w} \; \mathrm{MSE}(w)$$
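To make the minimization concrete, here is a minimal sketch in plain Python that runs gradient descent on the MSE of a one-feature linear model with an intercept. The toy data, the learning rate, the step count and all function names are assumptions chosen purely for illustration.

```python
def mse(w, xs, ys):
    """Mean squared error of the linear model y_hat = w[0] + w[1] * x."""
    n = len(xs)
    return sum((y - (w[0] + w[1] * x)) ** 2 for x, y in zip(xs, ys)) / n


def mse_gradient(w, xs, ys):
    """Gradient of the MSE with respect to the two weights (w[0], w[1])."""
    n = len(xs)
    residuals = [(w[0] + w[1] * x) - y for x, y in zip(xs, ys)]
    g0 = 2.0 * sum(residuals) / n
    g1 = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
    return [g0, g1]


def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Plain gradient descent starting from w = (0, 0).

    lr and steps are illustrative values that happen to work for the toy
    data below; they are not tuned recommendations.
    """
    w = [0.0, 0.0]
    for _ in range(steps):
        g = mse_gradient(w, xs, ys)
        w = [w[0] - lr * g[0], w[1] - lr * g[1]]
    return w


# Toy data generated from y = 1 + 2x without noise (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_hat = gradient_descent(xs, ys)
print("estimated weights:", w_hat)
print("final MSE:", mse(w_hat, xs, ys))
```

Because the data is noise-free, the estimated weights should end up very close to the intercept 1 and slope 2 that generated it, and the final MSE should be essentially zero.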

To prevent overfitting, we want to add a bias towards less complex functions. That is, given two functions that can fit our data reasonably well, we prefer the simpler one. We do this by adding a regularization term, typically either the L1 norm or the squared L2 norm:

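$$\hat{w} = \operatorname*{arg\,min}_{w} \; \mathrm{MSE}(w) + \lambda \lVert w \rVert_1$$

$$\hat{w} = \operatorname*{arg\,min}_{w} \; \mathrm{MSE}(w) + \lambda \lVert w \rVert_2^2$$

where $\lambda \ge 0$ controls how strongly the norm is penalized.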
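The idea of preferring the simpler of two functions that fit equally well can be sketched in a few lines of plain Python. The example below is hypothetical. It uses two identical features so that several weight vectors fit the data perfectly, then compares them under an L1 penalty and a squared L2 penalty. The data, the candidate weights, the value of `lam` and the function names are all assumptions made for illustration.

```python
def predict(w, x):
    """Linear model without intercept: y_hat = sum_j w[j] * x[j]."""
    return sum(wj * xj for wj, xj in zip(w, x))


def mse(w, X, y):
    """Mean squared error of the linear model over all rows of X."""
    return sum((yi - predict(w, xi)) ** 2 for xi, yi in zip(X, y)) / len(y)


def l1_norm(w):
    """Sum of absolute values of the weights."""
    return sum(abs(wj) for wj in w)


def squared_l2_norm(w):
    """Sum of squared weights."""
    return sum(wj * wj for wj in w)


def regularized_loss(w, X, y, lam, penalty):
    """MSE plus lam times the chosen penalty (l1_norm or squared_l2_norm)."""
    return mse(w, X, y) + lam * penalty(w)


# Two duplicated features, with targets y = 2 * x (toy data, assumed).
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

# Three weight vectors that all reproduce the targets exactly.
candidates = [[2.0, 0.0], [1.0, 1.0], [5.0, -3.0]]
lam = 0.1  # illustrative regularization strength

for w in candidates:
    print(
        w,
        "MSE:", mse(w, X, y),
        "L1-regularized:", regularized_loss(w, X, y, lam, l1_norm),
        "L2-regularized:", regularized_loss(w, X, y, lam, squared_l2_norm),
    )
```

All three candidates have zero MSE, so the data alone cannot choose between them. Their L1 norms are 2, 2 and 8 and their squared L2 norms are 4, 2 and 34, so either penalty rules out the large-weight solution [5, -3], while only the squared L2 norm separates [2, 0] from [1, 1].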


Visualizing regularization and the L1 and L2 norms