A few weeks ago, researchers from Google, the University of Cambridge, DeepMind and the Alan Turing Institute released the paper Rethinking Attention with Performers, which tackles the softmax bottleneck in transformers [1]. Their approach relies on a clever mathematical trick, which I will explain in this article.

Prerequisites:

  • Some knowledge of transformers
  • Kernel functions

Topics covered:

  • Why transformers?
  • The problem with transformers
  • Sidestepping the softmax bottleneck

Why transformers?

In essence, the Transformer is a model designed to work efficiently with sequential data, and it is heavily used in Natural Language Processing (NLP) tasks, which require handling sequences of words or characters. Unlike other sequential models, the Transformer exploits attention mechanisms to process sequential data in parallel (i.e. without stepping through one word/input at a time) [2], as the sketch below illustrates.
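
To make that parallelism concrete, here is a minimal NumPy sketch of standard scaled dot-product attention. The function names, toy shapes and random inputs are purely illustrative and not taken from the paper; the point is that every query attends to every key in a single matrix product, which also hints at the quadratic cost discussed next.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention: all L queries are compared
    # with all L keys at once, so the whole sequence is processed in
    # parallel rather than token by token.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (L, L) pairwise similarities
    weights = softmax(scores, axis=-1)   # softmax over keys for each query
    return weights @ V                   # weighted sum of value vectors

# Toy example: a sequence of L = 4 tokens with d = 8 dimensional embeddings
rng = np.random.default_rng(0)
L, d = 4, 8
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that the intermediate `scores` matrix has shape (L, L), so memory and compute grow quadratically with the sequence length; this is the cost that the Performer's approximation is designed to avoid.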

