A few weeks ago, researchers from Google, the University of Cambridge, DeepMind and the Alan Turing Institute released the paper Rethinking Attention with Performers, which tackles the softmax bottleneck in transformers [1]. Their approach relies on a clever mathematical trick, which I will explain in this article.

Prerequisites:

  • Some knowledge of transformers
  • Kernel functions

Topics covered:

  • Why transformers?
  • The problem with transformers
  • Sidestepping the softmax bottleneck

Why transformers?

In essence, the Transformer is a model designed to work efficiently with sequential data, and it is heavily used in Natural Language Processing (NLP) tasks, which require handling sequences of words or characters. Unlike other sequential models, the Transformer exploits attention mechanisms to process sequential data in parallel (i.e. without stepping through one word/input at a time) [2], as the sketch below illustrates.
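
To make that parallelism concrete, here is a minimal NumPy sketch of standard scaled dot-product attention. The function names, toy shapes and random inputs are purely illustrative and not taken from the paper; the point is that every query attends to every key in a single matrix product, which also hints at the quadratic cost discussed next.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention: all L queries are compared
    # with all L keys at once, so the whole sequence is processed in
    # parallel rather than token by token.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (L, L) pairwise similarities
    weights = softmax(scores, axis=-1)   # softmax over keys for each query
    return weights @ V                   # weighted sum of value vectors

# Toy example: a sequence of L = 4 tokens with d = 8 dimensional embeddings
rng = np.random.default_rng(0)
L, d = 4, 8
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note that the intermediate `scores` matrix has shape (L, L), so memory and compute grow quadratically with the sequence length; this is the cost that the Performer's approximation is designed to avoid.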

