Cloze-driven Pretraining of Self-attention Networks is a new type of language pretraining, published by Facebook AI Research, that uses Transformer modules to learn improved word embeddings for fine-tuning. As this is a paper dissection session, I will explain the main idea of the paper section by section, intuitively and, where needed, mathematically. It would be great if readers open this article and the paper side by side and go through them together. I will skip the Abstract, Introduction, and Related Work sections and start from Section 3.

Prerequisites

Knowledge of Transformers. Along with the original paper, this blog from Jay Alammar is very helpful in understanding Transformers.

3. Two Tower Model:

In this section, the paper presents its novel architecture for pretraining word embeddings. Before jumping directly to the architecture, let’s build the intuition behind this pretraining.

Now, what is Cloze reading? Cloze reading is an instructional strategy where users are required to fill in the blanks within a passage with correct words from a word bank. For example, given the sentence **This is my first medium article**, the pretraining idea is to predict **my** given **This is** and **first medium article**. Makes sense? If not, do not worry. In a moment I will present a diagram that will make it clear.

Let’s come to the two tower analogy. For now, assume these two towers are two black boxes. As the word/token **my** sits between the phrases **This is** and **first medium article**, **This is** goes to the left tower and **first medium article** goes to the right tower as inputs to finally predict **my**. The left or forward tower works left to right: given **This is**, it tries to predict **my**. The right or backward tower works right to left: given **article medium first**, it tries to predict **my**. Sentences are padded with a `<s>` token at the beginning and end. As the input sequences for the two towers are not equal in length, masking needs to be done.
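To make the two inputs concrete, here is a minimal sketch in plain Python that builds the left and right context for every target word of a sentence. The function name `cloze_examples` and the constant `BOUNDARY` are my own illustrative names, not something from the paper, and the `<s>` boundary token is assumed to be the one added at both ends of the sentence.

```python
from typing import List, Tuple

BOUNDARY = "<s>"  # assumed boundary token, added at both ends of the sentence


def cloze_examples(tokens: List[str]) -> List[Tuple[List[str], List[str], str]]:
    """Return (left_context, right_context, target) for every token.

    left_context is what the forward tower reads, in left-to-right order.
    right_context is what the backward tower reads, so it is stored in
    right-to-left order (the last token of the sentence comes first).
    """
    padded = [BOUNDARY] + tokens + [BOUNDARY]
    examples = []
    # Skip the two boundary tokens: only real words are prediction targets.
    for i in range(1, len(padded) - 1):
        left = padded[:i]
        right = padded[i + 1:][::-1]
        examples.append((left, right, padded[i]))
    return examples


sentence = "This is my first medium article".split()
left, right, target = cloze_examples(sentence)[2]
print("forward tower reads :", left)
print("backward tower reads:", right)
print("target              :", target)
```

For the target **my**, the forward tower reads the boundary token followed by **This is**, and the backward tower reads the boundary token followed by **article medium first**. The two contexts have different lengths for most targets, which is the situation the masking mentioned above has to handle.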

3.1 Block Structure

Figure 1: Model Architecture for Cloze-pretraining. Source: Original Paper

In this section, I will talk in detail about the towers. Each tower is a stack of Transformer decoder blocks, as shown in Figure 1. The green blocks are part of the forward tower and the blue blocks are part of the backward tower. In the figure, given `<s>` and **a** in the forward tower and **c** and `<s>` in the backward tower, the model is asked to predict **b**.
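One way to picture how each tower is kept from seeing the target is through attention masks. The sketch below is my own illustration, not the paper's code: it assumes the forward tower uses a mask where each position attends only to itself and earlier positions, the backward tower uses the mirrored mask, and the prediction for a target combines the forward state just before it with the backward state just after it. The helper names `forward_mask`, `backward_mask`, and `visible_context` are hypothetical.

```python
from typing import List, Tuple


def forward_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def backward_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j >= i)."""
    return [[j >= i for j in range(n)] for i in range(n)]


def visible_context(tokens: List[str], target: int) -> Tuple[List[str], List[str]]:
    """Tokens visible to each tower when predicting tokens[target].

    Assumption: the forward state at target - 1 and the backward state at
    target + 1 are the ones combined for the prediction, so neither tower
    ever attends to the target itself.
    """
    n = len(tokens)
    fwd, bwd = forward_mask(n), backward_mask(n)
    seen_forward = [tokens[j] for j in range(n) if fwd[target - 1][j]]
    seen_backward = [tokens[j] for j in range(n) if bwd[target + 1][j]]
    return seen_forward, seen_backward


tokens = ["<s>", "a", "b", "c", "<s>"]
print(visible_context(tokens, target=2))
```

With the Figure 1 sentence and **b** as the target, the forward side sees `<s>` and **a**, and the backward side sees **c** and `<s>`, matching the description above.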

This paper uses a different kind of word embedding, built from CNN encodings of characters. The details of the encoding can be found here. In short, a word is broken into characters and character embeddings are generated. On top of these, Conv1D layers with different filter sizes are applied, producing outputs of different sizes. A max-over-time pooling operation is then applied to obtain a fixed-dimensional representation of the word, which is passed to a highway network to get the final word embedding. This process is shown in Figure 2.
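The pipeline described above (character embeddings, convolutions of several widths, max-over-time pooling, highway layer) can be sketched in dependency-free Python. This is a simplified illustration under my own assumptions: the weights are random and untrained, the sizes (`char_dim=4`, filter widths 1 to 3, two filters per width) are made up for readability, there is a single highway layer, and all function names are hypothetical. A real implementation would use a deep learning library and learned parameters.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conv_max_pool(char_vecs: List[Vector], filters: List[Vector], width: int) -> Vector:
    """Slide each filter over windows of `width` characters, keep the max response."""
    windows = [
        [v for vec in char_vecs[t:t + width] for v in vec]  # flatten one window
        for t in range(len(char_vecs) - width + 1)
    ]
    # Max-over-time pooling: one number per filter, whatever the word length.
    return [max(dot(f, w) for w in windows) for f in filters]


def highway(x: Vector, w_h, b_h, w_t, b_t) -> Vector:
    """y = t * relu(W_h x + b_h) + (1 - t) * x, with gate t = sigmoid(W_t x + b_t)."""
    out = []
    for i, x_i in enumerate(x):
        h = max(0.0, dot(w_h[i], x) + b_h[i])
        t = 1.0 / (1.0 + math.exp(-(dot(w_t[i], x) + b_t[i])))
        out.append(t * h + (1.0 - t) * x_i)
    return out


def init_params(rng: random.Random, char_dim: int = 4,
                widths=(1, 2, 3), filters_per_width: int = 2) -> Dict:
    """Random, untrained parameters with made-up sizes (illustration only)."""
    def rand_vec(n: int) -> Vector:
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    out_dim = len(widths) * filters_per_width
    return {
        "char_dim": char_dim,
        "char_emb": {c: rand_vec(char_dim) for c in "abcdefghijklmnopqrstuvwxyz"},
        "filters": {w: [rand_vec(w * char_dim) for _ in range(filters_per_width)]
                    for w in widths},
        "w_h": [rand_vec(out_dim) for _ in range(out_dim)],
        "b_h": [0.0] * out_dim,
        "w_t": [rand_vec(out_dim) for _ in range(out_dim)],
        "b_t": [0.0] * out_dim,
    }


def encode_word(word: str, p: Dict) -> Vector:
    zero = [0.0] * p["char_dim"]
    chars = [p["char_emb"].get(c, zero) for c in word.lower()]
    # Pad short words so the widest filter still has one full window.
    chars += [zero] * max(0, max(p["filters"]) - len(chars))
    pooled: Vector = []
    for width, filters in p["filters"].items():
        pooled += conv_max_pool(chars, filters, width)
    return highway(pooled, p["w_h"], p["b_h"], p["w_t"], p["b_t"])


params = init_params(random.Random(0))
print(len(encode_word("my", params)), len(encode_word("article", params)))
```

The point to notice is that **my** and **article** have different numbers of characters, yet both come out as vectors of the same length (here, number of widths times filters per width), because max-over-time pooling keeps one value per filter regardless of how many windows the word produced.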
