This is my second article on text summarization. In my first article, I talked about extractive approaches to summarizing text and the metrics used to evaluate them. In this article, we will talk about abstractive summarization and see how deep learning can be used to summarize text. So, let’s dive in.

Abstractive Summarizers

Abstractive summarizers are so called because they do not select sentences from the original text passage to create the summary. Instead, they produce a paraphrasing of the main contents of the given text, using a vocabulary that may differ from the original document. This is very similar to what we as humans do to summarize: we create a semantic representation of the document in our heads, then pick words from our general vocabulary (the words we commonly use) that fit that representation, to create a short summary covering the main points of the document. As you may notice, this kind of summarizer is harder to build, because it requires natural language generation. Let’s look at the most widely used approach to the problem.

Application of sequence-to-sequence RNNs

The approach was proposed in a paper by Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang from IBM. The term “sequence-to-sequence models” is used because these models are designed to produce an output sequence of words from an input sequence of words. In our case, the input sequence is the original text document and the output sequence is the shortened summary.

The paper proposes a model inspired by the attentional Recurrent Neural Network encoder-decoder model, which was first proposed for machine translation by Dzmitry Bahdanau (then at Jacobs University Bremen, Germany) and his co-authors.

The two problems differ considerably, though, as you can probably already sense. First, machine translation needs to be lossless: we want the exact sentence in translated form. Summary generation, on the other hand, must compress the original document to create the summary, so it needs to be somewhat lossy. Second, in summary generation the length of the summary does not depend on the length of the original text. These two points are the key challenges of the problem, as noted in the paper.

Before jumping into the application details of the paper, let’s look at encoder and decoder networks and the reason for using an attention layer.

Encoder and Decoder Networks

Consider a general LSTM (Long Short-Term Memory) layer, which looks something like the diagram given below. It either produces an output for every input time step or creates a single feature vector, which is later consumed by dense neural network layers with a softmax on top for classification tasks. An example is sentiment detection, where we pass the whole sentence through the recurrent layer and feed the resulting feature vector to a softmax layer to produce the final result.

[Figure: a general LSTM layer]
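
To make the two uses concrete, here is a minimal Keras sketch (my own illustration, not code from any paper); the vocabulary size, layer dimensions, and the two-class sentiment head are placeholder assumptions.

```python
# Illustrative sketch: the two common ways to use an LSTM layer in Keras.
from tensorflow.keras import layers, models

vocab_size, embed_dim, hidden_dim, max_len = 10000, 128, 256, 50  # assumed sizes

tokens = layers.Input(shape=(max_len,))
embedded = layers.Embedding(vocab_size, embed_dim)(tokens)

# (a) an output for every input time step
per_step = layers.LSTM(hidden_dim, return_sequences=True)(embedded)  # (batch, max_len, hidden_dim)
per_step_model = models.Model(tokens, per_step)

# (b) a single feature vector fed to a dense softmax layer, e.g. sentiment detection
feature_vec = layers.LSTM(hidden_dim)(embedded)                      # (batch, hidden_dim)
sentiment = layers.Dense(2, activation="softmax")(feature_vec)       # positive / negative
sentiment_model = models.Model(tokens, sentiment)
```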

One thing to realize, however, is that for the current problem, and more generally for any sequence-to-sequence problem such as machine translation, this particular model cannot be applied directly. The main reason is that both the input and the output are sequences, and the length of the output is independent of the length of the input. To deal with this, the encoder-decoder network model was introduced.

[Figure: the encoder-decoder architecture]

The basic architecture of the model is shown in the diagram above.
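
As a rough illustration of that skeleton, here is a hedged Keras sketch of an encoder-decoder pair (the paper’s actual model adds attention and several other mechanisms on top of this; all vocabulary and layer sizes below are assumptions).

```python
# Hedged sketch of the basic encoder-decoder skeleton for summarization.
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, embed_dim, hidden_dim = 20000, 20000, 128, 256  # assumed sizes

# Encoder: read the source document and keep only its final internal state
enc_in = layers.Input(shape=(None,), name="document_tokens")
enc_emb = layers.Embedding(src_vocab, embed_dim)(enc_in)
_, state_h, state_c = layers.LSTM(hidden_dim, return_state=True)(enc_emb)

# Decoder: generate the summary, starting from the encoder's final state
dec_in = layers.Input(shape=(None,), name="summary_tokens")
dec_emb = layers.Embedding(tgt_vocab, embed_dim)(dec_in)
dec_out, _, _ = layers.LSTM(hidden_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c]
)
word_probs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

seq2seq = models.Model([enc_in, dec_in], word_probs)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

During training, the decoder is typically fed the reference summary shifted by one token (teacher forcing); at inference time, it generates the summary one word at a time, feeding each prediction back in.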

The encoder is responsible for taking in the input sentence or original document and generating a final state vector (the hidden state and cell state), represented as the internal state in the diagram. The encoder may contain LSTM, plain RNN, or GRU layers; LSTM layers are used most often because they largely alleviate the exploding and vanishing gradient problems.
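
To see what that final state vector looks like in practice, here is a small hedged example (placeholder shapes, not the paper’s code) showing that an LSTM encoder yields two state tensors, a hidden state and a cell state, while a GRU carries a single state.

```python
import numpy as np
from tensorflow.keras import layers

hidden_dim = 256
dummy_batch = np.random.rand(4, 50, 128).astype("float32")  # (batch, time steps, embedding dim)

# LSTM encoder: final internal state = hidden state + cell state
_, h, c = layers.LSTM(hidden_dim, return_state=True)(dummy_batch)
print(h.shape, c.shape)   # (4, 256) (4, 256) -> passed to the decoder as its initial state

# GRU encoder: a single final hidden state
_, h_gru = layers.GRU(hidden_dim, return_state=True)(dummy_batch)
print(h_gru.shape)        # (4, 256)
```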

#abstractive-summarization #text-summarization #deep-learning
