A Comprehensive Guide To Fine-Tuning BERT For Text Classification And SQuAD Tasks
Fine-Tuning BERT for Text Classification and Question Answering Using the TensorFlow Framework
Google BERT (Bidirectional Encoder Representations from Transformers) and other transformer-based models further improved the state of the art on eleven natural language processing tasks under the broad categories of single-text classification (e.g., sentiment analysis), text-pair classification (e.g., natural language inference), question answering (e.g., SQuAD 1.1), and text tagging (e.g., named entity recognition).
The BERT model is based on a few key ideas:
- An attention-only model without RNNs (LSTM/GRU) is computationally more attractive (it processes the input in parallel rather than sequentially) and even performs better than RNNs (it can remember information beyond roughly 100 words).
- BERT represents words as subwords or n-grams. On average, a vocabulary of 8k to 30k n-grams can represent any word in a large corpus, which is a significant advantage in terms of memory.
- It eliminates the need for task-specific architectures. A pre-trained BERT model can be used as is for a wide variety of NLP tasks with fine-tuning. Earlier approaches (such as ELMo) still needed a task-specific architecture on top; for example, a model for Q&A would have a very different architecture from a model that solved NER.
- Word2vec and GloVe word embeddings are context-independent: these models output just one vector (embedding) for each word, combining all the different senses of the word into one vector. Given the abundance of polysemy and complex semantics in natural languages, context-independent representations have obvious limitations. For instance, the word *crane* has completely different meanings in the contexts *a crane is flying* and *a crane driver came*; thus, the same word should be assigned different representations depending on its context. BERT can generate different embeddings for the same word that capture its context, that is, the sentence it appears in and its position there.
- Unlike the GPT model, which is also an effort to design a general task-agnostic model for context-sensitive representations, BERT encodes context bidirectionally. Because language models are autoregressive, GPT reads text in one direction only (left to right), so each token sees only the tokens before it.
- Transfer learning. This advantage has nothing directly to do with the model architecture. Because these models are trained on a language modeling task (and other tasks too, in the case of BERT), they can be used for downstream tasks that have very little labeled data. During supervised learning of downstream tasks, BERT is similar to GPT in two aspects. First, BERT representations are fed into an added output layer, with minimal changes to the model architecture depending on the nature of the task, such as predicting for every token vs. predicting for the entire sequence. Second, all the parameters of the pretrained Transformer encoder are fine-tuned, while the additional output layer is trained from scratch.
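
To make the subword idea from the list above more concrete, here is a minimal sketch of greedy longest-match-first splitting, in the style of the WordPiece tokenizer that BERT uses. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and exist only for illustration; a real BERT vocabulary is learned from a corpus and is far larger. The `##` prefix marking a piece that continues a word follows BERT's convention.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Nothing in the vocabulary matches here: treat the word as unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this sketch, not BERT's real vocabulary).
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able", "[UNK]"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))  # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

With only six real pieces, the toy vocabulary already covers words such as "played", "playable", and "unplayable", which hints at why a vocabulary of a few tens of thousands of pieces can cover a large corpus.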