Building a State-of-the-art Text Classifier for any language you want!

Text classification is a classic problem in NLP. However, the ‘NLP evolution’ of recent years has made it possible to tackle such a problem without needing expertise in the field. Recent advancements have shown that models make better classifiers if they first learn to understand the language itself (language modelling).

Overview

The idea is to first train (or use a pretrained model, if available) a language model on the Wikipedia dataset of that language, one that can accurately predict the next word given a set of words (much like the keyboard on our phones does when it suggests a word). We then fine-tune that model to classify reviews, tweets, articles, etc., and, amazingly, with a few tweaks we can build a state-of-the-art text classification model for that language. For the purpose of this article, I’ll build a language model for Hindi and use it to classify reviews/articles.

The method used in building this model is ULMFiT (Universal Language Model Fine-tuning for Text Classification). The underlying concepts behind ULMFiT are intricate enough that explaining them in detail is better left to a separate article. However, the fastai v1 library makes language modelling and text classification quite easy and straightforward (it requires less than 20 lines of code!).
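
To give a sense of how compact this ends up being, here is a rough sketch of the end-to-end ULMFiT pipeline in fastai v1 on a hypothetical labelled file texts.csv (label column first, then text, which is what from_csv expects by default). Note that the bundled pretrained AWD-LSTM weights are for English, which is exactly why the rest of this article first builds a Hindi language model.

from fastai.text import *

path = Path('data')   ## hypothetical folder containing texts.csv

## 1. fine-tune a language model on the target corpus
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('fine_tuned_enc')   ## keep the encoder for the classifier

## 2. train a classifier on top of the fine-tuned encoder
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab)
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('fine_tuned_enc')
learn_clas.fit_one_cycle(1, 1e-2)

The key step is saving the encoder after fine-tuning the language model and loading it into the classifier; that transfer is what ULMFiT is all about.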

ULMFiT is also described by Jeremy Howard in his MOOCs, which can be found at these links. All the code mentioned below is available on my GitHub.

The Wikipedia Dataset

We start off by downloading and cleaning the Wikipedia articles written in Hindi.

%reload_ext autoreload
%autoreload 2
%matplotlib inline

## importing the required libraries
from fastai import *
from fastai.text import *
torch.cuda.set_device(0)
## Initializing variables
## each language has its own code, defined here (under the column
## 'wiki'): https://meta.wikimedia.org/wiki/List_of_Wikipedias
data_path = Config.data_path()
lang = 'hi'
name = f'{lang}wiki'
path = data_path/name
path.mkdir(exist_ok=True, parents=True) ## create directory
lm_fns = [f'{lang}_wt', f'{lang}_wt_vocab'] ## filenames for the language model weights and vocab

Now let’s download the articles from Wikipedia. Wikipedia maintains a List of Wikipedias that contains information about the number of articles in a particular language, the number of edits, and the depth. The ‘depth’ column (defined as [Edits/Articles] × [Non-Articles/Articles] × [1 − Stub-ratio]) is a rough indicator of a Wikipedia’s quality, showing how frequently its articles are updated. The higher the depth, the higher the expected quality of the articles.
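
In the notebook, this download-and-clean step is handled by a small helper (the fastai NLP course ships a similar one in nlputils.py). Below is a rough sketch of what such a helper might do, assuming the standard Wikimedia dump URL layout and the attardi wikiextractor script (flag names differ between wikiextractor versions); it reuses the path, lang and name variables defined above.

import bz2, shutil, subprocess
from urllib.request import urlretrieve

xml_fn = f'{lang}wiki-latest-pages-articles.xml'
zip_fn = f'{xml_fn}.bz2'
dump_url = f'https://dumps.wikimedia.org/{name}/latest/{zip_fn}'

if not (path/xml_fn).exists():
    ## download the compressed dump and decompress it to plain XML
    urlretrieve(dump_url, str(path/zip_fn))
    with bz2.open(path/zip_fn) as fin, open(path/xml_fn, 'wb') as fout:
        shutil.copyfileobj(fin, fout)

## clone the extractor script once and run it over the dump to strip wiki
## markup, keeping only reasonably long, non-disambiguation articles
subprocess.run(['git', 'clone', 'https://github.com/attardi/wikiextractor.git'],
               cwd=path, check=True)
subprocess.run(['python', 'wikiextractor/WikiExtractor.py', '--no_templates',
                '--min_text_length', '1800', '--filter_disambig_pages',
                '-o', 'text', xml_fn], cwd=path, check=True)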
