Language identification can be an important step in a Natural Language Processing (NLP) problem. It involves predicting the natural language of a piece of text. We need to know the language of a text before other steps (e.g. translation or sentiment analysis) can be applied. For instance, if you go to Google Translate, the box you type in says ‘Detect Language’. This is because Google first has to identify the language of your sentence before it can translate it.


There are several different approaches to language identification and, in this article, we’ll explore one in detail: a Neural Network that uses character n-grams as features. In the end, we show that an accuracy of over 98% can be achieved with this approach. Along the way, we will discuss key pieces of code, and you can find the full project on GitHub. First, we’ll discuss the dataset that we’ll use to train our Neural Network.
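Before turning to the data, it may help to see what character n-grams look like. The snippet below is only a minimal sketch, not the project's code: the helper names `char_ngrams` and `ngram_counts`, the choice of n = 3, and the lowercasing step are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every overlapping run of n characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, n=3):
    """Count how often each character n-gram occurs in text.

    Lowercasing is an assumption here, not something the article specifies.
    """
    return Counter(char_ngrams(text.lower(), n))


trigrams = char_ngrams("hello", 3)   # ['hel', 'ell', 'llo']
counts = ngram_counts("banana", 3)   # 'ana' occurs twice, 'ban' and 'nan' once
```

Counts like these, computed over a fixed vocabulary of n-grams, are the kind of numeric feature vector a Neural Network can take as input.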

Dataset

The dataset is provided by Tatoeba. The full dataset consists of 6,872,356 sentences in 328 unique languages. To simplify our problem we will consider:

  • 6 languages written in the Latin alphabet: English, German, Spanish, French, Portuguese and Italian.
  • Sentences between 20 and 200 characters long.
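The two filters above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the project's actual preprocessing: the three-letter language codes, the inclusive length bounds, and the names `keep_sentence`, `filter_sentences` and `sample_rows` are hypothetical, and the sample rows are made up for the demonstration.

```python
# Assumed language codes for the six languages; check them against the real export.
LANGUAGES = {"eng", "deu", "spa", "fra", "por", "ita"}

# Assumed to be inclusive bounds on the character count.
MIN_LEN, MAX_LEN = 20, 200


def keep_sentence(lang, text):
    """True if the sentence is in a selected language and within the length bounds."""
    return lang in LANGUAGES and MIN_LEN <= len(text) <= MAX_LEN


def filter_sentences(rows):
    """Keep only the (lang, text) pairs that pass both filters."""
    return [(lang, text) for lang, text in rows if keep_sentence(lang, text)]


# Made-up rows, used only to show the filter in action.
sample_rows = [
    ("eng", "This sentence is long enough to keep."),
    ("eng", "Too short."),
    ("jpn", "これは日本語の文です。これは日本語の文です。"),
    ("fra", "Cette phrase est assez longue pour rester."),
]

filtered = filter_sentences(sample_rows)
```

Here `filtered` keeps the longer English sentence and the French one; the short sentence and the Japanese one are dropped.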

We can see an example of a sentence from each language in Table 1. Our objective is to create a model that can predict the Target Variable using the Text provided.


Deep Neural Network Language Identification