Benchmarking Language Detection for NLP

Most NLP applications tend to be language-specific and therefore require monolingual data. In order to build an application in your target language, you may need to apply a preprocessing technique that filters out text written in non-target languages. This requires proper identification of the language of each input example. Below I list some tools you can use as Python modules for this preprocessing requirement, and provide a performance benchmark assessing the speed and accuracy of each one.

langdetect
spaCy language detector
langid
FastText

1) langdetect

langdetect is a Python port of Google’s Java language-detection library. Pass your text to the imported detect function and it returns the two-letter ISO 639-1 code of the language that received the highest confidence score. (Refer to this page for a full list of 639-1 codes and their respective languages.) If you use detect_langs instead, it returns a list of the top candidate languages along with their probabilities.

from langdetect import detect, detect_langs

text = "My lubimy mleko i chleb."
detect(text)        # 'cs'
detect_langs(text)  # [cs:0.7142840957132709, pl:0.14285810606233737, sk:0.14285779665739756]
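Because detect_langs exposes probabilities, you can decline to label a text when the top candidate is not convincing. The following is a minimal sketch of that idea, not part of langdetect itself: pick_confident, min_prob, and the Candidate stand-in are hypothetical names introduced here, and the sketch assumes each returned object carries lang and prob attributes, as langdetect's result objects do.

```python
from collections import namedtuple

# Stand-in for the objects returned by detect_langs, so this sketch runs
# without langdetect installed. Assumption: the real objects expose the
# same .lang and .prob attributes.
Candidate = namedtuple("Candidate", ["lang", "prob"])


def pick_confident(candidates, min_prob=0.9):
    """Return the most probable language code, or None if it scores below min_prob."""
    best = max(candidates, key=lambda c: c.prob, default=None)
    if best is None or best.prob < min_prob:
        return None
    return best.lang


# Probabilities rounded from the example output above.
ambiguous = [Candidate("cs", 0.71), Candidate("pl", 0.14), Candidate("sk", 0.14)]
print(pick_confident(ambiguous))       # None
print(pick_confident(ambiguous, 0.5))  # cs
```

With langdetect installed, you would pass detect_langs(text) in place of the hand-built list. The 0.9 default is an arbitrary choice for illustration and should be tuned on your own data.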

A few additional points:

The library authors recommend setting the DetectorFactory seed to a fixed number. langdetect’s algorithm is non-deterministic, so running it on a text that is too short or too ambiguous can give different results from one run to the next. Setting the seed enforces consistent results during development and evaluation.
You may also want to wrap the detect call in a try/except block that catches LangDetectException. Otherwise you will likely hit a “No features in text” error, which occurs when the language of the input cannot be evaluated, for example when it contains only URLs, numbers, or formulas.

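from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# Fix the seed so repeated runs give the same result.
DetectorFactory.seed = 0


def is_english(text):
    """Return True only if langdetect identifies the text as English."""
    try:
        return detect(text) == "en"
    except LangDetectException:
        # Raised when the text has no usable features (e.g. only digits).
        return False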
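The same pattern extends to filtering a whole dataset down to one target language, which is the preprocessing step described in the introduction. Below is a rough sketch under stated assumptions: filter_by_language, its errors parameter, and toy_detect are hypothetical names introduced here, and toy_detect is a deliberately crude stand-in so the example runs without any detection library installed.

```python
def filter_by_language(texts, target_lang, detect_fn, errors=(Exception,)):
    """Keep only the texts that detect_fn labels as target_lang.

    Texts for which detect_fn raises one of `errors` are dropped,
    mirroring the try/except pattern used in is_english.
    """
    kept = []
    for text in texts:
        try:
            lang = detect_fn(text)
        except errors:
            continue  # undetectable input, e.g. digits only
        if lang == target_lang:
            kept.append(text)
    return kept


def toy_detect(text):
    """Toy stand-in for a real detector (assumption, for illustration only)."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text")
    return "en" if " the " in f" {text.lower()} " else "xx"


samples = ["The cat sat on the mat.", "12345", "My lubimy mleko i chleb."]
print(filter_by_language(samples, "en", toy_detect, errors=(ValueError,)))
# ['The cat sat on the mat.']
```

With langdetect, you would pass detect as detect_fn and errors=(LangDetectException,), after setting DetectorFactory.seed. Taking the detector as a parameter also makes it easy to swap in any of the other tools when benchmarking.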
