Comparing Code Snippets for Text Normalization in NLTK and spaCy

Why Text Normalization?

Most NLP tasks require us to refer to a dictionary (a vocabulary) to teach the machine each word’s context. It is logical to think that the smaller the vocabulary, the better the performance of our NLP task.

Almost every NLP pipeline includes a text normalization step. Text normalization is the process of transforming text into a canonical (standard) form. It is one of the important steps in text preprocessing because it reduces the noise created when a single word appears in multiple forms. For example, “connect”, “connected”, and “connects” all refer to the word “connect”, so it is easier to look up 1 word in the dictionary than to look up 3.

Two important tasks in text normalization are:

Stemming - The process of reducing a word to its stem or root form, typically by stripping suffixes according to rules.

Lemmatization - The transformation that uses a dictionary to map a word’s variants back to its root form (the lemma).
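
To make the difference between the two tasks concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not the algorithm that NLTK or spaCy actually uses: the suffix list `SUFFIXES`, the mini-dictionary `LEMMA_DICT`, and the functions `toy_stem` and `toy_lemmatize` are all hypothetical names made up for this example.

```python
# Toy illustration only: this is NOT how NLTK or spaCy implement
# stemming or lemmatization. All names and data here are hypothetical.

# A deliberately tiny rule set for suffix stripping.
SUFFIXES = ("ing", "ed", "s")

# A mini-dictionary mapping word variants to their lemma.
LEMMA_DICT = {
    "connected": "connect",
    "connects": "connect",
    "connecting": "connect",
    "was": "be",
    "better": "good",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


words = ["Connect", "connected", "connects", "was", "better"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # rule-based: only suffixes are removed
print(lemmas)  # dictionary-based: irregular forms are mapped too
```

Both approaches collapse “connect”, “connected”, and “connects” into one vocabulary entry. The rule-based stemmer leaves irregular forms such as “was” and “better” untouched, while the dictionary lookup can map them to “be” and “good”, at the cost of needing a dictionary that covers the words.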

