Natural language processing (NLP) is one of the fastest-growing fields in technology, and it is making its way into many of the products and services we use in day-to-day life. Among the most important stages of an NLP pipeline are text processing and cleaning, including stemming and lemmatization.

Natural Language Processing (NLP)

Textual data can come from a wide variety of sources, such as the World Wide Web, PDFs, Word documents, speech recognition systems, book scans, and optical character recognition (OCR) systems. Our goal is to extract plain text that is free of any source-specific markup or constructs that are not relevant to our task.
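As a rough illustration of stripping source-specific markup, here is a minimal sketch for the HTML case only, using Python's standard html.parser module. The class name TextExtractor and the helper html_to_text are hypothetical names chosen for this example, and a real pipeline would likely need more careful handling of each source type.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = "<h1>Title</h1><p>Some <b>bold</b> text.</p><script>var x = 1;</script>"
print(html_to_text(page))
```

The tags and the script body are discarded, and only the readable text is passed on to the next stage.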

Some features of language, like punctuation, capitalization, and common words such as “a”, “of”, and “the”, often help provide structure to a document but don’t add much to its meaning. So, it's usually best to remove them before analysing the textual data and feeding it into our natural language processing pipeline.
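The cleaning steps described above (lowercasing, removing punctuation, and dropping common words) could be sketched as follows. The STOP_WORDS set is a tiny illustrative list assumed for this example, not a standard one, and normalize is a hypothetical helper name.

```python
import string

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "of", "the", "is", "to", "in"}


def normalize(text):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    lowered = text.lower()
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = lowered.translate(table).split()
    return [token for token in tokens if token not in STOP_WORDS]


print(normalize("The Art of War!"))
```

Note that string.punctuation covers ASCII punctuation only, so typographic quotes and similar characters would need extra handling.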


What Is Stemming?

Stemming is the process of reducing a word to its stem or root form. Let us take an example. Consider three words: “branched”, “branching”, and “branches”. They can all be reduced to the same word, “branch”. After all, all three convey the same idea of something separating into multiple paths or branches. This helps reduce complexity while retaining the essence of the meaning carried by these three words.



Stemming is meant to be a fast and crude operation, carried out by applying very simple search-and-replace style rules.

For example, the suffixes “ing” and “ed” can be dropped, and “ies” can be replaced by “y”. Following these rules may produce stems that are not complete words, but that's okay, because all forms of a word in the corpus are reduced to the same form, which captures the common underlying idea.
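To make the search-and-replace idea concrete, here is a minimal sketch of a rule-based stemmer. It is not the Porter algorithm. The “ing”, “ed”, and “ies” rules come from the text above, while the “es” and “s” rules and the minimum stem length are assumptions added so the example also handles “branches”. The name crude_stem is hypothetical.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# "ies", "ing" and "ed" follow the text; "es" and "s" are added assumptions.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("es", ""),
    ("s", ""),
]

# Assumed guard so that short words such as "red" or "thing" are left alone.
MIN_STEM_LENGTH = 3


def crude_stem(word):
    """Apply the first suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return word[:stem_length] + replacement
    return word


for w in ["branched", "branching", "branches", "ponies", "boring"]:
    print(w, "->", crude_stem(w))
```

The three “branch” variants collapse to the same stem, while “boring” becomes “bor”, which is not a real word but is still a consistent key for every form of that word.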

words = ['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']

Further, NLTK (the Natural Language Toolkit) offers several stemmers to choose from, such as the PorterStemmer that we are using here, the SnowballStemmer, and other language-specific stemmers.
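A short sketch of how these stemmers might be applied is shown below. It assumes NLTK is installed (for example via pip install nltk) and uses a handful of sample words rather than the full list, purely to keep the output short.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# A few sample words, including the "branch" variants discussed earlier.
sample_words = ["started", "branched", "branching", "branches", "boring"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the language name is required

# Stem each word with both stemmers so the results can be compared.
porter_stems = [porter.stem(word) for word in sample_words]
snowball_stems = [snowball.stem(word) for word in sample_words]

for word, p_stem, s_stem in zip(sample_words, porter_stems, snowball_stems):
    print(f"{word}: porter={p_stem}, snowball={s_stem}")
```

Both stemmers work directly on strings and need no extra corpus downloads for this usage, so swapping one for the other is a one-line change.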


Introduction to Stemming vs Lemmatization (NLP)