Introducing more NLP terms

In an earlier blog I gave an introduction to NLP, how it works, and some basic terms. In this blog, I’ll add a few more.

Tokenize

Tokenizing means splitting a document into smaller units of language. Usually this means splitting into words, but we can also tokenize into sentences, or even individual characters. Tokenizing is typically the first step in preparing text, for example before creating word embeddings.

“This” “sentence” “is” “tokenized” “by” “words”
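As a minimal sketch, here is what word tokenization might look like in Python using only the standard library. The regular expression here is a rough stand-in; real projects usually rely on a library tokenizer such as NLTK’s word_tokenize or spaCy, which handle punctuation and edge cases far better.

```python
import re

def tokenize(text):
    # Split text into word tokens: runs of letters, optionally with
    # an apostrophe (so "don't" stays one token). A crude approximation
    # of what a real tokenizer does.
    return re.findall(r"[A-Za-z']+", text)

print(tokenize("This sentence is tokenized by words"))
# ['This', 'sentence', 'is', 'tokenized', 'by', 'words']
```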

Stop Words

Stop words are frequently occurring words that add little to the meaning of a text: words like “the”, “of”, “and”, “a”, and “to”. We often want to remove these words to reduce the size of the text document and give the meaningful words more weight. The top 25 words in the English language make up almost a third of all written material, so removing stop words is an easy way to dramatically reduce the size of a text document.

However, we don’t always want to remove every stop word. Negations often reverse the meaning of a sentence, so it may not make sense to remove them. The most common negations are “not”, “no”, “don’t”, “never”, and “didn’t”. Even words and phrases like “hardly”, “seldom”, and “a little” can change the meaning of a sentence.
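Here is a small sketch of stop-word removal that keeps negations. The stop word and negation lists below are illustrative assumptions, far shorter than real ones (NLTK’s English stop word list, for instance, has well over a hundred entries):

```python
# Illustrative lists only; real stop word lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "is", "in", "it", "not", "no"}
NEGATIONS = {"not", "no", "don't", "never", "didn't"}

def remove_stop_words(tokens):
    # Drop stop words, but keep negations, since removing them
    # can flip the meaning of the sentence.
    return [t for t in tokens
            if t.lower() not in STOP_WORDS or t.lower() in NEGATIONS]

tokens = ["This", "movie", "is", "not", "a", "masterpiece"]
print(remove_stop_words(tokens))
# ['This', 'movie', 'not', 'masterpiece']
```

Notice that dropping “not” here would have left tokens suggesting the opposite sentiment, which is exactly why negations are often exempted.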

N-grams

An n-gram is a sequence of n consecutive words, and some word combinations carry much more meaning together than apart. For example, the words “Los” and “Angeles” have a much more specific meaning when joined together as “Los Angeles”. Since “Los Angeles” is two words, it is known as a bi-gram. In the same way, “New York City” is a tri-gram.
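Generating n-grams from a token list is just a matter of sliding a window of size n across it. A minimal sketch in Python:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list and join
    # each window back into a single string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "flew", "to", "Los", "Angeles"]
print(ngrams(tokens, 2))
# ['I flew', 'flew to', 'to Los', 'Los Angeles']
```

In practice, a follow-up step would score these candidates (by frequency or a measure like pointwise mutual information) to decide which ones, like “Los Angeles”, deserve to be treated as single units.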

#nlp #natural-language-processing #deep-learning #machine-learning
