A significant portion of the data that is generated today is unstructured. Unstructured data includes social media comments, browsing history and customer feedback. Have you found yourself in a situation with a bunch of textual data to analyse, and no idea how to proceed? Natural language processing in Python can help.
The objective of this tutorial is to enable you to analyze textual data in Python through the concepts of Natural Language Processing (NLP). You will first learn how to tokenize your text into smaller chunks, then normalize words to their root forms, and finally remove any noise in your documents to prepare them for further analysis.
Let’s get started!
In this tutorial, we will use Python’s nltk library to perform all NLP operations on the text. At the time of writing this tutorial, we used version 3.4 of nltk. To install the library, you can use the pip command on the terminal:
pip install nltk==3.4
To check which version of nltk is installed on your system, you can import the library into the Python interpreter and check the version:
import nltk
print(nltk.__version__)
To perform certain actions within nltk in this tutorial, you may have to download specific resources. We will describe each resource as and when required.
However, if you would like to avoid downloading individual resources later in the tutorial and grab them now in one go, run the following command:
python -m nltk.downloader all
A computer system cannot find meaning in natural language by itself. The first step in processing natural language is to convert the original text into tokens. A token is a sequence of contiguous characters that carries some meaning. It is up to you to decide how to break a sentence into tokens. For instance, an easy method is to split a sentence by whitespace to break it into individual words.
In the NLTK library, you can use the word_tokenize() function to convert a string to tokens. However, you will first need to download the punkt resource through the NLTK downloader.
Next, you need to import word_tokenize from nltk.tokenize to use it.
from nltk.tokenize import word_tokenize
print(word_tokenize("Hi, this is a nice hotel."))
The output of the code is as follows:
['Hi', ',', 'this', 'is', 'a', 'nice', 'hotel', '.']
You’ll notice that word_tokenize does not simply split a string based on whitespace, but also separates punctuation into tokens. It’s up to you whether to retain the punctuation marks in the analysis.
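To make the difference concrete, here is a minimal sketch that compares plain whitespace splitting with a simple regular expression that also separates punctuation. The regular expression is only an illustration of the idea and is an assumption of this sketch; it is not how word_tokenize is implemented, and it will disagree with NLTK on cases such as contractions.

```python
import re

sentence = "Hi, this is a nice hotel."

# Splitting on whitespace leaves punctuation attached to the neighbouring word.
whitespace_tokens = sentence.split()
print(whitespace_tokens)
# ['Hi,', 'this', 'is', 'a', 'nice', 'hotel.']

# Illustrative pattern: runs of word characters, or any single character
# that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
# ['Hi', ',', 'this', 'is', 'a', 'nice', 'hotel', '.']
```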
When you are processing natural language, you’ll often notice that there are various grammatical forms of the same word. For instance, “go”, “going” and “gone” are forms of the same verb, “go”.
While the necessities of your project may require you to retain words in various grammatical forms, let us discuss a way to convert various grammatical forms of the same word into its base form. There are two techniques that you can use to convert a word to its base.
The first technique is stemming. Stemming is a simple algorithm that removes affixes from a word. There are various stemming algorithms available for use in NLTK. We will use the Porter algorithm in this tutorial.
We first import PorterStemmer from nltk.stem.porter. Next, we initialize the stemmer to the stemmer variable and then use the .stem() method to find the base form of a word.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("going"))
The output of the code above is go. Note that the stemmer only strips affixes, so an irregular form such as “gone” is not reduced to “go”. Because stemming is a simple algorithm based on removing word affixes, with no vocabulary to check its results against, it can also return strings that are not real words.
When you try the stemmer on the word “constitutes”, it gives an unintuitive result: the output is “constitut”.
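To see why affix removal alone can produce a non-word such as “constitut”, consider the toy suffix stripper below. It is deliberately much simpler than the Porter algorithm and is not how NLTK stems words; the name toy_stem and the suffix list are assumptions made for this sketch.

```python
# Toy suffix stripper (NOT the Porter algorithm): remove the first matching
# suffix, as long as at least two characters of the word remain.
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

for word in ["going", "goes", "gone", "constitutes"]:
    print(word, "->", toy_stem(word))
# going -> go
# goes -> go
# gone -> gone
# constitutes -> constitut
```

The stripper has no dictionary, so it cannot tell that “constitut” is not a word or that “gone” belongs with “go”.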
This issue can be addressed with a more complex approach to finding the base form of a word in a given context. The process is called lemmatization. Lemmatization normalizes a word based on the context and vocabulary of the text. In NLTK, you can lemmatize sentences using the WordNetLemmatizer class.
First, you need to download the wordnet resource from the NLTK downloader in the Python terminal.
Once it is downloaded, you need to import the WordNetLemmatizer class and initialize it.
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
To use the lemmatizer, use the .lemmatize() method. It takes two arguments: the word and the context. In our example, we will use “v” for context. Let us explore the context further after looking at the output of the .lemmatize() method.
You will notice that calling lem.lemmatize("constitutes", "v") correctly converts the word “constitutes” to its base form, “constitute”. You will also notice that lemmatization takes longer than stemming, as the algorithm is more complex.
Let’s check how to determine the second argument of the .lemmatize() method programmatically. NLTK has a pos_tag function which helps in determining the context of a word in a sentence. However, you first need to download the averaged_perceptron_tagger resource through the NLTK downloader.
Next, import the pos_tag function and run it on a sentence.
from nltk.tag import pos_tag
sample = "Hi, this is a nice hotel."
print(pos_tag(word_tokenize(sample)))
You will notice that the output is a list of pairs. Each pair consists of a token and its tag, which signifies the context of a token in the overall text. Notice that the tag for a punctuation mark is itself.
[('Hi', 'NNP'), (',', ','), ('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('nice', 'JJ'), ('hotel', 'NN'), ('.', '.')]
How do you decode the context of each token? Here is a full list of all tags and their corresponding meanings on the web. Notice that the tags of all nouns begin with “N”, and the tags of all verbs begin with “V”. We can use this information in the second argument of our .lemmatize() method.
def lemmatize_tokens(sentence):
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = []
    for word, tag in pos_tag(sentence):
        # Map the tag to the context expected by .lemmatize()
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_tokens.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_tokens

sample = "Legal authority constitutes all magistrates."
print(lemmatize_tokens(word_tokenize(sample)))
The output of the code above is as follows:
['Legal', 'authority', 'constitute', 'all', 'magistrate', '.']
This output is as expected: “constitutes” and “magistrates” have been converted to “constitute” and “magistrate”, respectively.
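The tag-to-context mapping can also be pulled out into a small helper. The sketch below is one possible variation, not the tutorial's own code: the function name wordnet_pos is hypothetical, it adds adjective (“a”) and adverb (“r”) cases based on the “JJ” and “RB” tag prefixes, and it falls back to “n”, the default context of WordNetLemmatizer, instead of “a”. It runs on the tagged output shown earlier, so it needs no NLTK resources.

```python
def wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"  # assumption of this sketch: fall back to the lemmatizer's default

# Tagged tokens copied from the pos_tag output shown earlier.
tagged = [('Hi', 'NNP'), (',', ','), ('this', 'DT'), ('is', 'VBZ'),
          ('a', 'DT'), ('nice', 'JJ'), ('hotel', 'NN'), ('.', '.')]

contexts = [(word, wordnet_pos(tag)) for word, tag in tagged]
print(contexts)
```

Each pair in contexts can then be passed to .lemmatize() as the word and its context.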
The next step in preparing data is to clean the data and remove anything that does not add meaning to your analysis. Broadly, we will look at removing punctuation and stop words from your analysis.
Removing punctuation is a fairly easy task. The punctuation object of the string library contains all the punctuation marks in English.
import string
print(string.punctuation)
The output of this code snippet is as follows:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
In order to remove punctuation from tokens, you can simply run:
cleaned_tokens = []
for token in tokens:
    if token in string.punctuation:
        continue  # skip punctuation tokens
    cleaned_tokens.append(token)
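One caveat is that string.punctuation is a string, so the in operator performs a substring check. A multi-character token such as “...” is therefore not recognised as punctuation. The sketch below shows a character-by-character alternative; the helper name is_punctuation and the sample tokens are assumptions made for illustration.

```python
import string

def is_punctuation(token):
    """Return True if the token is non-empty and made only of punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Substring check: "..." never appears inside string.punctuation.
print("..." in string.punctuation)  # False
print(is_punctuation("..."))        # True

sample_tokens = ['Wait', '...', 'is', 'this', 'it', '?']
kept_tokens = [t for t in sample_tokens if not is_punctuation(t)]
print(kept_tokens)
# ['Wait', 'is', 'this', 'it']
```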
Next, we will focus on removing stop words. Stop words are commonly used words in a language, like “I”, “a” and “the”, which add little meaning to text when analyzing it. We will therefore remove stop words from our analysis. First, download the stopwords resource from the NLTK downloader.
Once your download is complete, import stopwords from nltk.corpus and use the .words() method with “english” as the argument. It returns a list of 179 stop words in the English language.
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
We can combine the lemmatization example with the concepts discussed in this section to create the following function, clean_data(). Additionally, before checking whether a word is part of the stop words list, we convert it to lower case. This way, we still capture a stop word if it occurs at the start of a sentence and is capitalized.
def clean_data(tokens, stop_words = ()):
    cleaned_tokens = []
    # Create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()
    for token, tag in pos_tag(tokens):
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        token = lemmatizer.lemmatize(token, pos)
        # Keep the token only if it is neither punctuation nor a stop word
        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token)
    return cleaned_tokens

sample = "The quick brown fox jumps over the lazy dog."
stop_words = stopwords.words('english')
print(clean_data(word_tokenize(sample), stop_words))
The output of the example is as follows:
['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']
As you can see, the punctuation and stop words have been removed.
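The effect of lowercasing before the stop word check can be seen in isolation with the sketch below. The tiny stop list here is hand-picked for illustration and is not NLTK's list. It is stored in a set, which is a design choice of this sketch: membership tests on a set stay fast as the list grows.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is longer.
demo_stop_words = {"the", "a", "i", "over"}

demo_tokens = ["The", "quick", "brown", "fox", "jumps",
               "over", "the", "lazy", "dog"]

# Without lowercasing, the capitalized "The" slips through.
case_sensitive = [t for t in demo_tokens if t not in demo_stop_words]
print(case_sensitive)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

# Lowercasing only for the comparison keeps the original casing of kept tokens.
case_insensitive = [t for t in demo_tokens if t.lower() not in demo_stop_words]
print(case_insensitive)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```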
Now that you are familiar with the basic cleaning techniques in NLP, let’s try and find the frequency of words in text. For this exercise, we’ll use the text of the fairy tale, The Mouse, The Bird and The Sausage, which is available freely on Gutenberg. We’ll store the text of this fairy tale in a string, text.
First, we tokenize text and then clean it using the function clean_data that we defined above.
tokens = word_tokenize(text)
cleaned_tokens = clean_data(tokens, stop_words = stop_words)
To find the frequency distribution of words in your text, you can use the FreqDist class of NLTK. Initialize the class with the tokens as an argument. Then use the .most_common() method to find the commonly occurring terms. Let us try and find the top ten terms in this case.
from nltk import FreqDist
freq_dist = FreqDist(cleaned_tokens)
freq_dist.most_common(10)
Here are the ten most commonly occurring terms in this fairy tale.
[('bird', 15), ('sausage', 11), ('mouse', 8), ('wood', 7), ('time', 6), ('long', 5), ('make', 5), ('fly', 4), ('fetch', 4), ('water', 4)]
Unsurprisingly, the three most common terms are the three main characters in the fairy tale.
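FreqDist is built on top of Python's collections.Counter, so the same counting idea can be sketched with the standard library alone. The token list below is made up for illustration and is not taken from the fairy tale.

```python
from collections import Counter

# Hypothetical cleaned tokens, standing in for the output of clean_data().
demo_cleaned_tokens = ["bird", "sausage", "bird", "mouse", "bird", "sausage"]

counts = Counter(demo_cleaned_tokens)
top_two = counts.most_common(2)
print(top_two)
# [('bird', 3), ('sausage', 2)]
```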
Raw word frequency on its own may not be very informative when analysing text. Typically, the next step in NLP is to generate a statistic called TF-IDF (term frequency-inverse document frequency), which signifies the importance of a word in a list of documents.
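As a rough illustration of the idea, the sketch below computes one common variant of the statistic: term frequency as the count divided by the document length, multiplied by the natural logarithm of the number of documents over the number of documents containing the term. Other weighting and smoothing schemes exist. The function name tf_idf and the three tiny documents are assumptions made for this sketch.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical tokenized documents, for illustration only.
demo_docs = [
    ["bird", "fetch", "wood"],
    ["mouse", "fetch", "water"],
    ["sausage", "cook", "fetch"],
]

demo_scores = tf_idf(demo_docs)
print(demo_scores[0])
```

A term that appears in every document, such as “fetch” here, scores zero, while a term unique to one document scores highest.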
In this post, you were introduced to natural language processing in Python. You converted text to tokens, converted words to their base forms and finally, cleaned the text to remove any part which didn’t add meaning to the analysis.
Although you looked at simple NLP tasks in this tutorial, there are many techniques to explore. One may wish to perform topic modelling on textual data, where the objective is to find a common topic that a text might be talking about. A more complex task in NLP is the implementation of a sentiment analysis model to determine the feeling behind any text.
What procedures do you follow when you are given a pile of text to work with? Let us know in the comments below.
Welcome to my blog. In this article, you are going to learn the top 10 Python tips and tricks.
This video will provide you with a comprehensive and detailed knowledge of Natural Language Processing, popularly known as NLP. You will also learn about the different steps involved in processing the human language like Tokenization, Stemming, Lemmatization and more. Python, NLTK, & Jupyter Notebook are used to demonstrate the concepts.
📺 The video in this post was made by freeCodeCamp.org
The origin of the article: https://www.youtube.com/watch?v=X2vAabgKiuM&list=PLWKjhJtqVAbnqBxcdjVGgT3uVR10bzTEB&index=16
Welcome to my blog. In this article, we will learn about the Python lambda function, the map function, and the filter function.
Lambda function in Python: a lambda is a one-line anonymous function. It takes any number of arguments but can only have one expression. The Python lambda syntax is:
Syntax: x = lambda arguments : expression
Now I will show you some Python lambda function examples:
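A minimal sketch of what such examples might look like, following the syntax above. The names square, numbers, doubled and evens are illustrative assumptions, not the author's own.

```python
# A lambda with one argument and a single expression.
square = lambda x: x * x
print(square(4))  # 16

numbers = [1, 2, 3, 4, 5]

# map() applies the lambda to every item.
doubled = list(map(lambda n: n * 2, numbers))
print(doubled)  # [2, 4, 6, 8, 10]

# filter() keeps only the items for which the lambda returns True.
evens = list(filter(lambda n: n % 2 == 0, numbers))
print(evens)  # [2, 4]
```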
Python is awesome. It’s one of the easiest languages, with simple and intuitive syntax. But wait, have you ever thought that there might be ways to write your Python code more simply?
In this tutorial, you’re going to learn a variety of Python tricks that you can use to write your Python code in a more readable and efficient way like a pro.
Swapping values in Python
Instead of creating a temporary variable to hold one of the values while swapping, you can do this instead:
>>> FirstName = "kalebu"
>>> LastName = "Jordan"
>>> FirstName, LastName = LastName, FirstName
>>> print(FirstName, LastName)
Jordan kalebu
Today you’re going to learn how to use Python programming in a way that can ultimately save a lot of space on your drive by removing all the duplicates.
In many situations you may find yourself with duplicate files on your disk, but tracking and checking them manually can be tedious.
Here’s a solution
Instead of searching your disk by hand to see if there is a duplicate, you can automate the process by writing a program that recursively walks through the disk and removes all the duplicates it finds. That’s what this article is about.
But how do we do it?
If we were to read each whole file and compare it to the rest of the files recursively through the given directory, it would take a very long time. So how do we do it?
The answer is hashing. With hashing, we can generate a string of letters and numbers that acts as the identity of a given file, and if we find any other file with the same identity, we delete it.
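The idea can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the article's own program: SHA-256 is chosen here only as an example algorithm, the function names file_digest and find_duplicates are hypothetical, and the sketch only reports duplicates instead of deleting them. The demo runs inside a temporary directory so that no real files are touched.

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def find_duplicates(root):
    """Return paths whose content matches an earlier file under root."""
    seen = {}
    duplicates = []
    for folder, _subdirs, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(folder, name)
            digest = file_digest(path)
            if digest in seen:
                duplicates.append(path)
            else:
                seen[digest] = path
    return duplicates

# Demo on throwaway files: a.txt and b.txt share the same content.
with tempfile.TemporaryDirectory() as root:
    for name, data in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"other")]:
        with open(os.path.join(root, name), "wb") as handle:
            handle.write(data)
    duplicate_names = [os.path.basename(p) for p in find_duplicates(root)]

print(duplicate_names)  # ['b.txt']
```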
There’s a variety of hashing algorithms out there such as