In this article, you are going to see different techniques for removing stop words from strings in Python. Stop words are words in natural language that carry very little meaning, such as “is”, “an”, and “the”. Search engines and other enterprise indexing platforms often filter out stop words while fetching results from the database against user queries.
Stop words are often removed from the text before training deep learning and machine learning models since stop words occur in abundance, hence providing little to no unique information that can be used for classification or clustering.
With the Python programming language, you have a myriad of options for removing stop words from strings. You can either use one of the several natural language processing libraries such as NLTK, SpaCy, Gensim, or TextBlob, or, if you need full control over the stop words that you want to remove, you can write your own custom script.
In this article you will see a number of different approaches, depending on the NLP library you’re using.
The NLTK library is one of the oldest and most commonly used Python libraries for Natural Language Processing. NLTK supports stop word removal, and you can find the list of stop words in the corpus module. To remove stop words from a sentence, you can divide your text into words and then remove a word if it exists in the list of stop words provided by NLTK.
Let’s see a simple example:
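import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads: the stop word lists and the tokenizer data used by word_tokenize()
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
# Restrict the lookup to English; stopwords.words() with no argument returns every language
tokens_without_sw = [word for word in text_tokens if word not in stopwords.words('english')]
print(tokens_without_sw)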
In the script above, we first import the stopwords collection from the nltk.corpus module. Next, we import the word_tokenize() function from the nltk.tokenize module. We then create a variable text, which contains a simple sentence. The sentence in the text variable is tokenized (divided into words) using the word_tokenize() function. Next, we iterate through all the words in the text_tokens list and check whether each word exists in the stop words collection. If the word doesn’t exist in the stop words collection, it is appended to the tokens_without_sw list. The tokens_without_sw list is then printed.
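The comparison above is case-sensitive, and a list is scanned from the start for every token, so a capitalised "The" would slip past an all-lowercase stop word list. A minimal sketch of one way to tighten this, assuming a small hand-picked SAMPLE_STOPWORDS set standing in for NLTK's English list and a hypothetical helper named remove_stop_words:

```python
# SAMPLE_STOPWORDS is a hypothetical stand-in for stopwords.words('english');
# in real use you would build the set once from the NLTK list instead.
SAMPLE_STOPWORDS = {"the", "to", "he", "is", "not", "too", "of"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    # Build the set once: membership tests on a set take constant time on average
    stop_set = set(stop_words)
    return [tok for tok in tokens if tok.lower() not in stop_set]


tokens = ["The", "coach", "is", "not", "fond", "of", "tennis", "."]
print(remove_stop_words(tokens, SAMPLE_STOPWORDS))
```

Lowercasing only for the comparison keeps the original capitalisation of the words that survive.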
Here is how the sentence looks without the stop words:
['Nick', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']
You can see that the words to, he, is, not, too, and of have been removed from the sentence.
You can join the list of above words to create a sentence without stop words, as shown below:
filtered_sentence = (" ").join(tokens_without_sw)
print(filtered_sentence)
Here is the output:
Nick likes play football , however fond tennis .
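Joining on a single space leaves a gap before the comma and the full stop. If that matters for your use case, a rough sketch of a joiner that attaches punctuation-only tokens to the preceding word could look like this. The helper name join_tokens is hypothetical, and the rule is deliberately naive: an opening bracket or quote would also be attached to the word before it.

```python
import string

PUNCTUATION = set(string.punctuation)


def join_tokens(tokens):
    """Join tokens with spaces, attaching punctuation-only tokens to the previous token."""
    sentence = ""
    for tok in tokens:
        is_punct = all(ch in PUNCTUATION for ch in tok)
        # Add a separating space only before ordinary words, never at the start
        if sentence and not is_punct:
            sentence += " "
        sentence += tok
    return sentence


tokens_without_sw = ['Nick', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']
print(join_tokens(tokens_without_sw))
```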
You can add or remove stop words as per your choice to the existing collection of stop words in NLTK. Before removing or adding stop words in NLTK, let’s see the list of all the English stop words supported by NLTK:
print(stopwords.words('english'))
Output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
To add a word to the NLTK stop words collection, first create an object from the stopwords.words('english') list. Next, use the append() method on the list to add any word to it.
The following script adds the word play to the NLTK stop word collection. Again, we remove all the stop words from our text variable to see whether the word play is removed.
all_stopwords = stopwords.words('english')
all_stopwords.append('play')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
Output:
['Nick', 'likes', 'football', ',', 'however', 'fond', 'tennis', '.']
The output shows that the word play has been removed.
You can also add a list of words to the stop word list using the extend() method, as shown below:
sw_list = ['likes','play']
all_stopwords.extend(sw_list)
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
The script above adds the two words likes and play to the stop word list. In the output, you will not see these two words, as shown below:
Output:
['Nick', 'football', ',', 'however', 'fond', 'tennis', '.']
Since stopwords.words('english') is merely a list of items, you can remove items from this list like any other list. The simplest way to do so is via the remove() method. This is helpful when your application needs a stop word to be kept. For example, you may need to keep the word not in a sentence to know when a statement is being negated.
The following script removes the stop word not from the default list of stop words in NLTK:
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
Output:
['Nick', 'likes', 'play', 'football', ',', 'however', 'not', 'fond', 'tennis', '.']
From the output, you can see that the word not has not been removed from the input sentence.
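Additions and removals can also be combined in one step. Note that list.remove() raises a ValueError when the word is not in the list, so set arithmetic can be a more forgiving option. The sketch below assumes a hypothetical helper customize_stop_words and uses a short hand-written excerpt in place of stopwords.words('english'):

```python
def customize_stop_words(base, add=(), keep=()):
    """Return a new set: base plus the words in add, minus the words in keep."""
    return (set(base) | set(add)) - set(keep)


# Hypothetical excerpt standing in for stopwords.words('english')
base_stopwords = ["to", "he", "is", "not", "too", "of"]
custom = customize_stop_words(base_stopwords, add=["likes", "play"], keep=["not"])

tokens = ["Nick", "likes", "to", "play", "football", ",", "however", "he",
          "is", "not", "too", "fond", "of", "tennis", "."]
filtered = [tok for tok in tokens if tok not in custom]
print(filtered)
```

Because the helper returns a new set, the base list is left as it was and can be reused for a different customisation.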
The Gensim library is another extremely useful library for removing stop words from a string in Python. All you have to do is import the remove_stopwords() function from the gensim.parsing.preprocessing module. Next, pass the sentence from which you want to remove stop words to the remove_stopwords() function, which returns the text string without the stop words.
Let’s take a look at a simple example of how to remove stop words via the Gensim library.
from gensim.parsing.preprocessing import remove_stopwords
text = "Nick likes to play football, however he is not too fond of tennis."
filtered_sentence = remove_stopwords(text)
print(filtered_sentence)
Output:
Nick likes play football, fond tennis.
It is important to mention that the output after removing stop words using the NLTK and Gensim libraries is different. For example, the Gensim library considers the word however to be a stop word and removes it, while NLTK does not. This shows that there is no hard and fast rule as to what a stop word is and what it isn’t. It all depends upon the task that you are going to perform.
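Set operations are a quick way to see where two stop word collections disagree. The sketch below uses two short excerpts copied by hand from the NLTK and Gensim lists printed in this article, not the full collections; with the libraries installed you could pass the real lists instead.

```python
# Hand-copied excerpts of the two lists shown in this article, not the full collections
nltk_excerpt = {"to", "he", "is", "not", "too", "of", "ain", "shan"}
gensim_excerpt = {"to", "he", "is", "not", "too", "of", "however", "computer"}

only_in_gensim = sorted(gensim_excerpt - nltk_excerpt)   # removed by Gensim only
only_in_nltk = sorted(nltk_excerpt - gensim_excerpt)     # removed by NLTK only
shared = sorted(nltk_excerpt & gensim_excerpt)           # removed by both

print("Only in Gensim excerpt:", only_in_gensim)
print("Only in NLTK excerpt:", only_in_nltk)
print("In both excerpts:", shared)
```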
Below, you will also see how to add or remove stop words in Gensim’s existing collection of stop words.
Let’s first take a look at the stop words in Python’s Gensim library:
import gensim
all_stopwords = gensim.parsing.preprocessing.STOPWORDS
print(all_stopwords)
Output:
frozenset({'her', 'during', 'among', 'thereafter', 'only', 'hers', 'in', 'none', 'with', 'un', 'put', 'hence', 'each', 'would', 'have', 'to', 'itself', 'that', 'seeming', 'hereupon', 'someone', 'eight', 'she', 'forty', 'much', 'throughout', 'less', 'was', 'interest', 'elsewhere', 'already', 'whatever', 'or', 'seem', 'fire', 'however', 'keep', 'detail', 'both', 'yourselves', 'indeed', 'enough', 'too', 'us', 'wherein', 'himself', 'behind', 'everything', 'part', 'made', 'thereupon', 'for', 'nor', 'before', 'front', 'sincere', 'really', 'than', 'alone', 'doing', 'amongst', 'across', 'him', 'another', 'some', 'whoever', 'four', 'other', 'latterly', 'off', 'sometime', 'above', 'often', 'herein', 'am', 'whereby', 'although', 'who', 'should', 'amount', 'anyway', 'else', 'upon', 'this', 'when', 'we', 'few', 'anywhere', 'will', 'though', 'being', 'fill', 'used', 'full', 'thru', 'call', 'whereafter', 'various', 'has', 'same', 'former', 'whereas', 'what', 'had', 'mostly', 'onto', 'go', 'could', 'yourself', 'meanwhile', 'beyond', 'beside', 'ours', 'side', 'our', 'five', 'nobody', 'herself', 'is', 'ever', 'they', 'here', 'eleven', 'fifty', 'therefore', 'nothing', 'not', 'mill', 'without', 'whence', 'get', 'whither', 'then', 'no', 'own', 'many', 'anything', 'etc', 'make', 'from', 'against', 'ltd', 'next', 'afterwards', 'unless', 'while', 'thin', 'beforehand', 'by', 'amoungst', 'you', 'third', 'as', 'those', 'done', 'becoming', 'say', 'either', 'doesn', 'twenty', 'his', 'yet', 'latter', 'somehow', 'are', 'these', 'mine', 'under', 'take', 'whose', 'others', 'over', 'perhaps', 'thence', 'does', 'where', 'two', 'always', 'your', 'wherever', 'became', 'which', 'about', 'but', 'towards', 'still', 'rather', 'quite', 'whether', 'somewhere', 'might', 'do', 'bottom', 'until', 'km', 'yours', 'serious', 'find', 'please', 'hasnt', 'otherwise', 'six', 'toward', 'sometimes', 'of', 'fifteen', 'eg', 'just', 'a', 'me', 'describe', 'why', 'an', 'and', 'may', 'within', 'kg', 'con', 're', 
'nevertheless', 'through', 'very', 'anyhow', 'down', 'nowhere', 'now', 'it', 'cant', 'de', 'move', 'hereby', 'how', 'found', 'whom', 'were', 'together', 'again', 'moreover', 'first', 'never', 'below', 'between', 'computer', 'ten', 'into', 'see', 'everywhere', 'there', 'neither', 'every', 'couldnt', 'up', 'several', 'the', 'i', 'becomes', 'don', 'ie', 'been', 'whereupon', 'seemed', 'most', 'noone', 'whole', 'must', 'cannot', 'per', 'my', 'thereby', 'so', 'he', 'name', 'co', 'its', 'everyone', 'if', 'become', 'thick', 'thus', 'regarding', 'didn', 'give', 'all', 'show', 'any', 'using', 'on', 'further', 'around', 'back', 'least', 'since', 'anyone', 'once', 'can', 'bill', 'hereafter', 'be', 'seems', 'their', 'myself', 'nine', 'also', 'system', 'at', 'more', 'out', 'twelve', 'therein', 'almost', 'except', 'last', 'did', 'something', 'besides', 'via', 'whenever', 'formerly', 'cry', 'one', 'hundred', 'sixty', 'after', 'well', 'them', 'namely', 'empty', 'three', 'even', 'along', 'because', 'ourselves', 'such', 'top', 'due', 'inc', 'themselves'})
You can see that Gensim’s default collection of stop words is considerably larger than NLTK’s. Also, Gensim stores its default stop words in a frozen set object.
To access the list of Gensim stop words, you need to import the frozen set STOPWORDS from the gensim.parsing.preprocessing module. A frozen set in Python is an immutable set: you cannot add or remove elements in place. Hence, to add an element, you have to call the union() method on the frozen set and pass it the set of new stop words. The union() method returns a new set which contains your newly added stop words, as shown below.
The following script adds likes and play to the list of stop words in Gensim:
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize import word_tokenize

# union() returns a new frozenset; STOPWORDS itself is left unchanged
all_stopwords_gensim = STOPWORDS.union({'likes', 'play'})
text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords_gensim]
print(tokens_without_sw)
Output:
['Nick', 'football', ',', 'fond', 'tennis', '.']
From the output above, you can see that the words likes and play have been treated as stop words and consequently have been removed from the input sentence.
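The immutability of a frozen set can be checked with plain Python, without Gensim installed. In this small sketch, base is a hypothetical three-word stand-in for Gensim's STOPWORDS:

```python
base = frozenset({"to", "he", "is"})  # hypothetical stand-in for Gensim's STOPWORDS

try:
    base.add("play")  # frozenset has no add() method, so this raises AttributeError
    mutated = True
except AttributeError:
    mutated = False

extended = base.union({"likes", "play"})  # a new frozenset; base is not modified

print(mutated)
print(sorted(base))
print(sorted(extended))
```

The original set keeps its three words, and the extended copy holds all five.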
To remove stop words from Gensim’s list of stop words, you have to call the difference() method on the frozen set object that contains the stop words. Pass difference() a set of the stop words that you want to remove; it returns a new set which contains all the stop words except those you passed in.
The following script removes the word not from the set of stop words in Gensim:
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize import word_tokenize

sw_list = {"not"}
# difference() returns a new frozenset without the words in sw_list
all_stopwords_gensim = STOPWORDS.difference(sw_list)
text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords_gensim]
print(tokens_without_sw)
Output:
['Nick', 'likes', 'play', 'football', ',', 'not', 'fond', 'tennis', '.']
Since the word not has now been removed from the stop word set, you can see that it has not been removed from the input sentence after stop word removal.
The SpaCy library is yet another extremely useful library for natural language processing in Python.
To install SpaCy, you have to execute the following script on your command terminal:
$ pip install -U spacy
Once the library is downloaded, you also need to download the language model. Several models exist in SpaCy for different languages. We will be installing the English language model. Execute the following command in your terminal:
$ python -m spacy download en_core_web_sm
Once the language model is downloaded, you can remove stop words from text using SpaCy. Look at the following script:
import spacy
from nltk.tokenize import word_tokenize  # the tokens still come from NLTK's tokenizer

sp = spacy.load('en_core_web_sm')
all_stopwords = sp.Defaults.stop_words
text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]
print(tokens_without_sw)
In the script above we first load the language model and store it in the sp variable. The sp.Defaults.stop_words attribute is a set of default stop words for the English language model in SpaCy. Next, we simply iterate through each word in the input text, and if the word exists in the stop word set of the SpaCy language model, the word is removed.
Output:
['Nick', 'likes', 'play', 'football', ',', 'fond', 'tennis', '.']
As with the other NLP libraries, you can also add or remove stop words from the default stop word list in SpaCy. But before that, we will see a list of all the existing stop words in SpaCy.
print(len(all_stopwords))
print(all_stopwords)
Output:
326
{'whence', 'here', 'show', 'were', 'why', 'n’t', 'the', 'whereupon', 'not', 'more', 'how', 'eight', 'indeed', 'i', 'only', 'via', 'nine', 're', 'themselves', 'almost', 'to', 'already', 'front', 'least', 'becomes', 'thereby', 'doing', 'her', 'together', 'be', 'often', 'then', 'quite', 'less', 'many', 'they', 'ourselves', 'take', 'its', 'yours', 'each', 'would', 'may', 'namely', 'do', 'whose', 'whether', 'side', 'both', 'what', 'between', 'toward', 'our', 'whereby', "'m", 'formerly', 'myself', 'had', 'really', 'call', 'keep', "'re", 'hereupon', 'can', 'their', 'eleven', '’m', 'even', 'around', 'twenty', 'mostly', 'did', 'at', 'an', 'seems', 'serious', 'against', "n't", 'except', 'has', 'five', 'he', 'last', '‘ve', 'because', 'we', 'himself', 'yet', 'something', 'somehow', '‘m', 'towards', 'his', 'six', 'anywhere', 'us', '‘d', 'thru', 'thus', 'which', 'everything', 'become', 'herein', 'one', 'in', 'although', 'sometime', 'give', 'cannot', 'besides', 'across', 'noone', 'ever', 'that', 'over', 'among', 'during', 'however', 'when', 'sometimes', 'still', 'seemed', 'get', "'ve", 'him', 'with', 'part', 'beyond', 'everyone', 'same', 'this', 'latterly', 'no', 'regarding', 'elsewhere', 'others', 'moreover', 'else', 'back', 'alone', 'somewhere', 'are', 'will', 'beforehand', 'ten', 'very', 'most', 'three', 'former', '’re', 'otherwise', 'several', 'also', 'whatever', 'am', 'becoming', 'beside', '’s', 'nothing', 'some', 'since', 'thence', 'anyway', 'out', 'up', 'well', 'it', 'various', 'four', 'top', '‘s', 'than', 'under', 'might', 'could', 'by', 'too', 'and', 'whom', '‘ll', 'say', 'therefore', "'s", 'other', 'throughout', 'became', 'your', 'put', 'per', "'ll", 'fifteen', 'must', 'before', 'whenever', 'anyone', 'without', 'does', 'was', 'where', 'thereafter', "'d", 'another', 'yourselves', 'n‘t', 'see', 'go', 'wherever', 'just', 'seeming', 'hence', 'full', 'whereafter', 'bottom', 'whole', 'own', 'empty', 'due', 'behind', 'while', 'onto', 'wherein', 'off', 'again', 'a', 'two', 
'above', 'therein', 'sixty', 'those', 'whereas', 'using', 'latter', 'used', 'my', 'herself', 'hers', 'or', 'neither', 'forty', 'thereupon', 'now', 'after', 'yourself', 'whither', 'rather', 'once', 'from', 'until', 'anything', 'few', 'into', 'such', 'being', 'make', 'mine', 'please', 'along', 'hundred', 'should', 'below', 'third', 'unless', 'upon', 'perhaps', 'ours', 'but', 'never', 'whoever', 'fifty', 'any', 'all', 'nobody', 'there', 'have', 'anyhow', 'of', 'seem', 'down', 'is', 'every', '’ll', 'much', 'none', 'further', 'me', 'who', 'nevertheless', 'about', 'everywhere', 'name', 'enough', '’d', 'next', 'meanwhile', 'though', 'through', 'on', 'first', 'been', 'hereby', 'if', 'move', 'so', 'either', 'amongst', 'for', 'twelve', 'nor', 'she', 'always', 'these', 'as', '’ve', 'amount', '‘re', 'someone', 'afterwards', 'you', 'nowhere', 'itself', 'done', 'hereafter', 'within', 'made', 'ca', 'them'}
The output shows that there are 326 stop words in the default list of stop words in the SpaCy library.
The SpaCy stop word list is basically a set of strings. You can add a new word to the set like you would add any new item to a set.
Look at the following script, in which we add the word tennis to the existing list of stop words in SpaCy:
import spacy
sp = spacy.load('en_core_web_sm')
all_stopwords = sp.Defaults.stop_words
all_stopwords.add("tennis")
text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
Output:
['Nick', 'likes', 'play', 'football', ',', 'fond', '.']
The output shows that the word tennis has been removed from the input sentence.
You can also add multiple words to the list of stop words in SpaCy as shown below. The following script adds likes and tennis to the list of stop words in SpaCy:
import spacy
sp = spacy.load('en_core_web_sm')
all_stopwords = sp.Defaults.stop_words
all_stopwords |= {"likes", "tennis"}
text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
Output:
['Nick', 'play', 'football', ',', 'fond', '.']
The output shows that the words likes and tennis have both been removed from the input sentence.
To remove a word from the set of stop words in SpaCy, you can pass the word to the remove() method of the set.
The following script removes the word not from the set of stop words in SpaCy:
import spacy
sp = spacy.load('en_core_web_sm')
all_stopwords = sp.Defaults.stop_words
all_stopwords.remove('not')
text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
Output:
['Nick', 'play', 'football', ',', 'not', 'fond', '.']
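In the output, you can see that the word not has not been removed from the input sentence. The words likes and tennis are still missing, which suggests that the additions made by the earlier scripts remain in effect within the same session.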
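If you would rather not modify the shared defaults in place, one option is to customise a copy. This is a minimal sketch that uses a small hypothetical default_stop_words set in place of sp.Defaults.stop_words:

```python
# Hypothetical stand-in for sp.Defaults.stop_words, which is a plain mutable set
default_stop_words = {"to", "he", "is", "not", "too", "of", "however"}

# Copy first, then customise the copy; the defaults stay untouched
custom_stop_words = set(default_stop_words)
custom_stop_words.add("tennis")
custom_stop_words.discard("not")     # discard() does not raise if the word is missing
custom_stop_words.discard("absent")  # no error, unlike remove()

tokens = ["Nick", "likes", "to", "play", "football", ",", "however", "he",
          "is", "not", "too", "fond", "of", "tennis", "."]
filtered = [tok for tok in tokens if tok not in custom_stop_words]
print(filtered)
```

With this pattern, later code that reads the default set still sees the original words.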
In the previous sections, you saw how various libraries can be used to remove stop words from a string in Python. If you want full control over stop word removal, you can write your own script to remove stop words from your string.
The first step in this regard is to define a list of words that you want treated as stop words. Let’s create a list of some of the most commonly used stop words:
my_stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Next, we will define a function that will accept a string as a parameter and will return the sentence without the stop words:
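from nltk.tokenize import word_tokenize

def remove_mystopwords(sentence):
    # word_tokenize splits punctuation off the words; sentence.split(" ") would leave "football," intact
    tokens = word_tokenize(sentence)
    tokens_filtered = [word for word in tokens if word not in my_stopwords]
    return " ".join(tokens_filtered)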
Let’s now try to remove stop words from a sample sentence:
text = "Nick likes to play football, however he is not too fond of tennis."
filtered_text = remove_mystopwords(text)
print(filtered_text)
Output:
Nick likes play football , however fond tennis .
You can see that the stop words that exist in the my_stopwords list have been removed from the input sentence.
Since my_stopwords is a simple list of strings, you can add words to it or remove words from it. For example, let’s add the word football to the my_stopwords list and again remove stop words from the input sentence:
my_stopwords.append("football")
text = "Nick likes to play football, however he is not too fond of tennis."
filtered_text = remove_mystopwords(text)
print(filtered_text)
Output:
Nick likes play , however fond tennis .
The output now shows that the word football is also removed from the input sentence, since we added it to our list of custom stop words.
Let’s now remove the word football from the list of stop words and again apply stop word removal to our input sentence:
my_stopwords.remove("football")
text = "Nick likes to play football, however he is not too fond of tennis."
filtered_text = remove_mystopwords(text)
print(filtered_text)
Output:
Nick likes play football , however fond tennis .
The word football is no longer removed, since we took it out of our list of stop words.
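If you want the custom script to have no third-party dependencies at all, the tokenization step can be approximated with a regular expression from the standard library. This is a rough sketch, not a replacement for a real tokenizer: the names MY_STOPWORDS and remove_stopwords_plain are hypothetical, the stop word set is a short sample, and contractions such as "don't" would be split into three tokens.

```python
import re

# Hypothetical short stop word set; substitute your own list
MY_STOPWORDS = {"to", "he", "is", "not", "too", "of"}


def remove_stopwords_plain(sentence, stop_words=MY_STOPWORDS):
    """Drop stop words using a regex tokenizer instead of an NLP library."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return " ".join(tok for tok in tokens if tok.lower() not in stop_words)


text = "Nick likes to play football, however he is not too fond of tennis."
print(remove_stopwords_plain(text))
```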
In this article, you saw different libraries that can be used to remove stop words from a string in Python. You also saw how to add or remove stop words from lists of the default stop words provided by various libraries. At the end, we showed how this can be done if you have a custom script used for removing stop words.
Originally published by Usman Malik at https://stackabuse.com