Suppose you have a large text dataset to which you want to apply some non-trivial NLP transformations, such as stopword removal followed by lemmatization (i.e. reducing words to their root forms). spaCy is an industrial-strength NLP library designed for just such a task.
In this post, the New York Times dataset is used to showcase how to significantly speed up a spaCy NLP pipeline. The goal is to take in an article's text and speedily return a list of lemmas, with unnecessary words (i.e. stopwords) removed.
Pandas DataFrames provide a convenient interface for working with tabular data of this nature, since the spaCy NLP methods can be applied directly to the relevant column of the DataFrame. First, the news data is obtained by running the preprocessing notebook (./data/preprocessing.ipynb), which processes the raw text file downloaded from Kaggle and performs some basic cleaning on it. This step generates a file containing the tabular data (stored as nytimes.tsv). A curated stopword file is also provided in the same directory.
Since we will not be doing any specialized tasks such as dependency parsing or named entity recognition in this exercise, these components are disabled when loading the spaCy model.
Tip: spaCy has a sentencizer component that can be plugged into a blank pipeline. The sentencizer simply performs tokenization and sentence boundary detection, after which lemmas can be extracted as token properties.
import spacy

# Load the small English model, disabling components we don't need
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])
# spaCy v2 syntax; in spaCy v3+, use nlp.add_pipe('sentencizer') instead
nlp.add_pipe(nlp.create_pipe('sentencizer'))
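As a quick sanity check, the stripped-down pipeline can be run on a short piece of text to confirm that lemmas are available as token attributes. The sample sentence below is purely illustrative, not from the dataset:

# Hypothetical sample text for a quick sanity check
doc = nlp("The cats were chasing mice across the yards.")
# Lemmas are exposed as the .lemma_ attribute on each token
print([token.lemma_ for token in doc if not token.is_punct])
# expect lowercase root forms such as 'cat', 'chase', 'mouse'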
A function is defined to read the stopwords from a text file and convert them to a Python set (for efficient membership lookup).
def get_stopwords():
    "Return a set of stopwords read in from a file."
    # stopwordfile is the path to the curated stopword file (defined elsewhere)
    with open(stopwordfile) as f:
        # A set gives O(1) membership checks, unlike a list
        stopwords = set(line.strip("\n") for line in f)
    return stopwords

stopwords = get_stopwords()
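With the stopword set in hand, lemma extraction and stopword filtering can be combined in a small helper. The sketch below is one way to do it; the function name lemmatize and the choice to lowercase lemmas before comparison are illustrative, not fixed by the snippets above:

def lemmatize(text):
    "Return lowercased lemmas from a text, with stopwords and punctuation removed."
    doc = nlp(text)
    # Keep a lemma only if it is not punctuation and not in the stopword set
    return [tok.lemma_.lower() for tok in doc
            if not tok.is_punct and tok.lemma_.lower() not in stopwords]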
The pre-processed version of the NYT news dataset is read in as a Pandas DataFrame. The columns are named date, headline and content; the text in the content column is what will be preprocessed to remove stopwords and generate token lemmas.
import pandas as pd

def read_data(inputfile):
    "Read in a tab-separated file with date, headline and news content"
    df = pd.read_csv(inputfile, sep='\t', header=None,
                     names=['date', 'headline', 'content'])
    df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d")
    return df

# inputfile is the path to the nytimes.tsv file (defined elsewhere)
df = read_data(inputfile)
df.head(3)
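The helper above can then be applied to the content column. Applying it row by row with Pandas' apply is the straightforward baseline that the rest of the post sets out to speed up; storing the result in a new column (named preproc here for illustration) is an assumption of this sketch:

# Baseline: apply the (hypothetical) lemmatize helper row by row
df['preproc'] = df['content'].apply(lemmatize)
df[['content', 'preproc']].head(3)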
#spacy #machine-learning #data-science #nlp #programming #data-analysis