On the face of it, Natural Language Processing (NLP) and time series analysis appear to have little in common.

In the context of data science, the main reasons for analysing text are typically as follows:

  1. Text Summarization (i.e. condensing a text so that its main points are easier to understand)
  2. Text Classification (e.g. classifying text based on certain features, such as detecting spam emails)
  3. Sentiment Analysis (using text classification to determine the sentiment of a particular group on a certain topic, e.g. book reviews)
  4. Text Generation (i.e. generating new text on a particular topic using machine learning techniques)

When Text Classification Models Can Fail

Here, I want to focus on text classification and sentiment analysis in particular.

Let’s consider an example. Suppose that one built a sentiment analysis model in 2019 to gauge sentiment on travel. Data might have been collated from a variety of social networks, such as Twitter and Reddit.

Chances are that sentiment on travel would still have been quite positive, notwithstanding a degree of concern about the impact of travel on climate change.

However, 2020 is a vastly different landscape for travel (or lack thereof), with air passenger numbers having plummeted as a result of the COVID-19 pandemic.

As a result, any sentiment model trained on 2019 data would likely perform quite poorly if run today. Travel restrictions, virus fears, and economic concerns are likely to be under-represented in any corpus used to train a text classification model for travel sentiment. Moreover, the term “COVID-19” did not exist before this year, so such a model would have no basis for assigning a negative sentiment to it in the context of travel.
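
The vocabulary problem described above can be illustrated with a toy example. The sketch below is not a realistic sentiment model: the "2019" training posts, the 2020 post, and the helper names `train_word_scores` and `predict` are all invented for illustration. It simply scores each word by how often it appeared in positive versus negative training posts, and skips any word it has never seen.

```python
from collections import Counter

# Hypothetical toy corpus standing in for 2019 travel posts (not real data).
train_2019 = [
    ("loved my flight to lisbon great holiday", "pos"),
    ("amazing beach trip cannot wait to travel again", "pos"),
    ("flight delayed again terrible airline service", "neg"),
    ("lost luggage and rude staff awful trip", "neg"),
]


def train_word_scores(examples):
    """Score each word as (positive occurrences - negative occurrences)."""
    scores = Counter()
    for text, label in examples:
        for word in text.split():
            scores[word] += 1 if label == "pos" else -1
    return dict(scores)


def predict(text, scores):
    """Sum the scores of known words; words outside the vocabulary are skipped."""
    words = text.split()
    unknown = [w for w in words if w not in scores]
    total = sum(scores.get(w, 0) for w in words)
    label = "pos" if total > 0 else "neg" if total < 0 else "neutral"
    return label, unknown


word_scores = train_word_scores(train_2019)

# Hypothetical 2020 post: clearly negative to a human reader.
post_2020 = "holiday flight cancelled covid-19 lockdown everywhere"
label_2020, unseen = predict(post_2020, word_scores)

print(label_2020)  # label assigned by the 2019-trained scorer
print(unseen)      # words the scorer had never seen during training
```

In this toy setup the 2020 post is labelled positive: "holiday" is the only word carrying a non-zero score, while "cancelled", "covid-19", and "lockdown" are simply ignored because they never appeared in the training data. A real classifier is more sophisticated, but the same blind spot applies to any term that was absent from its training corpus.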
