Exploratory text analysis in Python

Why do we do exploratory data analysis before we build a model? I would say ‘to understand the data better, so that we can preprocess it in a suitable way and choose an appropriate modelling technique’. This need to understand the data is just as relevant when working with text. This post is the first of three sequential articles on the steps to build a sentiment classifier. In this post, we will look at one way to conduct exploratory data analysis on text, or exploratory text analysis for brevity.

Photo by Andrew Neel on Unsplash

Before we dive in, let’s take a step back and look at the bigger picture. The CRISP-DM methodology outlines the process flow for a successful data science project. The diagram below shows stages 2 to 4 of a data science project. In the data understanding stage, exploratory data analysis is one of the key tasks.

Extract from CRISP-DM process flow

When working on a data science project, it is not unusual to go back and forth between stages rather than progress linearly. This is because ideas and questions come up in later stages, and you want to go back a stage or two to try out the idea or find the answer. The pink arrows are not part of the official CRISP-DM process; however, I think they are often necessary. In fact, we will do a bit of data preparation in this post for the purpose of exploratory text analysis. For those interested in learning more about CRISP-DM, this is a nice short introduction and this resource provides a more detailed explanation.
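
To make the idea of 'a bit of data preparation for exploratory text analysis' concrete, here is a minimal sketch using only the Python standard library. The sample reviews, the `tokenise` helper and the regular expression are assumptions made up for illustration; they are not the dataset or the method used later in this series.

```python
import re
from collections import Counter

# Hypothetical sample reviews, invented for this illustration only.
reviews = [
    "I loved this movie. The acting was great!",
    "Terrible plot, and the acting was worse.",
    "Great soundtrack, great cast, weak ending.",
]


def tokenise(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


# Light data preparation: turn each review into a list of tokens.
tokens_per_review = [tokenise(review) for review in reviews]
all_tokens = [token for tokens in tokens_per_review for token in tokens]

# A few simple summary statistics to get a feel for the text.
n_tokens = len(all_tokens)            # total number of tokens
n_unique = len(set(all_tokens))       # vocabulary size
avg_len = n_tokens / len(reviews)     # average tokens per review
top_words = Counter(all_tokens).most_common(3)

print(f"Tokens: {n_tokens}, unique: {n_unique}, average per review: {avg_len:.1f}")
print("Most common:", top_words)
```

Even a rough summary like this can raise questions that send you back a stage, for example whether very frequent words such as 'the' should be removed before modelling.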

towardsdatascience.com