Suppose we were asked to build an NLP application at an organization. How would we approach it? We would normally walk through the requirements, break the problem down into several sub-problems, and then develop a step-by-step procedure to solve them. Since language processing is involved, we would also list the forms of text processing needed at each step.
This step-by-step processing of text is known as an NLP pipeline. It is the series of steps involved in building any NLP model.
The main components of a generic pipeline for modern, data-driven NLP system development are data acquisition, text cleaning, pre-processing, feature engineering, modeling, evaluation, deployment, and monitoring and model updating.
NLP pipeline
The first step in developing any NLP system is to collect data relevant to the given task. Even if we're building a rule-based system, we still need some data to design and test our rules. The data we get is seldom clean, and this is where text cleaning comes into play. Even after cleaning, text data often has a lot of variation and needs to be converted into a canonical form, that is, a single, pre-defined standard representation. This is done in the pre-processing step. This is followed by feature engineering, where we carve out the indicators that are most suitable for the task at hand.
These indicators, or features, are converted into a format that modeling algorithms can understand. Then comes the modeling and evaluation phase, where we build one or more models and compare them using one or more relevant evaluation metrics. Once the best of the evaluated models is chosen, we move on to deploying it in production. Finally, we regularly monitor the performance of the model and, if need be, update it to maintain that performance.
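To make the stages concrete, here is a minimal sketch of the pipeline up to evaluation, using only the Python standard library. Everything in it is an assumption made for illustration: the sample tickets, the keyword lists, and the function names (`clean`, `preprocess`, `featurize`, `predict`, `run_pipeline`, `accuracy`) are hypothetical, and the "model" is a simple rule-based keyword counter rather than a trained classifier.

```python
import html
import re
from collections import Counter

# Data acquisition: a tiny, made-up set of labeled tickets (illustrative only).
RAW_TICKETS = [
    ("<p>What is the PRICE of the premium plan?</p>", "sales"),
    ("I'd like a quote for 50 licenses &amp; a demo", "sales"),
    ("The app crashes when I upload a file!!", "support"),
    ("<div>Login error: cannot reset my password</div>", "support"),
]


def clean(text):
    """Text cleaning: decode HTML entities, then replace tags with spaces."""
    text = html.unescape(text)
    return re.sub(r"<[^>]+>", " ", text)


def preprocess(text):
    """Pre-processing: lowercase and keep alphabetic tokens (a canonical form)."""
    return re.findall(r"[a-z']+", text.lower())


def featurize(tokens):
    """Feature engineering: bag-of-words counts."""
    return Counter(tokens)


# Modeling: hypothetical keyword lists standing in for a learned model.
SALES_WORDS = {"price", "quote", "demo", "licenses", "plan"}
SUPPORT_WORDS = {"crashes", "error", "login", "password", "upload"}


def predict(features):
    """Label a ticket by whichever keyword list matches more often."""
    sales_score = sum(features[w] for w in SALES_WORDS)
    support_score = sum(features[w] for w in SUPPORT_WORDS)
    return "sales" if sales_score > support_score else "support"


def run_pipeline(text):
    """Chain the stages: clean -> preprocess -> featurize -> predict."""
    return predict(featurize(preprocess(clean(text))))


def accuracy(examples):
    """Evaluation: fraction of tickets whose predicted label matches."""
    correct = sum(run_pipeline(text) == label for text, label in examples)
    return correct / len(examples)


if __name__ == "__main__":
    for text, label in RAW_TICKETS:
        print(f"{run_pipeline(text):8s} (expected {label}): {text}")
    print("accuracy:", accuracy(RAW_TICKETS))
```

Note that the sketch evaluates on the same tickets the keyword lists were written for, which only shows how the stages connect. A real project would evaluate on held-out data, and deployment and monitoring are left out entirely.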
In this article, we will discuss the first two steps of the NLP pipeline, data acquisition and text cleaning, in detail with some code examples. As our running application, we will consider a model that classifies support tickets as either sales queries or support queries.