NLP, or **Natural Language Processing**, primarily deals with how machines perceive, convert and understand textual data written in human-readable languages, turning it into formats they can perform computations on. Contemporary corporations often work with huge amounts of data. That data can take a variety of forms and formats, including text documents, emails, tweets, blog posts, spreadsheets, audio recordings, JSON files, online activity logs and more. One of the most common ways such data is recorded is as text, and this text usually mirrors the natural languages we use in our day-to-day conversations, both online and offline.

Natural Language Processing (NLP) is the discipline of programming computers to process, analyze and parse large amounts of this natural textual data in order to build effective, generalizable machine learning models. But, unlike its numeric counterpart, textual data must first be preprocessed, visualized, represented and moulded into a form that models can handle effectively.

This is where Python’s NLTK, or Natural Language Toolkit, module comes into the picture. NLTK is one of the leading platforms for working with human language data. It provides ready-to-use, convenient methods for data handling and preprocessing that are most commonly deployed to mould human-readable text into a workable format.

As this article adopts a hands-on approach to using NLTK as a framework, I won’t be delving too deep into all the jargon and terminology associated with it, and will address only the terms that are either most commonly used or will appear in the implementation that follows. If you are still curious, do check out this blog by one of our Core Committee members, which explains the terms one might come across whilst working with NLP.

#data-science #python #nltk #machine-learning #nlp

NLP 101 — Data Preprocessing & Representation Using NLTK.