Gone are the days when the word “DATA” meant only a structured set of information (numbers, categories, names, etc.) represented in tabular format and held exclusively in Relational Database Management Systems (RDBMS). Advances in technology have caused a torrent of unstructured data such as audio files/signals, videos, text, images, and much more, and that has led to the development of new-age data processing techniques and algorithms.

This plethora of data sources offers unprecedented opportunities to acquire a deeper, more holistic understanding of various concepts and to make informed decisions.

Our world is now “digitized” and “datafied”. Whether you like it or not, the Internet might know you better than your loved ones.

The statistics below should give you a feel for the volume of data we are generating and the immense opportunities and challenges it offers.


Analyzing text data is now a cornerstone of analytics across many industries. For example, analyzing customer reviews and feedback on platforms such as Facebook, Twitter, blogs, and websites offers crucial information on customer sentiment, and it might even inspire a new product or service.

“My objective through this article is to pique your interest in NLP and inspire you to explore concepts such as vectorization, topic modeling, and feature engineering in depth.”

Prediction using unstructured data can get pretty complex, and it's hard to cover all topics in a single article, so I will focus on the pre-processing phase for now. Topics such as vectorization and topic modeling will be covered in my upcoming articles.


By the end of this article, you should -

A) Understand the concept of Natural Language Processing.

B) Learn the basics of the spaCy and NLTK libraries in Python.

C) Learn techniques for text cleaning and Exploratory Data Analysis (EDA) of text data.

The concepts discussed in this article are largely based on the topics below -


What is Natural Language Processing (NLP)?


In simple terms -

According to projections by IDC, 80% of the data generated by 2025 will be unstructured, meaning it will be text-heavy and will not follow any predefined data model. That’s where NLP comes into play: it gives context to massive volumes of unstructured data, which helps find the needle of insight in the haystack of information.

“Natural Language Processing refers to the host of techniques adopted to ingest and transform text data to a shape and form which computers can process.”
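To make "a shape and form which computers can process" a little more concrete, here is a minimal sketch that turns a raw sentence into word counts using only the Python standard library. The function names `tokenize` and `bag_of_words` and the sample review are made up for illustration; this is not the spaCy or NLTK workflow, just one simple way to see text become numbers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    # Keep runs of letters, digits, and apostrophes; drop punctuation and spaces.
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


# Hypothetical customer review used only for illustration.
review = "Great phone, great battery. The camera is not great."
counts = bag_of_words(review)

# Show the three most frequent tokens and their counts.
print(counts.most_common(3))
```

Once text is reduced to counts like these, it can be compared, filtered, and fed to a model. Libraries such as spaCy and NLTK offer far more careful tokenization than this regular expression does.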


How to process text data for modeling? - Natural Language Processing