Have you noticed that you can search for almost anything today and reliably find something related to it on the internet? This is possible because of the enormous amount of text data freely available to us. Naturally, you may want to use all this data in your machine learning models. The problem is that machines don't recognize and understand text the way we do. So how do we work around this?

The answer lies in Natural Language Processing. Before we can proceed to the modelling stage, unstructured text data needs to be cleaned, and tokenization is the technique used to split a phrase or a paragraph into individual words or sentences.
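
To make the idea concrete, here is a minimal sketch using NLTK's `sent_tokenize` and `word_tokenize`. NLTK is just one library choice among several, and the sample text is made up for illustration; the article's own implementations follow later.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time download of the Punkt sentence tokenizer models used by NLTK.
nltk.download("punkt")

# Hypothetical sample text, just to show what tokenization produces.
text = "Machines don't understand raw text. Tokenization breaks it into units."

# Sentence-level tokenization: split the paragraph into sentences.
print(sent_tokenize(text))
# ["Machines don't understand raw text.", 'Tokenization breaks it into units.']

# Word-level tokenization: split the text into word tokens.
print(word_tokenize(text))
# ['Machines', 'do', "n't", 'understand', 'raw', 'text', '.',
#  'Tokenization', 'breaks', 'it', 'into', 'units', '.']
```

Note how the word tokenizer does more than split on whitespace: it separates punctuation into its own tokens and splits contractions such as "don't" into "do" and "n't".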

In this article, we will start with the first step of data pre-processing, i.e., tokenization. We will then implement different methods in Python for tokenizing text data.


