Want to do NLP? Learn how to work with Text Data.

Photo by Romain Vignes on Unsplash

Companies all over the world working in artificial intelligence and machine learning are turning their attention to text and language processing models. AI is improving processes and making machines smart enough to hold a conversation with us, for example the customer-service chatbots found on many websites. We have all seen AI in science fiction, such as The Matrix, Star Trek, or J.A.R.V.I.S. in the Iron Man series. Now, with the growth of computational power, we can see AI products in our everyday lives.

Natural Language Processing is a fascinating field within data science and artificial intelligence that deals with extracting meaning from text and teaching machines to act on it. In this article, we will examine how pandas stores text data, which is the starting point for feature extraction in text analysis.

Even a simple piece of code can make text cleaner and easier to read, as shown below:

# function to remove leading and trailing dots from the text
def data(msg):
    return msg.strip(".")

msg = "....Hello World...."
print(data(msg))
# output:
# Hello World
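
The same cleanup can be applied to a whole column at once. This is a minimal sketch, assuming the text lives in a pandas Series; the sample values and the name `messages` are made up for illustration. It uses the `Series.str.strip` accessor method:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
messages = pd.Series(["....Hello World....", "..pandas..", "no dots"])

# Series.str.strip applies str.strip element-wise: only leading and
# trailing dots are removed, dots inside the text would be left alone
cleaned = messages.str.strip(".")
print(cleaned.tolist())
```

Working through the `.str` accessor avoids writing an explicit loop over the rows.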

There are two ways to store text data in pandas:

  1. **object**: a NumPy array with object dtype
  2. **string**: the StringDtype extension type

Prior to pandas 1.0, the object dtype was the only option. This was unfortunate for many reasons:

  • You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.
  • object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns.
  • When reading code, the contents of an object dtype array are less clear than ‘string’.
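
The first two points can be seen in a short sketch. The column names and values below are invented for illustration, and the object dtype is requested explicitly so the behaviour does not depend on what a particular pandas version infers by default:

```python
import pandas as pd

# Explicit object dtype: nothing stops a non-string from slipping in
mixed = pd.Series(["a", 1, None], dtype="object")
kinds = [type(v).__name__ for v in mixed]
print(kinds)

df = pd.DataFrame({
    "text": pd.Series(["a", "b"], dtype="object"),           # real strings
    "blob": pd.Series([{"k": 1}, [1, 2]], dtype="object"),   # not text at all
})

# Both columns are object dtype, so selecting "object" returns both;
# there is no way to ask for "text only" here
selected = df.select_dtypes(include="object").columns.tolist()
print(selected)
```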

For backward compatibility, object dtype remains the default type that pandas infers for a list of strings (at least through the pandas 2.x releases).

#the series is object type
import pandas as pd 
pd.Series(['a', 'b', 'c'])

#output:
0    a
1    b
2    c
dtype: object
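
To opt in to the dedicated string dtype instead, it can be requested explicitly. This is a minimal sketch using `dtype="string"` and `astype("string")`, both available since pandas 1.0; the variable names are illustrative:

```python
import pandas as pd

# Ask for the extension dtype when the Series is created ...
s1 = pd.Series(["a", "b", "c"], dtype="string")

# ... or convert an existing object-dtype Series afterwards
s2 = pd.Series(["a", "b", "c"], dtype="object").astype("string")

# Both now carry pandas' StringDtype rather than object
print(isinstance(s1.dtype, pd.StringDtype))
print(isinstance(s2.dtype, pd.StringDtype))
```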
