Natural Language Processing (NLP) powers text classification tasks such as spam detection, sentiment analysis, and document classification, as well as text generation and language translation. Text data can be treated as a sequence of characters, a sequence of words, or a sequence of sentences; for most problems it is treated as a sequence of words. In this article we will delve into pre-processing using simple example text data, though the steps discussed here apply to any NLP task. In particular, we'll use TensorFlow 2 Keras for text pre-processing, which includes tokenization and padding.
The figure below depicts the process of text pre-processing along with example outputs.
Step by step text pre-processing example starting from raw sentence to padded sequence
First, let’s import the required libraries. (A complete Jupyter notebook is available on my GitHub page.)
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
Tokenizer is an API available in TensorFlow Keras that is used to tokenize sentences. We have defined our text data as an array of strings, with each sentence separated by a comma. There are 4 sentences, the longest of which contains 5 words. Our text data also includes punctuation, as shown below.
sentences = ["I want to go out.",
             " I like to play.",
             " No eating - ",
             "No play!",
             ]
sentences
['I want to go out.', ' I like to play.', ' No eating - ', 'No play!']
As deep learning models do not understand raw text, we need to convert the text into a numerical representation. The first step is tokenization. The Tokenizer API from TensorFlow Keras splits sentences into words and encodes these words as integers. Key hyperparameters of the Tokenizer API include num_words, which caps the vocabulary at the most frequent words, and oov_token, which stands in for out-of-vocabulary words at encoding time.
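To make the full pipeline concrete, here is a minimal sketch that runs the example sentences through Tokenizer and pad_sequences; the num_words value of 100 and the "&lt;OOV&gt;" token string are illustrative choices, not requirements of the API.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["I want to go out.",
             " I like to play.",
             " No eating - ",
             "No play!",
             ]

# num_words caps the vocabulary size; oov_token replaces unseen words
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

# Words are lower-cased and punctuation is stripped before indexing,
# e.g. {'<OOV>': 1, 'i': 2, 'to': 3, 'play': 4, 'no': 5, ...}
print(tokenizer.word_index)

# Encode each sentence as a list of integer word indices
sequences = tokenizer.texts_to_sequences(sentences)

# Zero-pad (by default at the front) to the longest sequence, here 5 words
padded = pad_sequences(sequences)
print(padded.shape)
```

By default pad_sequences prepends zeros (padding='pre') and pads to the length of the longest sequence; both behaviours can be changed with the padding and maxlen arguments.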