Feature Transformation is the process of converting raw data (text, images, graphs, time series, etc.) into numerical features (vectors), so that we can perform algebraic operations on it.

Text data usually consists of documents, which can be words, sentences, or even paragraphs of free-flowing text. The inherently unstructured (no neatly formatted data columns!) and noisy nature of textual data makes it harder for machine learning methods to work directly on raw text. Hence, in this article, we will explore some of the most popular and effective strategies for transforming text data into feature vectors. These features can then easily be used to build machine learning or deep learning models.

Terminology

**Document** - A "document" is a single distinct text; this generally means an individual article, book, and so on.

**Corpus** - A corpus is a collection of documents.

**Vocabulary** - The set of unique words used in the text corpus is referred to as the vocabulary.
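To make these terms concrete, here is a tiny sketch in Python, assuming a toy two-document corpus invented purely for illustration:

```python
# corpus: a collection of documents (each string is one document)
corpus = [
    "the cat sat on the mat",  # document 1
    "the dog sat",             # document 2
]

# vocabulary: the set of unique words used across the whole corpus
vocabulary = sorted({word for doc in corpus for word in doc.split()})
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
```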

Feature Transformation Techniques

1. Bag of Words (BOW)

2. Term Frequency - Inverse Document Frequency (TF-IDF)

3. Word Embedding using Embedding Layer

4. Word to Vector (Word2Vec)

An important point to note here: before performing any of the above feature transformation techniques, it is mandatory to perform text preprocessing to standardize the data, remove noise, and reduce dimensionality. Here is my blog on text preprocessing.
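As a rough illustration, here is a minimal preprocessing sketch using only Python's standard library; a real pipeline would typically add stop-word removal, stemming, or lemmatization, and the `preprocess` helper is just an assumed name for this example:

```python
import re

def preprocess(text: str) -> str:
    text = text.lower()                       # standardize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip punctuation/noise
    return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace

print(preprocess("This movie is VERY scary... and long!!"))
# -> "this movie is very scary and long"
```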

1. Bag of Words (BOW)

This is one of the simplest vector space representation models for unstructured text. A vector space model is simply a mathematical model that represents unstructured text (or any other data) as numeric vectors, such that each dimension of the vector is a specific feature/attribute. The bag-of-words model represents each text document as a numeric vector where each dimension is a specific word from the corpus, and the value can be its frequency in the document, its occurrence (denoted by 1 or 0), or even a weighted value. The model's name comes from the fact that each document is represented literally as a 'bag' of its own words, disregarding word order, sequence, and grammar.

Let's say our corpus has 3 documents as follows:

I. This movie is very scary and long

II. This movie is not scary and is slow

III. This movie is spooky and good

First, we need to design the vocabulary, i.e. the list of all unique words (ignoring case and punctuation). We end up with the following words: 'this', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good'. Each of these words individually represents a dimension in the vector space.

Because we know the vocabulary has 11 words, we can use a fixed-length vector of 11 to represent each document, with one position in the vector to score each word. The simplest scoring method is to mark the presence of a word as a boolean value: 0 for absent, 1 for present. For example, "This movie is very scary and long" can be represented as [1 1 1 1 1 1 1 0 0 0 0]. In the same way we can represent all the documents and arrive at the document-term matrix below.

| Document | this | movie | is | very | scary | and | long | not | slow | spooky | good |
|----------|------|-------|----|------|-------|-----|------|-----|------|--------|------|
| I        | 1    | 1     | 1  | 1    | 1     | 1   | 1    | 0   | 0    | 0      | 0    |
| II       | 1    | 1     | 1  | 0    | 1     | 1   | 0    | 1   | 1    | 0      | 0    |
| III      | 1    | 1     | 1  | 0    | 0     | 1   | 0    | 0   | 0    | 1      | 1    |
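To make this concrete, here is a minimal sketch of the bag-of-words transform using scikit-learn's `CountVectorizer`, one common off-the-shelf implementation (the article itself does not prescribe a specific library):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three documents from the example above.
corpus = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# binary=True scores presence/absence (1/0) as in the matrix above;
# drop it to get raw counts ('is' would then score 2 in document II).
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)

# Note: CountVectorizer orders the vocabulary alphabetically, so the
# columns will not match the word order used in the prose above.
print(vectorizer.get_feature_names_out())
print(X.toarray())
```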

