This article is based on Week 2 of the Sequence Models course on Coursera. In it, I try to summarise and explain the concepts of word representation and word embeddings.

Word Representation:

Generally, in natural language processing we represent a word through a vocabulary, where every word is represented by a one-hot encoded vector. Suppose we have a vocabulary (V) of 10,000 words.

V = [a, aaron, …, zulu, <UNK>]

Suppose the word ‘Man’ is at position 5391 in the vocabulary; then it can be represented by the one-hot encoded vector O₅₃₉₁. The position of the 1 in the sparse vector O₅₃₉₁ is the index of the word ‘Man’ in the vocabulary.

O₅₃₉₁ = [0, 0, 0, 0, …, 1, …, 0, 0, 0]

Similarly, the other words in the vocabulary can be represented by one-hot encoded vectors: Woman (O₉₈₅₃), King (O₄₉₁₄), Queen (O₇₁₅₇), Apple (O₄₅₆), Orange (O₆₂₅₇).
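To make this concrete, here is a minimal NumPy sketch of one-hot encoding. The word-to-index map uses the example indices above; a real system would build it from the full 10,000-word vocabulary.

```python
import numpy as np

VOCAB_SIZE = 10_000

# Example indices from the article; a real map would cover the whole vocabulary.
word_to_index = {"man": 5391, "woman": 9853, "king": 4914,
                 "queen": 7157, "apple": 456, "orange": 6257}

def one_hot(word: str) -> np.ndarray:
    """Return the sparse one-hot vector O_i for a word in the vocabulary."""
    vec = np.zeros(VOCAB_SIZE)
    vec[word_to_index[word]] = 1.0
    return vec

o_man = one_hot("man")  # all zeros except a 1 at position 5391
```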

But this is not an effective way to represent words when learning sequence models, because the algorithm cannot capture any relationship between different words.

Suppose we train our model for the sentence:

I want a glass of orange juice.

And want to predict the next word for the sentence:

I want a glass of apple _____.

Even though the two examples are almost the same and our algorithm is well trained, it will fail to predict the next word in the test example. The reason is that with one-hot encoded vectors the inner product between any two distinct vectors is 0, and the Euclidean distance between any two distinct vectors is the same constant (√2), so the representation says nothing about how similar two words are.
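A quick numeric check of this claim, using a toy one-hot encoder: the inner product between any two distinct one-hot vectors is 0, and the Euclidean distance is the same constant for every pair, so the representation carries no similarity information at all.

```python
import numpy as np

def one_hot(index: int, size: int = 10_000) -> np.ndarray:
    """Build a sparse one-hot vector with a 1 at the given index."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

apple, orange, king = one_hot(456), one_hot(6257), one_hot(4914)

print(np.dot(apple, orange))           # 0.0 -- no similarity signal
print(np.linalg.norm(apple - orange))  # 1.4142... (sqrt 2)
print(np.linalg.norm(apple - king))    # 1.4142... -- identical for every pair
```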

We know that the next word should be ‘juice’ in our example, but the algorithm cannot find any relationship between the words of the two sentences above, so it fails to predict the missing word.

To solve this problem we use word embeddings, which are featurized representations of words: for each word in the vocabulary, we learn a set of features and their values.
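To give a feel for what such a featurized representation might look like, here is a toy embedding table in the spirit of the course’s example features (gender, royal, age, food). The numbers are invented for illustration, not learned values; in practice the features are learned automatically and are rarely this interpretable.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; columns loosely mean
#            gender  royal    age    food
embeddings = {
    "man":    [-1.00,  0.01,  0.03,  0.04],
    "woman":  [ 1.00,  0.02,  0.02,  0.01],
    "king":   [-0.95,  0.93,  0.70,  0.02],
    "queen":  [ 0.97,  0.95,  0.69,  0.01],
    "apple":  [ 0.00, -0.01,  0.03,  0.95],
    "orange": [ 0.01,  0.00, -0.02,  0.97],
}

apple = np.array(embeddings["apple"])
orange = np.array(embeddings["orange"])

# Unlike one-hot vectors, related words now have similar vectors,
# so a model can generalise from "orange juice" to "apple juice".
print(np.dot(apple, orange))  # clearly positive -- apple and orange are close
```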

#sequence-model #coursera #nlp #deep-learning #machine-learning
