In this article we will study word embeddings: numerical representations of words suitable for processing by machine learning algorithms.

Originally I created this article as a general overview and compilation of the current (as of 2020) approaches to word embedding, which our AI Labs team could use from time to time as a quick refresher. I hope it will be useful to a wider circle of data scientists and developers. Each word embedding method in the article has a (very) short description, links for further study, and code examples in Python. All the code is packed into a Google Colab notebook. So let's begin.

According to Wikipedia, word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

One-hot or CountVectorizing

The most basic method for transforming words into vectors is to count the occurrences of each word in each document. This approach is called count vectorizing or one-hot encoding.

The main principle of this method is to collect a set of documents (these can be words, sentences, paragraphs or even whole articles) and count the occurrence of every word in each document. In the resulting matrix, the columns are words and the rows are documents.
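Before turning to a library, the principle can be shown in a few lines of plain Python. This is a minimal sketch with deliberately naive whitespace tokenization, and the two toy documents are made up for illustration:

## a hand-rolled count vectorizer, for illustration only
docs = ['red red blue', 'blue green']
## vocabulary: all unique words across the documents, sorted
vocab = sorted({word for doc in docs for word in doc.split()})
## one row per document, one column per vocabulary word
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]
print(vocab)   # ['blue', 'green', 'red']
print(matrix)  # [[1, 0, 2], [1, 1, 0]]

In practice, scikit-learn's CountVectorizer does this work for us, handling tokenization and returning a sparse matrix: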

from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

## create CountVectorizer object
vectorizer = CountVectorizer()
corpus = [
    'Text of the very first new sentence with the first words in sentence.',
    'Text of the second sentence.',
    'Number three with lot of words words words.',
    'Short text, less words.',
]
## learn the vocabulary and store the CountVectorizer sparse matrix in term_frequencies
term_frequencies = vectorizer.fit_transform(corpus)
## get_feature_names_out() requires scikit-learn >= 1.0; older versions used get_feature_names()
vocab = vectorizer.get_feature_names_out()
## convert sparse matrix to numpy array
term_frequencies = term_frequencies.toarray()
## visualize term frequencies as a heatmap
sns.heatmap(term_frequencies, annot=True, cbar=False, xticklabels=vocab);
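The heatmap is convenient in a notebook, but the same matrix can also be inspected as a plain table, for example with pandas (a short sketch, assuming pandas is installed):

import pandas as pd
## rows are documents, columns are vocabulary words
df = pd.DataFrame(term_frequencies, columns=vocab)
print(df)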

