Embedding of Text Documents Using the TensorFlow Universal Sentence Encoder and Spark on EMR

How to run large-scale inference on multilingual text using a powerful pre-trained model from TensorFlow Hub.

TensorFlow Hub offers a variety of pre-trained models that are ready to use for inference. One very powerful model is the Multilingual Universal Sentence Encoder, which embeds bodies of text written in any of its supported languages into a common numerical vector representation.

Embedding text is a very powerful natural language processing (NLP) technique for extracting features from text fields. Those features can be used to train other models or for data analysis tasks such as clustering documents or building search engines based on semantic similarity rather than exact word matches.
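To make the search use case concrete, here is a minimal sketch of ranking documents by cosine similarity to a query. The three-dimensional vectors and the document names are made up for illustration; real sentence embeddings have far more dimensions and would come from the encoder rather than being typed by hand.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real sentence embeddings (assumption:
# the names and values below are invented for this example).
doc_embeddings = {
    "doc_en": [0.9, 0.1, 0.0],
    "doc_it": [0.8, 0.2, 0.1],
    "doc_other": [0.0, 0.1, 0.9],
}
query_embedding = [1.0, 0.0, 0.0]

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_embeddings,
    key=lambda name: cosine_similarity(query_embedding, doc_embeddings[name]),
    reverse=True,
)
print(ranked)
```

Because texts with similar meaning are mapped to nearby vectors regardless of language, the same ranking logic can work across documents written in different languages.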

Unfortunately, if we have billions of text records to encode, the job might take several days to run on a single machine.

In this tutorial, I will show how to leverage Spark to distribute the work. In particular, we will use Amazon EMR (Elastic MapReduce), the AWS-managed cluster service, to apply the sentence encoder to a large dataset and complete the job in a couple of hours.
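As a rough sketch of the idea, the work can be split into partitions, with each partition encoded in small batches. The code below is plain Python so it runs anywhere; `fake_encode` is a hypothetical stand-in for the real sentence encoder, and `embed_partition` and `batch_size` are names chosen only for this illustration.

```python
from itertools import islice

def fake_encode(sentences):
    """Hypothetical stand-in for the sentence encoder.

    Returns one tiny vector per sentence: [character count, word count].
    """
    return [[float(len(s)), float(s.count(" ") + 1)] for s in sentences]

def embed_partition(rows, encode=fake_encode, batch_size=2):
    """Yield (sentence, vector) pairs, encoding one batch at a time.

    `rows` can be any iterable of strings, so the whole partition never
    has to be held in memory at once.
    """
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        for sentence, vector in zip(batch, encode(batch)):
            yield sentence, vector

# One "partition" of sentences, as a worker might receive it.
partition = ["hello world", "ciao", "guten Morgen", "bonjour", "hola"]
results = list(embed_partition(partition))
for sentence, vector in results:
    print(sentence, vector)
```

In PySpark, a function of this shape (an iterator in, an iterator out) could be passed to `rdd.mapPartitions`, so that an expensive model is loaded once per partition instead of once per row. That wiring is an assumption about how one might structure the job, not a description of the exact code used later in this tutorial.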

