Document embeddings and text classification without coding

In this post, we explain what document embedding is, why it is useful, and demonstrate its use in a classification example, all without coding. For the analysis, we will use the open-source tool Orange.

Text is a sequence of characters. Since machine learning algorithms work with numbers, we must transform text into vectors of real numbers before we can continue with the analysis. There are various approaches to this. Before the rise of deep learning, the best-known approach was the bag of words, which is still widely used because of its advantages. The recent boom in deep learning brought new approaches such as word and document embeddings.

Word embedding and document embedding

Before we can understand document embeddings, we need to understand the concept of word embeddings. A word embedding is a representation of a word in a multidimensional space such that words with similar meanings have similar embeddings. Each word is mapped to a vector of real numbers that represents it. Embedding models are mostly based on neural networks.
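Although Orange lets you work with embeddings without writing any code, the idea of "similar meanings, similar vectors" can be sketched in a few lines of Python. The words and tiny 3-dimensional vectors below are invented purely for illustration; real models such as fastText learn 300-dimensional vectors from large corpora:

```python
import numpy as np

# Toy "embeddings" -- hand-made vectors, not learned by any model.
vectors = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: close to 1 for vectors pointing in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with similar meanings get similar embeddings...
print(cosine(vectors["king"], vectors["queen"]))  # high, close to 1
# ...while unrelated words end up further apart.
print(cosine(vectors["king"], vectors["apple"]))  # much lower
```

Cosine similarity is a common way to compare embeddings because it measures the angle between vectors rather than their length.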

A document embedding is usually computed from word embeddings in two steps: first, each word in the document is mapped to its word embedding; then the word embeddings are aggregated. The most common aggregation is the average over each dimension.
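The two steps above can be sketched as follows. This is a minimal illustration, not Orange's actual implementation: the word vectors are invented and only 4-dimensional, whereas the fastText vectors Orange uses have 300 dimensions:

```python
import numpy as np

# Step 1 input: a toy lookup table of word embeddings (invented values).
word_vectors = {
    "machine":  np.array([0.2, 0.5, -0.1, 0.7]),
    "learning": np.array([0.1, 0.4,  0.0, 0.6]),
    "is":       np.array([0.0, 0.1,  0.1, 0.0]),
    "fun":      np.array([0.3, -0.2, 0.5, 0.1]),
}

def embed_document(tokens, vectors):
    """Step 1: look up each word's embedding. Step 2: average over each dimension."""
    found = [vectors[t] for t in tokens if t in vectors]
    return np.mean(found, axis=0)

doc = "machine learning is fun".split()
print(embed_document(doc, word_vectors))  # a single 4-dimensional vector
```

Note that the result always has the same dimensionality as the word vectors, no matter how long the document is.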

Why and when should we use embedders?

Compared to the bag-of-words approach, which counts the number of appearances of each token (word) in a document, embeddings have three main advantages:

  • They do not suffer from the dimensionality problem. Bag-of-words produces a table with as many features as there are unique tokens across all documents in a corpus. Large corpora with long texts therefore yield huge tables that can exceed the computer's memory and increase the training and evaluation time of machine learning models. Embeddings have a constant vector dimensionality (300 for the fastText embeddings that Orange uses).
  • Most of the preprocessing is not required. With bag-of-words, we tackle the dimensionality problem through text preprocessing, removing tokens (e.g. words) that seem less important for the analysis; this can also remove some important tokens. With embedders, we do not need to remove tokens, so no accuracy is lost this way. In the case of fastText embeddings, most of the basic preprocessing (such as normalization) can also be omitted.
  • Embeddings can be pretrained on large corpora with billions of tokens. That way, they capture significant characteristics of the language and produce high-quality embeddings. Pretrained models are then used to obtain embeddings of smaller datasets.
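The dimensionality contrast from the first point can be sketched in plain Python. The tiny corpus below is invented for illustration; the point is that the bag-of-words table widens with every new unique token, while an embedding table would keep a constant width (300 columns for the fastText vectors Orange uses):

```python
# A toy corpus -- three short "documents".
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "embeddings keep a fixed dimensionality",
]

# The vocabulary: every unique token across the whole corpus.
vocabulary = sorted({token for doc in corpus for token in doc.split()})

# Bag-of-words: one row per document, one column per unique token.
bow_table = [
    [doc.split().count(token) for token in vocabulary]
    for doc in corpus
]

print(len(vocabulary))   # grows as the corpus grows
print(len(bow_table[0])) # each row is as wide as the vocabulary
# With embeddings, each row would instead have a constant 300 columns,
# regardless of how many unique tokens the corpus contains.
```

On a real corpus with tens of thousands of documents, this vocabulary easily reaches hundreds of thousands of tokens, which is exactly the table-size problem described above.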

The shortcoming of embedders is that they are difficult to interpret. With bag-of-words, for example, we can easily observe which tokens are important for classification, since the tokens themselves are the features. With document embeddings, the features are numbers that are not understandable to humans on their own.
