Stacked Capsule Autoencoders

Introduction

During the last few years, Geoffrey Hinton and a team of researchers started working on a revolutionary new type of neural network based on Capsules.

Some of the main motivations behind this study are that current neural networks like Convolutional Neural Networks (CNNs) are able to achieve the state of the art accuracy in computer vision tasks such as object detection only if provided a large amount of data.

One of the main reason why models like CNNs require such a large amount of data is their inability to capture orientational and spatial relationships between the different elements which compose an image. In fact, one of the main techniques used in order to improve CNNs performances is Data Augmentation. When applying Data Augmentation, we help our model learn more in-depth and in a more general way what characterises different objects by creating additional data from the original one by for example rotating, cropping, flipping, etc… the original images. In this way, our model will more likely be able to recognise the same object even if seen from a different perspective (Figure 1).

Image for post

Figure 1: Viewing the same object from a different perspective can cause misconception [1].

CNNs are able to detect objects by first identifying edges and shapes in an image and then combining them together. This approach although does not take into account spacial hierarchies which construct the overall image, and therefore leading to the need of creating large datasets in order to perform well (increasing therefore also the computational cost necessary in order to train the model).

Capsules

Geoffrey Hinton approach of using Capsules closely follow instead the principle of Inverse Graphic. In fact, according to Hinton, every time our brain processes a new object, its representation does not depend on the viewing angle. Therefore, in order to create models able to perform object recognition as good as our brain can do, we need to be able to capture the hierarchical relationship of the different parts which compose an object and relate them with respect to a frame of coordinates.

This can be achieved by basing our network on a structure called Capsule. A capsule is a data structure incorporating in a vector form all the main information of the feature we are detecting. Its main constituents are:

A logistic unit which represents if a shape exists in an image.
A matrix representing the pose of the shape.
A vector embedding other information such as colour, deformations, etc…

Different approaches have been proposed by Hinton research team during the last few years in order to create Capsule Network such as:

2017: Dynamic Routing (discriminative learning and part-whole relationships).
2018: Expectation–maximization Algorithm (discriminative learning and part-whole relationships).
2019: Stacked Capsule Networks (unsupervised Learning, whole parts relationships).

In Capsule Networks, the different neurons compete with each other in order to find agreeing parts which compose objects in images. Three different approaches can be used in order to measure agreements between different capsules:

Using cosine distance as a measure of agreement.
Expectation-Maximization.
Mixture Models.

#data-science #machine-learning #artificial-intelligence #research #towards-data-science

Introduction

Capsules

towardsdatascience.com

Stacked Capsule Autoencoders