Image retrieval is the task of finding images related to a given query. Content-based image retrieval refers to finding images that contain certain attributes which are not stored in the image metadata, but are present in its visual content.

In this post we:

  • explain the theoretical concepts behind content-based image retrieval,
  • show step by step how to build a content-based image retrieval system with PyTorch, addressing a specific application: finding face images with a given set of face attributes (e.g. male, blond, smiling).

Concepts explained that might be of interest:

Ranking Loss, Contrastive Loss, Siamese Nets, Triplet Nets, Triplet Loss, Image Retrieval

1. Content-based image retrieval: how to build it at a high level

In order to find the images closest to a given query, an image retrieval system needs to:

  • compute a similarity score between all the images in the test set (often called the retrieval set) and the query,
  • rank all those images by their similarity with the query,
  • return the top ones, as sketched below.
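
Once embeddings are available (more on how to learn them below), these three steps amount to a nearest-neighbour search. Here is a minimal sketch, assuming precomputed `image_embeddings` and a `query_embedding`, and using cosine similarity as the score; the function name and the similarity choice are illustrative, not fixed by the post:

```python
import torch
import torch.nn.functional as F

def retrieve(query_embedding, image_embeddings, k=5):
    """Return the indices of the k retrieval-set images closest to the query.

    query_embedding:  tensor of shape (d,)
    image_embeddings: tensor of shape (n_images, d)
    """
    # 1. Similarity score between the query and every image in the retrieval set
    scores = F.cosine_similarity(query_embedding.unsqueeze(0), image_embeddings, dim=1)
    # 2. Rank all images by their similarity with the query
    ranked = torch.argsort(scores, descending=True)
    # 3. Return the top ones
    return ranked[:k]
```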

A common strategy to learn those similarities is to learn representations (often called embeddings) of images and queries in the same vector space (often called the embedding space).

In our example, that would be learning embeddings of face images and vectors encoding face attributes in the same space.
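
For instance, a query for "male, blond, smiling" can be represented as a binary attribute vector before it is embedded; the attribute set and ordering below are illustrative assumptions:

```python
import torch

# One entry per face attribute: 1 = requested, 0 = not requested
# (the attribute set and its ordering are illustrative assumptions)
attributes = ["male", "blond", "smiling", "eyeglasses"]
query_attributes = torch.tensor([1.0, 1.0, 1.0, 0.0])  # male, blond, smiling, no eyeglasses
```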

(Figure: CelebA high-level retrieval)

Neural networks are used to learn the aforementioned embeddings. In our case, a Convolutional Neural Network (CNN) is used to learn the image embeddings, and a Multilayer Perceptron (MLP), which is a stack of fully connected layers, is used to learn the attribute vector embeddings.
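
A minimal sketch of the two branches, assuming a ResNet-18 backbone, a 40-dimensional attribute vector (as in CelebA), and a 128-dimensional embedding space; all three are illustrative choices rather than values dictated by the post:

```python
import torch.nn as nn
import torchvision.models as models

class ImageEmbeddingNet(nn.Module):
    """CNN branch: maps a face image to a point in the embedding space."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        # Replace the classification head with a projection into the embedding space
        backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)
        self.backbone = backbone

    def forward(self, images):          # images: (batch, 3, H, W)
        return self.backbone(images)    # (batch, embedding_dim)

class AttributeEmbeddingNet(nn.Module):
    """MLP branch: maps an attribute vector to the same embedding space."""
    def __init__(self, n_attributes=40, embedding_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_attributes, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim),
        )

    def forward(self, attributes):      # attributes: (batch, n_attributes)
        return self.mlp(attributes)     # (batch, embedding_dim)
```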

Those networks are set up in a siamese fashion and trained with a ranking loss (a triplet loss, in our case). We explain those concepts in depth next.

2. Architectures and losses

Ranking losses: triplet loss

Ranking losses aim to learn **relative distances between samples**, a task that is often called metric learning.

To do so, they compute a distance (e.g. the Euclidean distance) between sample representations and optimize the model to minimize it for similar samples and maximize it for dissimilar samples. As a result, the model ends up learning similar representations for the samples you have defined as similar, and distant representations for the samples you have defined as dissimilar.
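
As a concrete sketch using PyTorch's built-in `nn.TripletMarginLoss` (the margin value and the random embeddings standing in for real anchor/positive/negative triplets are illustrative assumptions):

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.2, p=2)  # p=2 -> Euclidean distance

# anchor:   embeddings of attribute queries (e.g. "male, blond, smiling")
# positive: embeddings of face images that match those attributes
# negative: embeddings of face images that do not
anchor   = torch.randn(32, 128, requires_grad=True)
positive = torch.randn(32, 128, requires_grad=True)
negative = torch.randn(32, 128, requires_grad=True)

# The loss reaches zero once each positive is closer to its anchor
# than the corresponding negative is, by at least the margin
loss = triplet_loss(anchor, positive, negative)
loss.backward()
```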

