Self-Supervised Learning Methods for Computer Vision

Self-supervised learning is an unsupervised learning method in which a supervised learning task is constructed from the unlabelled input data itself.
The task can be as simple as predicting the lower half of an image given its upper half, or predicting the RGB channels of a color image given its grayscale version.
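To make the grayscale-to-color example concrete, here is a minimal sketch (PyTorch, with illustrative tensor shapes) of how unlabelled images yield supervised (input, target) pairs for free:

```python
import torch

# Minimal sketch of the colorization pretext task: the grayscale version of
# each image is the input, its original RGB channels are the target.
images = torch.rand(16, 3, 32, 32)             # unlabelled RGB images
luma = torch.tensor([0.299, 0.587, 0.114])     # standard grayscale weights
grayscale = (images * luma.view(1, 3, 1, 1)).sum(dim=1, keepdim=True)

# Supervised pairs created without any human annotation.
inputs, targets = grayscale, images
```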

#self-supervised-learning #representation-learning #deep-learning #computer-vision #unsupervised-learning


Self-supervised Representation Learning in Computer Vision — Part 2

Part 1 of the series looked at representation learning and how self-supervised learning can alleviate the problem of data inefficiency in learning representations of images.

This is achieved through “contrastive learning”, a design paradigm for learning similarities and distinctions. The paradigm boils down to making a model understand that similar things should lie close together in representation space, while dissimilar things should lie far apart.

In this article, I will review the following two architectures:

  1. Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)
  2. A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)

I plan to cover the core concepts and some of the observations that I found interesting. For more details, please refer to the original papers.


Contrastive Loss (InfoNCE)

We can theorize that contrastive learning requires a loss that maximizes the similarity between the query and its positive samples while minimizing its similarity to negative samples. Let us assume the following:

  1. Query (q) — the augmented image of interest
  2. Positive sample (k₊) — all samples similar to the query
  3. Negative sample (k₋) — samples dissimilar to the query

Van den Oord et al. proposed a loss based on Noise-Contrastive Estimation, called InfoNCE, which, in the context of this paradigm, is:

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_{+}/\tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i/\tau)}$$

InfoNCE loss, where τ is a temperature hyperparameter.

Here, we consider the encoded query vector as **q** and the dictionary of encoded **keys** as {k₀, k₁, …}, containing one positive and K negative samples with respect to the query.

We are looking at the softmax loss of a (K+1)-way classifier, where our objective is to classify q as k₊.

The value of the loss is minimized if q is as close to k₊ as possible, while simultaneously being far away from the other K negative vectors in the dictionary.
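As a rough sketch, the loss can be written in a few lines of PyTorch. The function name and tensor shapes below are illustrative, assuming l2-normalized queries and keys and a shared set of K negatives:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_neg, tau=0.07):
    """InfoNCE loss for one batch.

    q:     (N, D) encoded, l2-normalized queries
    k_pos: (N, D) positive keys, one per query
    k_neg: (K, D) negative keys shared across the batch
    """
    # Positive logits: similarity between each query and its positive -> (N, 1)
    l_pos = torch.einsum("nd,nd->n", q, k_pos).unsqueeze(-1)
    # Negative logits: similarity between each query and all K negatives -> (N, K)
    l_neg = torch.einsum("nd,kd->nk", q, k_neg)
    # (K+1)-way classification: the positive key is class 0 for every query
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

Note how the positive key occupies class index 0, making this exactly the softmax loss of the (K+1)-way classifier described above.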


Earlier Contrastive Learning Architectures

Figure: Contrastive learning using an end-to-end update

One school of thought is an **end-to-end**, backpropagation-based approach, with two encoders: one generating the *query* of interest, **q**, and another computing the dictionary *keys*, **k**, which are taken from the current mini-batch.

Although this works in practice, the algorithm is gated by GPU memory: the dictionary is basically the mini-batch, so its size cannot grow beyond what fits on the device.
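To make the memory constraint concrete, here is a toy sketch of the end-to-end setup. Linear encoders and random tensors stand in for real backbones and augmented views; this is an illustration, not the papers' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoders; in practice these would be deep backbones (e.g. ResNets).
encoder_q = nn.Linear(32, 16)
encoder_k = nn.Linear(32, 16)
optimizer = torch.optim.SGD(
    list(encoder_q.parameters()) + list(encoder_k.parameters()), lr=0.1
)

x_q, x_k = torch.randn(8, 32), torch.randn(8, 32)  # two augmented views
q = F.normalize(encoder_q(x_q), dim=1)
k = F.normalize(encoder_k(x_k), dim=1)  # gradients flow into both encoders

# In-batch dictionary: each query's positive is its own second view; the
# other N-1 keys act as negatives, so dictionary size == batch size.
logits = q @ k.t() / 0.07
labels = torch.arange(q.shape[0])
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

Because the keys come from the same mini-batch and gradients flow through both encoders, enlarging the dictionary means enlarging the batch, which is exactly what GPU memory gates.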

#self-supervised-learning #deep-learning #computer-vision

Train without labeling data using Self-Supervised Learning by Relational Reasoning

Background and challenges 📋

In modern deep learning algorithms, dependence on manual annotation of data is one of the major limitations. To train a good model, we usually have to prepare a vast amount of labeled data. When the number of classes and samples is small, we can use a model pre-trained on a labeled public dataset and fine-tune its last few layers on our data. However, in real life you easily run into problems when your data is considerably large (products in a store, human faces, …) and it becomes difficult for the model to learn with just a few trainable layers. Furthermore, the amount of unlabeled data (e.g. document text, images on the Internet) is enormous. Labeling all of it is almost impossible, yet not utilizing it is definitely a waste.

In this case, training a deep model from scratch on a new dataset is an option, but it takes a lot of time and effort to label the data, while a pre-trained deep model may no longer help. That is why self-supervised learning was born. The idea behind it is simple, and it involves two main tasks:

  • **Surrogate task:** the deep model learns generalizable representations from unlabeled data without annotation, self-generating a supervisory signal that exploits implicit information in the data.
  • **Downstream task:** the representations are fine-tuned for supervised learning tasks, e.g. classification and image retrieval, with a smaller amount of labeled data (how much labeled data is needed depends on the performance you require). A minimal sketch of this two-stage flow follows this list.
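The sketch below uses rotation prediction as a stand-in surrogate task (one of many possible pretext tasks, not specific to the paper discussed here), with illustrative shapes throughout:

```python
import torch
import torch.nn as nn

# Shared backbone whose weights the surrogate task pre-trains.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())

# --- Surrogate task: the supervisory signal is generated from the data itself.
# Here (as an illustrative stand-in): predict which of 4 rotations was applied.
rotation_head = nn.Linear(256, 4)
images = torch.randn(16, 3, 32, 32)              # unlabeled images
rot_labels = torch.randint(0, 4, (16,))          # self-generated labels
rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                       for img, r in zip(images, rot_labels)])
pretext_loss = nn.functional.cross_entropy(rotation_head(backbone(rotated)),
                                           rot_labels)
pretext_loss.backward()                          # trains the backbone for free

# --- Downstream task: fine-tune the same backbone with few labeled examples.
classifier = nn.Linear(256, 10)
labeled, labels = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
downstream_loss = nn.functional.cross_entropy(classifier(backbone(labeled)),
                                              labels)
```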

Many different training approaches have been proposed to learn such representations:

  • **Relative position [1]:** the model needs to understand the spatial context of objects in order to tell the relative position between parts.
  • **Jigsaw puzzle [2]:** the model needs to place 9 shuffled patches back in their original locations.
  • **Colorization [3]:** the model is trained to color a grayscale input image; precisely, the task is to map this image to a distribution over quantized color values.
  • **Counting features [4]:** the model learns a feature encoder using the feature-counting relationships of input images transformed by *Scaling* and *Tiling*.
  • **SimCLR [5]:** the model learns representations of visual inputs by maximizing agreement between differently augmented views of the same sample via a contrastive loss in the latent space.

However, I would like to introduce one interesting approach that is able to recognize things the way a human does. A key factor in human learning is the acquisition of new knowledge by comparing related and different entities. It is therefore natural to ask whether a similar mechanism can be exploited in self-supervised machine learning, via the relational reasoning approach [6].

The relational reasoning paradigm is based on a key design principle: a relation network is used as a learnable function on the unlabeled dataset to quantify the relationships between views of the same object (intra-reasoning) and between different objects in different scenes (inter-reasoning). The approach was evaluated on standard datasets (CIFAR-10, CIFAR-100, CIFAR-100-20, STL-10, tiny-ImageNet, SlimageNet), across learning schedules and backbones (both shallow and deep). The results reported in the paper [6] show that the relational reasoning approach largely outperforms the best competitor in all conditions, by an average of 14% accuracy, and the most recent state-of-the-art method by 3%.
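A minimal sketch of the idea in PyTorch; the toy encoders, random tensors, and the simple batch-roll pairing below are illustrative simplifications, not the exact aggregation scheme of [6]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
# Relation head scores a pair of representations: 1 = same object, 0 = different.
relation_head = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1))

# Two augmented views of the same batch of images (random stand-ins here).
view1, view2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
z1, z2 = backbone(view1), backbone(view2)

# Intra-reasoning: pairs of views of the same image -> target 1.
pos_pairs = torch.cat([z1, z2], dim=1)
# Inter-reasoning: views of different images (roll the batch) -> target 0.
neg_pairs = torch.cat([z1, torch.roll(z2, shifts=1, dims=0)], dim=1)

pairs = torch.cat([pos_pairs, neg_pairs], dim=0)
targets = torch.cat([torch.ones(8), torch.zeros(8)]).unsqueeze(1)
loss = F.binary_cross_entropy_with_logits(relation_head(pairs), targets)
```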

#deep-learning #computer-vision #machine-learning #self-supervised-learning #data-science

Why you should learn Computer Vision and how you can get started

I. Motivation

In today’s world, Computer Vision technologies are everywhere. They are embedded within many of the tools and applications that we use on a daily basis. However, we often pay little attention to the underlying Computer Vision technologies because they tend to run in the background. As a result, only a small fraction of those outside the tech industries know about the importance of these technologies. The goal of this article, therefore, is to provide an overview of Computer Vision to those with little to no knowledge of the field. I attempt to achieve this goal by answering three questions: What is Computer Vision? Why should you learn Computer Vision? And how can you get started?

II. What is Computer Vision?

Figure 1: Portrait of Larry Roberts.

The field of Computer Vision dates back to the 1960s, when Larry Roberts, now widely considered the “Father of Computer Vision”, published his paper *Machine Perception of Three-Dimensional Solids*, detailing how a computer can infer 3D shapes from a 2D image (Roberts, 1995). Since then, other researchers have made amazing contributions to the field. These advances, however, have not changed the underlying goal of Computer Vision, which is to mimic the human visual system. From an engineering point of view, this means being able to build autonomous systems that can do the things a human visual system can do, such as detecting and recognizing objects, faces, and facial expressions (Huang, 1996).

Traditionally, many approaches in Computer Vision involve manual feature extraction: manually finding unique features/characteristics (edges, shapes, etc.) that are only present in an object, in order to detect and recognize what that object is. Unfortunately, a major issue arises when trying to detect and recognize variations (in size, lighting conditions, etc.) of that same object: it is difficult to find features that uniquely identify an object across all variations.

Fortunately, this problem is now largely solved by Machine Learning, particularly the sub-field called Deep Learning. Deep Learning utilizes a form of Neural Networks called Convolutional Neural Networks (CNNs). Unlike traditional methods, methods that utilize CNNs extract features automatically: instead of manually figuring out which features can represent an object, a CNN learns those features by looking at many variations of that same object. As a result, many recent advances in the field of Computer Vision involve the use of CNNs.

#computer-science #machine-learning #deep-learning #computer-vision #learning
