Self-supervised Representation Learning on Videos

Nowadays, transfer learning from ImageNet is the de facto standard in computer vision. Self-supervised learning dominates natural language processing, but that does not mean there are no significant computer vision use-cases where it should be considered.
There are indeed a lot of cool self-supervised tasks that one can devise when one deals with images, such as jigsaw puzzles [6], image colorization, image inpainting, or even unsupervised image synthesis.
But what happens when the time dimension comes into play? How can you approach the video-based tasks that you would like to solve?
So, let’s start from the beginning, one concept at a time. What is self-supervised learning? And how is it different from transfer learning? What is a pretext task?

#self-supervised-learning #video-processing #machine-learning #deep-learning


Self-Supervised Learning Methods for Computer Vision

Self-supervised learning is an unsupervised learning method in which a supervised learning task is created out of the unlabelled input data itself.
This task could be as simple as predicting the lower half of an image given its upper half, or predicting the RGB channels of a colored image given its grayscale version.
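As a concrete illustration, here is a minimal sketch of the second example, building (grayscale input, RGB target) pairs for a colorization pretext task. The code is PyTorch and the helper name is hypothetical:

```python
import torch

def make_colorization_pairs(images):
    """Build (input, target) pairs for a colorization pretext task.

    images: float tensor of shape (N, 3, H, W) with RGB values in [0, 1].
    The grayscale version is the model input and the original RGB image is
    the target, so the supervisory signal comes from the data itself.
    """
    # Standard luma weights approximate perceived brightness.
    weights = torch.tensor([0.299, 0.587, 0.114], device=images.device)
    gray = (images * weights.view(1, 3, 1, 1)).sum(dim=1, keepdim=True)
    return gray, images

# Usage: a model is then trained to predict `targets` from `inputs`.
inputs, targets = make_colorization_pairs(torch.rand(16, 3, 32, 32))
```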

#self-supervised-learning #representation-learning #deep-learning #computer-vision #unsupervised-learning

Train without labeling data using Self-Supervised Learning by Relational Reasoning

Background and challenges 📋

In modern deep learning algorithms, dependence on manual annotation is one of the major limitations. To train a good model, we usually have to prepare a vast amount of labeled data. With a small number of classes and samples, we can take a model pre-trained on a labeled public dataset and fine-tune the last few layers on our own data. However, in real life we quickly face problems when the data is considerably large (the products in a store, human faces, …), and it becomes difficult for the model to learn with just a few trainable layers. Furthermore, the amount of unlabeled data (e.g. document text, images on the Internet) is uncountable; labeling all of it for the task is almost impossible, but not utilizing it is definitely a waste.

In this case, training a deep model from scratch on the new dataset is an option, but it takes a lot of time and effort to label the data, while a pre-trained deep model no longer seems helpful. That is why self-supervised learning was born. The idea behind it is simple, and it serves two main tasks:

  • **Surrogate task:** the deep model learns generalizable representations from unlabeled data without annotation, and is able to self-generate a supervisory signal by exploiting implicit information.
  • **Downstream task:** the representations are fine-tuned for supervised learning tasks, e.g. classification and image retrieval, with a smaller number of labeled samples (how many depends on the performance you require); a minimal sketch of this two-stage pipeline follows the list.
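
Below is a minimal PyTorch sketch of the two stages. It uses rotation prediction as the surrogate task for concreteness (any of the pretext tasks discussed here would work just as well), and the toy tensors are hypothetical stand-ins for real data loaders:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Hypothetical stand-ins for real data (replace with your own loaders).
unlabeled = torch.rand(32, 3, 64, 64)        # surrogate-task data, no labels
labeled_x = torch.rand(8, 3, 64, 64)         # small labeled downstream set
labeled_y = torch.randint(0, 10, (8,))

backbone = resnet18(weights=None)
backbone.fc = nn.Identity()                  # keep only the 512-d feature extractor

# --- Surrogate task: predict which of 4 rotations was applied ---
rot_head = nn.Linear(512, 4)
opt = torch.optim.Adam(list(backbone.parameters()) + list(rot_head.parameters()))
k = torch.randint(0, 4, (unlabeled.size(0),))          # self-generated labels
rotated = torch.stack([torch.rot90(x, int(r), dims=(1, 2))
                       for x, r in zip(unlabeled, k)])
loss = nn.functional.cross_entropy(rot_head(backbone(rotated)), k)
opt.zero_grad()
loss.backward()
opt.step()

# --- Downstream task: fine-tune a classifier on the few labeled samples ---
clf = nn.Linear(512, 10)
opt2 = torch.optim.Adam(list(backbone.parameters()) + list(clf.parameters()), lr=1e-4)
loss2 = nn.functional.cross_entropy(clf(backbone(labeled_x)), labeled_y)
opt2.zero_grad()
loss2.backward()
opt2.step()
```

In practice each stage would run for many epochs over a real dataset; the point is that the backbone trained on the surrogate task is reused, not reinitialized, for the downstream task.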

Many different training approaches have been proposed to learn such representations. **Relative position [1]:** the model needs to understand the spatial context of objects to tell the relative position between parts. **Jigsaw puzzle [2]:** the model needs to place 9 shuffled patches back in their original locations. **Colorization [3]:** the model is trained to color a grayscale input image; precisely, the task is to map this image to a distribution over quantized color values. **Counting features [4]:** the model learns a feature encoder using the feature-counting relationships of input images transformed by _scaling_ and _tiling_. **SimCLR [5]:** the model learns representations for visual inputs by maximizing agreement between differently augmented views of the same sample via a contrastive loss in the latent space.
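
As an example of how little labeling machinery these tasks need, here is a minimal sketch of the data preparation for the jigsaw task, cutting an image into a 3×3 grid and shuffling the patches (PyTorch; the helper name is hypothetical):

```python
import torch

def make_jigsaw_puzzle(image, grid=3):
    """Cut an image (C, H, W) into grid x grid patches and shuffle them.

    Returns the shuffled patches and the permutation, which serves as the
    self-generated label the model must recover.
    """
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    patches = [image[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw]
               for i in range(grid) for j in range(grid)]
    perm = torch.randperm(grid * grid)
    shuffled = torch.stack([patches[p] for p in perm])
    return shuffled, perm  # model input and target permutation

# Usage:
patches, perm = make_jigsaw_puzzle(torch.rand(3, 96, 96))
```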

However, I would like to introduce one interesting approach that is able to recognize things the way a human does. A key factor in human learning is the acquisition of new knowledge by comparing related and different entities. So it is a nontrivial but promising step to apply a similar mechanism in self-supervised machine learning, via the relational reasoning approach [6].

The relational reasoning paradigm is based on a key design principle: a relation network is used as a learnable function on the unlabeled dataset to quantify the relationships between views of the same object (intra-reasoning) and between different objects in different scenes (inter-reasoning). The approach was evaluated on standard datasets (CIFAR-10, CIFAR-100, CIFAR-100-20, STL-10, tiny-ImageNet, SlimageNet), across learning schedules and backbones (both shallow and deep). The results reported in [6] show that relational reasoning outperforms the best competitor in all conditions by an average of 14% accuracy, and the most recent state-of-the-art method by 3%.
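
The following is a simplified sketch of that objective, not the paper's exact formulation: a small relation head scores concatenated pairs of representations, with pairs built from two views of the same image labeled 1 (intra-reasoning) and mismatched pairs labeled 0 (inter-reasoning):

```python
import torch
import torch.nn as nn

feat_dim = 128
relation_head = nn.Sequential(
    nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

def relational_loss(z1, z2):
    """z1, z2: (N, D) representations of two augmented views of N images."""
    n = z1.size(0)
    pos = torch.cat([z1, z2], dim=1)                     # intra-reasoning pairs
    neg = torch.cat([z1, z2[torch.randperm(n)]], dim=1)  # inter-reasoning pairs
    # Note: a real implementation would exclude accidental self-pairs in `neg`
    # and aggregate more than two views per image.
    pairs = torch.cat([pos, neg], dim=0)
    targets = torch.cat([torch.ones(n, 1), torch.zeros(n, 1)], dim=0)
    return nn.functional.binary_cross_entropy_with_logits(
        relation_head(pairs), targets)

# Usage with random stand-in features:
loss = relational_loss(torch.randn(16, feat_dim), torch.randn(16, feat_dim))
```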

#deep-learning #computer-vision #machine-learning #self-supervised-learning #data-science

Self-supervised Representation Learning in Computer Vision — Part 2

Part 1 of the series looked at representation learning and how self-supervised learning can alleviate the problem of data inefficiency in learning representations of images.

This is achieved through “contrastive learning”, a design paradigm used to learn similarities and distinctions. It boils down to making a model understand that similar things should be close together in representation space, and dissimilar things should be far away from each other.

In this article, I will review the following two architectures:

  1. Momentum Contrast for Unsupervised Visual Representation Learning (MoCo)
  2. A Simple Framework for Contrastive Learning of Visual Representations (SimCLR)

I plan to cover the core concepts and some of the observations that I found interesting. For more details, please refer to the original papers.


Contrastive Loss (InfoNCE)

We can theorize that contrastive learning requires a loss that rewards similarity to positive samples and penalizes similarity to negative samples. Let us assume the following:

  1. **Query (q)** — the augmented image of interest
  2. **Positive sample (k₊)** — samples similar to the query
  3. **Negative sample (k₋)** — samples dissimilar to the query

Van den Oord et al. proposed a loss based on noise contrastive estimation, called InfoNCE, which, in the context of this paradigm, is:

L_q = −log [ exp(q·k₊ / τ) / Σᵢ₌₀ᴷ exp(q·kᵢ / τ) ]

InfoNCE loss, where the sum runs over the single positive key and the K negative keys, and τ is a temperature hyperparameter.

Here, we consider the encoded query vector as **q**, and the dictionary containing encoded **keys** as {k₀, k₁, …}, having one positive and K negative samples with respect to the query.

We are looking at the softmax loss of a (K+1)-way classifier, where our objective is to classify q as k₊.

The value of the loss is minimized if q is as close to k₊ as possible, while simultaneously being far away from the other K negative vectors in the dictionary.
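
This objective is compact to implement. Here is a minimal PyTorch sketch, assuming L2-normalized vectors and a temperature hyperparameter τ (the function and tensor names are mine, not from the papers):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_negs, tau=0.07):
    """InfoNCE as a (K+1)-way softmax classification of q as k_pos.

    q:      (N, D) encoded queries, L2-normalized
    k_pos:  (N, D) the positive key for each query
    k_negs: (K, D) dictionary of negative keys, shared across queries
    """
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)  # (N, 1) positive logits
    l_neg = q @ k_negs.t()                        # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # The "correct class" is always index 0, i.e. the positive key.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

# Usage with normalized random vectors standing in for encoder outputs:
q = F.normalize(torch.randn(8, 128), dim=1)
k_pos = F.normalize(torch.randn(8, 128), dim=1)
k_negs = F.normalize(torch.randn(4096, 128), dim=1)
loss = info_nce_loss(q, k_pos, k_negs)
```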


Earlier Contrastive Learning Architectures

[Figure: contrastive learning using end-to-end update]

One school of thought is to use an **end-to-end, backpropagation-based** approach with two encoders: one generating the _query_ of interest, **q**, and another computing the dictionary **keys**, **k** (which are taken from the current mini-batch).

Although this would work in practice, the algorithm is gated by GPU memory, as the dictionary is essentially the mini-batch; a sketch of one such step follows.
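
Here is a sketch of one end-to-end step, with toy linear encoders standing in for real networks. Note that the number of negatives is exactly the mini-batch size, which is what ties the dictionary to GPU memory:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for real encoders and data.
encoder_q = nn.Linear(32, 128)                     # query encoder
encoder_k = nn.Linear(32, 128)                     # key encoder
batch = torch.randn(256, 32)
augment = lambda x: x + 0.1 * torch.randn_like(x)  # toy augmentation

q = F.normalize(encoder_q(augment(batch)), dim=1)
k = F.normalize(encoder_k(augment(batch)), dim=1)
# For query i, key i is its positive; every other key in the mini-batch is a
# negative, so the dictionary size equals the batch size.
logits = q @ k.t() / 0.07                 # pairwise similarities / temperature
labels = torch.arange(q.size(0))          # positives sit on the diagonal
loss = F.cross_entropy(logits, labels)
loss.backward()                           # gradients flow into BOTH encoders
```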

#self-supervised-learning #deep-learning #computer-vision #deep learning

Larry Kessler

Self-Supervised Representation Learning from Wearable Data in Federated Setting

Smartphones, wearables, and Internet of Things (IoT) devices produce a wealth of data that cannot be accumulated in a centralized repository for learning supervised models due to privacy, bandwidth limitations, and the prohibitive cost of annotations. Federated learning provides a compelling framework for learning models from decentralized data, but conventionally, it assumes the availability of labeled samples, whereas on-device data are generally either unlabeled or cannot be annotated readily through user interaction.
To address these issues, a self-supervised approach termed scalogram-signal correspondence learning [1], based on the wavelet transform, is proposed. It learns useful representations from unlabeled sensor inputs such as electroencephalography, blood volume pulse, accelerometer, and WiFi channel state information.
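
To make the signal-processing side concrete, here is a minimal sketch of turning a raw 1-D sensor signal into a scalogram with the continuous wavelet transform, using PyWavelets; the sampling rate and wavelet choice are assumptions, not details from the paper:

```python
import numpy as np
import pywt

fs = 50.0                                  # assumed sampling rate in Hz
t = np.arange(0, 10, 1 / fs)
# Toy stand-in for a real sensor channel (e.g. one accelerometer axis).
signal = np.sin(2 * np.pi * 1.5 * t) + 0.3 * np.random.randn(t.size)

scales = np.arange(1, 65)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)
scalogram = np.abs(coeffs)                 # (64, len(signal)) time-frequency map
# The pretext task then trains a model to judge whether a (signal, scalogram)
# pair actually correspond, yielding labels for free from unlabeled data.
```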

#wearables #deep-learning #federated-learning #self-supervised-learning