A Multigrid Method for Efficiently Training Video Models


3D convolutional neural networks (CNNs) are state-of-the-art among deep learning models for video. However, they are extremely slow to train.

The Problem


The mini-batch shape for training a deep learning model for videos is defined by:

  • the number of clips
  • the number of frames per clip
  • the spatial size per frame.

The state-of-the-art video models generally use a large mini-batch shape for the sake of accuracy. But this is precisely what makes these models so slow.

Drawing on research from a fancy field called numerical analysis (where multigrid methods were originally developed for solving differential equations), Wu, Girshick, He, Feichtenhofer & Krähenbühl (2020) introduced a “multigrid method for efficiently training video models.” In essence, this method is a way to retain the accuracy of these video models, whilst saving training time.

The Solution

The multigrid method exploits the trade-off between large space and time dimensions (so the number of frames & the size per frame) and the number of clips per mini-batch.

Before we get into the details, it intuitively makes sense that we can start with large mini-batches of smaller-dimension time and space data (coarse learning), and then switch to smaller mini-batches of more granular time and space data (fine learning).
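This coarse-to-fine intuition can be sketched as a phased training schedule. The shapes and epoch counts below are hypothetical, chosen only for illustration (the actual paper alternates grids in a more elaborate cycle):

```python
# A minimal sketch of coarse-to-fine training. Each schedule entry is
# (epochs, batch, frames, height, width); note every shape has the same
# total number of pixels per mini-batch, so per-step cost stays roughly flat.
schedule = [
    (30, 128, 4, 112, 112),   # coarse phase: big batches, short small clips
    (15, 64, 8, 112, 112),    # intermediate phase
    (5, 8, 16, 224, 224),     # fine phase: the original mini-batch shape
]

def train(schedule):
    steps = []
    for epochs, b, t, h, w in schedule:
        for _ in range(epochs):
            # in a real trainer: sample a mini-batch of shape (b, t, h, w),
            # resample clips onto the current grid, and take a gradient step
            steps.append((b, t, h, w))
    return steps

steps = train(schedule)
print(len(steps), steps[0], steps[-1])
```

Because the coarse phases use many more clips per mini-batch, the model sees far more clips per unit of compute early in training.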

Drawing on this intuition, Wu et al (2020) tried to answer two questions:

(i) is there a set of grids [spatial and temporal] with a grid schedule that can lead to faster training without a loss in accuracy?

(ii) if so, does it robustly generalize to new models and datasets without modification?

Formalizing the Intended Solution

First, let’s declare the problem that we are trying to solve in an equation:

b × t × h × w = B × T × H × W

So on the left, we are taking our newly scaled batch size, b, time, t, height, h, and width, w, and on the right, we have the original batch size, B, time, T, height, H, and width, W. Note that the scaled values on the left will be determined according to a schedule as training progresses, rather than being fixed throughout the training process like the values on the right would be — that is what makes this approach special.

You’ll realize that if we are trying to speed up the training process, then our scaled b should, on average, be larger than B. This means that we cover more clips per iteration by using the scaled/scheduled approach, but we also want to make sure we don’t diminish accuracy (which is improved by larger t, h, and w).
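The constraint above can be solved directly for the scaled batch size b. The baseline shape below is hypothetical, just to show that b grows as the space and time dimensions shrink, so b is larger than B on average over a schedule:

```python
# Given the constraint b*t*h*w = B*T*H*W, solve for the scaled batch size b.
# Baseline shape (B, T, H, W) is illustrative only.
B, T, H, W = 8, 16, 224, 224

def batch_for_grid(t, h, w):
    """Largest batch size that keeps per-iteration compute constant."""
    return (B * T * H * W) // (t * h * w)

grids = [(4, 112, 112), (8, 112, 112), (16, 224, 224)]
batches = [batch_for_grid(*g) for g in grids]
print(batches)                       # [128, 64, 8]
print(sum(batches) / len(batches))   # average b is well above B = 8
```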

Although controlling b, that is, the number of clips per mini-batch, might seem fairly easy, you might be wondering how we can control the sizes of t, h, and w.

We control t, h, and w by using the concept of a sampling grid. A sampling grid consists of two values: a span and a stride. According to Wu et al (2020),

“[t]he span is the support size of the grid and defines the duration or area that the grid covers. The stride is the spacing between sampling points.”

Both space (defined by h and w) and time (t) dimensions can be resampled to smaller sizes using the sampling grid.
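On the time axis, a grid is easy to picture: the span sets how many source frames the grid covers, and the stride sets the spacing between sampled frames. A small sketch (the function and frame counts are my own illustration, not from the paper):

```python
# Sampling a temporal grid: pick every `stride`-th frame inside a window
# of `span` frames, starting at `start`.
def sample_time(frames, span, stride, start=0):
    return [frames[i] for i in range(start, start + span, stride)]

frames = list(range(64))                         # 64 source frame indices
print(sample_time(frames, span=32, stride=4))    # 8 frames covering half the video
print(sample_time(frames, span=64, stride=8))    # 8 frames covering the whole video
```

Note how both grids yield the same number of frames (the same t) while covering different durations of the video.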

The resampling process requires an operator. Wu et al (2020) describe an example operator: “a reconstruction filter applied to the source discrete signal followed by computing the values at the points specified by the grid (e.g., bilinear interpolation).”
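To make the bilinear example concrete, here is a minimal bilinear resampler for a single frame, written from scratch in NumPy. This is one possible realization of such an operator; a real pipeline would use a library resize instead:

```python
import numpy as np

# Minimal bilinear resampling of a single frame from (H, W) to (out_h, out_w):
# place a uniform grid of sample points over the source and blend the four
# nearest pixels at each point.
def bilinear_resize(img, out_h, out_w):
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)         # sample-point rows
    xs = np.linspace(0, in_w - 1, out_w)         # sample-point columns
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                      # vertical blend weights
    wx = (xs - x0)[None, :]                      # horizontal blend weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

frame = np.arange(16, dtype=float).reshape(4, 4)
small = bilinear_resize(frame, 2, 2)
print(small.shape)   # (2, 2)
```

The same idea extends to the time axis (trilinear interpolation over t, h, w) when a clip is resampled onto a coarser grid.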

In addition, for a multigrid method to work, the baseline model that we use must be “compatible with inputs that are resampled on different grids, and therefore might have different shapes during training.” In other words, we can’t use the multigrid method with models that require a fixed input shape during training. As noted by Wu et al (2020), models composed of convolutions, recurrence, and self-attention are supported by the multigrid method, but fully-connected layers are not (unless their inputs are pooled to a fixed size). This is not a very restrictive rule, which is good. :)
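The pooling escape hatch is worth seeing: global average pooling collapses a feature map of any (t, h, w) into a fixed-length vector, so a final fully-connected classifier never notices the grid changing. A toy NumPy illustration (shapes are hypothetical):

```python
import numpy as np

# Global average pooling: (channels, T, H, W) -> (channels,), regardless of
# the spatial/temporal grid, so a fixed-size classifier head can follow.
def pooled_features(feature_map):
    return feature_map.mean(axis=(1, 2, 3))

coarse = np.random.rand(64, 4, 56, 56)      # features from a coarse grid
fine = np.random.rand(64, 16, 112, 112)     # features from the fine grid
print(pooled_features(coarse).shape, pooled_features(fine).shape)  # (64,) (64,)
```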

