Is MLP Better Than CNN & Transformers For Computer Vision?



Earlier this month, Google researchers released MLP-Mixer, a computer vision architecture based exclusively on multi-layer perceptrons (MLPs). The MLP-Mixer code is now available on GitHub.


MLPs are used for machine learning problems involving tabular data, as well as for classification and regression. Alongside convolutional neural networks (CNNs) and attention-based networks (transformers), researchers and developers also use MLPs in image processing.

In a recent paper, Google has introduced MLP-Mixer.

“While convolutions and attention are both sufficient for good performance, neither of them are necessary,” reads the paper, “MLP-Mixer: An all-MLP Architecture for Vision”, co-authored by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Thomas Unterthiner, Lucas Beyer, Xiaohua Zhai, Jessica Yung, Daniel Keysers, Mario Lucic, Jakob Uszkoreit and Alexey Dosovitskiy.

Interestingly, the new model achieves results similar to those of state-of-the-art models trained on large datasets, at almost 3x the speed. “When trained on large datasets, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models,” claimed Google AI.

Architecture for computer vision 

MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (‘mixing’ the per-location features), and one with MLPs applied across patches (‘mixing’ spatial information).

The macro-structure of Mixer comprises per-patch linear embeddings, Mixer layers and a classifier head. Each Mixer layer contains one token-mixing MLP and one channel-mixing MLP, each consisting of two fully connected layers and a GELU nonlinearity. Other components include skip-connections, layer norm on the channels, dropout, and a linear classifier head.
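The classifier head can be sketched in a few lines of NumPy. This is only an illustration: the article says just "linear classifier head"; the global average pooling step before it follows the paper's description, and the function name, sizes and random weights are assumptions made up for the example.

```python
import numpy as np

def classifier_head(features, weights, bias):
    """Hypothetical head: average over patches, then one linear layer.

    features: (num_patches, channels) output of the last Mixer layer
    weights:  (channels, num_classes)
    bias:     (num_classes,)
    """
    pooled = features.mean(axis=0)   # global average pooling over patches
    return pooled @ weights + bias   # one logit per class

# Illustrative sizes only: 16 patches, 64 channels, 10 classes.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 64))
weights = rng.standard_normal((64, 10)) * 0.02
bias = np.zeros(10)

logits = classifier_head(features, weights, bias)
print(logits.shape)
```

Because the pooling averages over all patches, the head does not depend on the order of the patches.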

The model accepts a sequence of linearly projected image patches (also called tokens), arranged as a ‘patches × channels’ table, as input and maintains this dimensionality throughout. Mixer uses two types of MLP layers: channel-mixing MLPs and token-mixing MLPs.
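As a rough illustration of the input stage, the sketch below cuts an image into non-overlapping patches and projects every patch with the same matrix, giving a patches × channels array. The helper name `patchify`, the image size, the patch size and the random projection are assumptions chosen for the example, not values taken from the article.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # illustrative image
patch_size, channels = 8, 64               # illustrative sizes

# One projection matrix shared by all patches.
projection = rng.standard_normal((patch_size * patch_size * 3, channels)) * 0.02

tokens = patchify(image, patch_size) @ projection
print(tokens.shape)  # (16, 64): 16 patches, 64 channels each
```

Each row of `tokens` is one patch and each column is one channel, which is the table the following paragraphs refer to.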

Channel-mixing MLPs allow communication between different channels; they operate on each token independently and take individual rows of the table as inputs. Similarly, the token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently and take individual columns of the table as inputs. The two types of layers are interleaved to enable interaction of both input dimensions.
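A single Mixer layer, as described above, might look roughly like this in NumPy. It is a minimal sketch under several assumptions: dropout and the learnable layer-norm scale and shift are left out, GELU uses the common tanh approximation, and the helper names, sizes and random initialisation are invented for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalise over the channel axis; learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    # Two fully connected layers with a GELU in between.
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, dim, hidden):
    return (rng.standard_normal((dim, hidden)) * 0.02, np.zeros(hidden),
            rng.standard_normal((hidden, dim)) * 0.02, np.zeros(dim))

def mixer_layer(x, token_params, channel_params):
    """x is a (patches, channels) table; the output has the same shape."""
    # Token mixing: transpose so the MLP runs along each column (across
    # patches), with the same weights shared by every channel.
    x = x + mlp(layer_norm(x).T, *token_params).T   # skip connection
    # Channel mixing: the MLP runs along each row (across channels),
    # with the same weights shared by every patch.
    return x + mlp(layer_norm(x), *channel_params)  # skip connection

rng = np.random.default_rng(0)
patches, channels = 16, 64                       # illustrative sizes
token_params = init_mlp(rng, patches, 32)        # hidden width is arbitrary
channel_params = init_mlp(rng, channels, 128)    # hidden width is arbitrary

x = rng.standard_normal((patches, channels))
y = mixer_layer(x, token_params, channel_params)
print(y.shape)  # (16, 64): same table shape in and out
```

The transpose is the only difference between the two halves: the same kind of MLP is applied once along columns and once along rows.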

MLP-Mixer vs CNN vs vision transformers

“In the extreme situation, our architecture can be seen as a unique CNN, which uses (1×1) convolutions for channel mixing, and single-channel depth-wise convolutions for token mixing. However, the converse is not true as CNNs are not special cases of Mixer,” explained Google AI. 
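The channel-mixing half of that comparison can be checked numerically. The sketch below, which uses a deliberately naive convolution written for this example (not a library routine) and arbitrary sizes, shows that a 1×1 convolution produces the same result as multiplying every location's channel vector by one shared matrix. It covers only a single linear layer without bias; the channel-mixing MLP stacks two such layers with a GELU in between.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Naive 'valid' convolution (cross-correlation form).

    x:      (H, W, C_in)
    kernel: (kh, kw, C_in, C_out)
    """
    h, w, _ = x.shape
    kh, kw, _, c_out = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i:i + kh, j:j + kw, :]
            # Sum over the window's height, width and input channels.
            out[i, j] = np.tensordot(window, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3))   # illustrative feature map
W = rng.standard_normal((3, 5))      # one shared channel-mixing matrix

# The same weights used as a 1x1 convolution kernel ...
conv_out = conv2d_valid(x, W[None, None, :, :])
# ... and as a plain matrix multiplication applied to every location.
dense_out = (x.reshape(-1, 3) @ W).reshape(4, 4, 5)

print(np.allclose(conv_out, dense_out))
```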

Convolution is more complex than the plain matrix multiplication in MLPs, as it requires an additional costly reduction to matrix multiplication or a specialised implementation.



Today, image processing networks typically mix features at a given location, between multiple locations, or both. For instance, in CNNs both kinds of mixing happen through convolutions and pooling, while vision transformers perform them with self-attention.

MLP-Mixer, on the other hand, attempts to do both in a more ‘separate’ fashion and only using MLPs. The advantage of only using MLPs — essentially matrix multiplication — is the simplicity of the architecture and the computational speed. 

The computational complexity of MLP-Mixer is also linear in the number of input patches, unlike vision transformers, whose complexity is quadratic. In addition, the model uses skip connections and regularisation.
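A back-of-the-envelope count makes the scaling difference concrete. The functions below are simplified assumptions, not figures from the paper: they count only the multiply-accumulates of the token-mixing MLP (whose hidden width is fixed independently of the number of patches) and of the two patch-by-patch products in self-attention, ignoring projections, channel mixing and everything else. The sizes are illustrative.

```python
def token_mixing_macs(num_patches, channels, token_hidden):
    # Two matrix products of size (patches x hidden), repeated per channel.
    return 2 * num_patches * token_hidden * channels

def self_attention_macs(num_patches, channels):
    # Query-key scores plus the weighted sum of values: both scale with
    # patches x patches.
    return 2 * num_patches * num_patches * channels

channels, token_hidden = 768, 384   # illustrative sizes
for num_patches in (196, 392, 784):
    print(num_patches,
          token_mixing_macs(num_patches, channels, token_hidden),
          self_attention_macs(num_patches, channels))
```

Under these assumptions, doubling the number of patches doubles the token-mixing cost but quadruples the self-attention cost.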

The advantages of MLP-Mixer include:

  • All layers have an identical size
  • Each layer consists of only two MLP blocks
  • Each layer takes an input of the same size
  • All image patches are projected linearly with the same projection matrix
