A Practical Demonstration of Using Vision Transformers in PyTorch

In this article, I will give a hands-on example (with code) of how to apply the Vision Transformer to a practical computer vision task using the popular PyTorch framework. The model was proposed in the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (which I reviewed in another post).

To do that, we will look at the problem of handwritten digit recognition using the well-known MNIST dataset.

Examples of MNIST handwritten digits generated using Pyplot

One caveat up front. I chose the MNIST dataset for this demonstration because it is simple enough that a model can be trained on it from scratch and used for predictions within minutes, not hours or days, without any specialized hardware. Literally anyone with a computer can run it and see how it works. I have not spent much effort optimizing the hyperparameters of the model, and achieving state-of-the-art accuracy (currently around 99.8% for this dataset) was never the goal of this approach.

In fact, while I will show that the Vision Transformer can attain a respectable 98%+ accuracy on MNIST, it can be argued that it is not the best tool for this job. Since each image in this dataset is small (just 28x28 pixels) and contains a single object, global attention has little long-range structure to exploit and is therefore of limited utility. I might write another post later to examine how this model can be used on a bigger dataset with larger images and a greater variety of classes. For now, I just want to show how it works.
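
To give a feel for the moving parts, here is a minimal sketch of a Vision Transformer sized for 28x28 grayscale images. It is an illustration under stated assumptions, not the exact model from this article: the class name `TinyViT`, the patch size of 4 (giving 7 x 7 = 49 patches), and the width, depth, and head count are all hypothetical choices, and the input batch is random noise standing in for MNIST digits. It relies only on standard PyTorch modules (`nn.Conv2d`, `nn.TransformerEncoderLayer`, `nn.TransformerEncoder`).

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Minimal Vision Transformer sketch for 28x28 single-channel images.

    All sizes below are illustrative assumptions, not tuned values.
    """

    def __init__(self, image_size=28, patch_size=4, dim=64, depth=2,
                 heads=4, mlp_dim=128, num_classes=10):
        super().__init__()
        assert image_size % patch_size == 0, "image must divide into patches"
        self.num_patches = (image_size // patch_size) ** 2  # 7 * 7 = 49
        # A strided convolution cuts the image into non-overlapping patches
        # and linearly projects each patch to a `dim`-sized token.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Learnable class token and position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(
            torch.randn(1, self.num_patches + 1, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # x: (batch, 1, 28, 28) -> tokens: (batch, 49, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        # Prepend the class token, then add position information.
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        # Every token attends to every other token (global attention).
        encoded = self.encoder(tokens)
        # Classify from the class token only.
        return self.head(encoded[:, 0])


model = TinyViT()
images = torch.randn(8, 1, 28, 28)  # random stand-in for a batch of digits
logits = model(images)
print(logits.shape)  # torch.Size([8, 10])
```

With a patch size of 4, each image becomes a sequence of only 50 tokens (49 patches plus the class token), so the attention matrix in every layer is just 50 x 50. That small sequence length is one way to see why global attention has limited room to help on images this size.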
