Distributed training of Deep Learning models with PyTorch

Distributed training of Deep Learning models with PyTorch

Distributed training of Deep Learning models with PyTorch - The motive of this article is to demonstrate the idea of Distributed Computing in the context of training large scale Deep Learning (DL) models.

Originally published by Ayan Das at  medium.com

In particular, the article first presents the basic concepts of distributed computing and how it fits into the idea of Deep learning. Then it moves on to listing the standard requirements (hardware and software) for setting up an environment capable of handling distributed applications. Finally, to provide a hands-on experience, it demonstrates a specific distributed algorithm (namely Synchronous SGD) for training DL models from a theoretical as well as implementation perspective.

What is Distributed computing?

Distributed computing refers to the way of writing a program that makes use several distinct components connected over network. Typically, large scale computation is achieved by such an arrangement of computers capable of handling high density numeric computations in parallel. In distributed computing terminology, these computers are often referred to as nodes and a collection of such nodes form a cluster over the network. These nodes are usually connected via Ethernet, but other high-bandwidth networks are also used to take full advantage of the distributed architecture.

How does Deep learning benefit from distributed computing?

Although Neural Networks, the main workhorse of DL, has been in the literature from quite a while, nobody could utilise its full potential until recently. One of the primary reasons for the sudden boost in its popularity has something to with massive computational power, the very idea we are trying to address in this article. Deep learning requires training Deep neural networks (DNN) with massive number of parameters on a huge amount of data. Distributed computing is a perfect tool to take advantage of the modern hardware to its fullest. Here is the core idea:

A properly crafted distributed algorithm can:

  1. Distribute” computation (forward and backward pass of a DL model) along with data across multiple nodes for coherent processing.
  2. It can then establish an effective “Synchronization” among the nodes to achieve consistency.

MPI: A distributed computing standard

One more terminology you have to get used to — Message Passing Interface (MPI). MPI is the workhorse of almost all of distributed computing. MPI is an open standard that defines a set of rules on how the nodes will talk to each other over network and also a programming model/API. MPI is not a software or tool, it’s a specification. A group of individuals, organizations from academia and industry came forward in the summer of 1991 which eventually led to the creation of MPI Forum. The forum, with a consensus, crafted a syntactic and semantic specification of a library that is to be served as a guideline for different hardware vendors to come up with portable/ flexible/optimized implementations. Several hardware vendors have their own implementation of MPI — “OpenMPI”, “MPICH”, “MVAPICH”, “Intel MPI” and lot more.

In this tutorial, we are going to use Intel MPI as it is very performant and also optimized for Intel platforms. Original Intel MPI is a C library and very low level in nature.

The Setup

Proper setup of a distributed system is very important. Without proper hardware and network arrangements, it’s pretty much useless even if one has conceptual understanding of it’s programming model. Below are the key arrangements need to be made:

  1. A set of nodes connected in a common network forming a cluster is typically required. It is recommended to have high-end servers as nodes and high-bandwidth network like InfiniBand.
  2. Linux systems with user accounts of exact same name are required on all the nodes in the cluster.
  3. Nodes must have password-less SSH connectivity among them. This is very crucial for seamless connectivity.
  4. An MPI implementation must be installed. This tutorial focuses on Intel MPI only.
  5. common filesystem is required which is visible from all the nodes and the distributed applications must reside on it. Network Filesystem (NFS) is one way to achieve this.

Types of parallelization strategies

There are two popular ways of parallelizing Deep learning models:

  1. Model parallelism
  2. Data parallelism

Model parallelism

Model parallelism refers to a model being logically split into several parts (i.e., some layers in one part and some in other), then placing them on different hardware/devices. Although placing the parts on different devices does have benefits in terms of execution time (asynchronous processing of data), it is usually employed to avoid memory constraints. Models with very large number of parameters, which are difficult fit into a single system due to high memory footprint, benefits from this type of strategy.

Data parallelism

Data parallelism, on the other hand, refers to processing multiple pieces (technically batches) of data through multiple replicas of the same network located on different hardware/devices. Unlike model parallelism, each replica may be an entire network and not just a part of it. This strategy, as you might

have guessed, can scale up well with increasing amount of data. But, as the entire network has to reside on a single device, it cannot help models with high memory footprints. The illustration below should make it clear.

Practically, Data parallelism is more popular and frequently employed in large organizations for executing production quality DL training algorithms. So, in this tutorial, we will fix our focus on data parallelism.

The "torch.distributed" API

PyTorch offers a very elegant and easy-to-use API as an interface to the underlying MPI library written in C. PyTorch needs to be compiled from source and must be linked against the Intel MPI installed on the system. We will now see the basic usage of torch.distributed and how to execute it.

# filename 'ptdist.py'
import torch
import torch.distributed as dist
def main(rank, world):
    if rank == 0:
        x = torch.tensor([1., -1.]) # Tensor of interest
        dist.send(x, dst=1)
        print('Rank-0 has sent the following tensor to Rank-1')
        z = torch.tensor([0., 0.]) # A holder for recieving the tensor
        dist.recv(z, src=0)
        print('Rank-1 has recieved the following tensor from Rank-0')

if name == 'main':
main(dist.get_rank(), dist.get_world_size())

Executing the above code using mpiexec, a distributed process scheduler comes with any standard MPI implementation, results in:

[email protected]:~/nfs$ mpiexec -n 2 -ppn 1 -hosts miriad2a,miriad2b python ptdist.py
Rank-0 has sent the following tensor to Rank-1
tensor([ 1., -1.])
Rank-1 has recieved the following tensor from Rank-0
tensor([ 1., -1.])
  1. The first line to be executed is dist.init_process_group(backend) which basically sets up the internal communication channel among the participating nodes. It takes an argument to specify which backend to use. As we are using MPI throughout, its backend=’mpi’ in our case. There are other backends as well (like “TCP”, “Gloo” and “NCCL”).
  2. Two parameters need to be retrieved — the world size and rank.
  3. World” refers to the collection of all nodes that have been specified in a particular context of mpiexecinvocation (see the -hosts flag in mpiexec).
  4. Rank” is a unique integer assigned by the MPI runtime to each of the processes. It starts from 0. The order in which they are specified in the argument of -hosts is used to assign the numbers. So, in this case, the process on node “miriad2a” will be assigned Rank 0 and “miriad2b” will be Rank 1.
  5. x is a tensor that Rank 0 intends to send to Rank 1. It does so by dist.send(x, dst=1).
  6. z is something that Rank 1 created before receiving the tensor. We need an already created tensor of same shape as a holder for receiving the incoming tensor. The values of z will eventually be replaced by the value of x.
  7. Just like dist.send(..), the receiving counterpart is dist.recv(z, src=0) which receives the tensor into z.

Communication collectives

What we saw in the last section is an example of “peer-to-peer” communication where rank(s) send data to specific rank(s) in a given context. Although this is useful as it provides user with granular control over the communication, there exist other standard and frequently used patterns of communication called collectives. Below is the description of one particular collective (known as all-reduce) which is of interest to us in the context of Synchronous SGDalgorithm.

The “All-reduce” collective

All-reduce is a way of synchronized communication where a given reduction operation is operated on all the ranks and the reduced result is made available to all of them. The below figure illustrates the idea (uses summation as the reduction operation).

def main(rank, world):
if rank == 0:
x = torch.tensor([1.])
elif rank == 1:
x = torch.tensor([2.])
elif rank == 2:
x = torch.tensor([-3.])

dist.all_reduce(x, op=dist.reduce_op.SUM)
print('Rank {} has {}'.format(rank, x))

if name == 'main':
main(dist.get_rank(), dist.get_world_size())

When launched in a world of 3, results in

[email protected]:~/nfs$ mpiexec -n 3 -ppn 1 -hosts miriad2a,miriad2b,miriad2c python ptdist.py
Rank 1 has tensor([0.])
Rank 0 has tensor([0.])
Rank 2 has tensor([0.])
  1. The if rank == <some rank> … elif is a pattern we encounter again and again in distributed computing. In this case, it is used to create different tensors on different ranks.
  2. They all execute an all-reduce together (see that dist.all_reduce(..) is outside if … elif block) with summation (dist.reduce_op.SUM) as reduction operation.
  3. x from every rank is summed up and the summation is placed inside the same x of every rank.

Moving on to Deep learning:

It is assumed that the reader is familiar with the standard Stochastic Gradient Descent (SGD) algorithm which is often used to train deep learning models. We will now see a variant of SGD (called Synchronous SGD) that makes use of the All-reduce collective to scale up. To lay the foundation, let’s start with the mathematical formulation of standard SGD.

where D is a set (mini-batch) of samples, θ is the set of all parameters, λ is the learning rate and Loss(X, y) is some loss function averaged over all samples in D.

The core trick that Synchronous SGD relies on is splitting the summation in the update rule over smaller subsets of (mini)batches. D is split into R number of subsets D₁, D₂, . . (preferably with same number of samples in each) such that

Splitting the summation of standard SGD update formula leads to

Now, as the gradient operator is distributive over summation operator, we get

What do we get out of this?

Have a look at those individual gradient terms (inside square brackets) in the above equation. They can now be computed independently and summed up to get the original gradient without any loss/approximation. This is where the data parallelism comes into picture. Here is the whole story:

  1. Split the entire dataset into R equal chunks. The letter R is used to refer to Replica.
  2. Launch R processes/ranks using MPI and bind each process to one chunk of the dataset.
  3. Let each rank compute the gradient using a mini-batch (dᵣ) of size B from its own portion of data, i.e., rank rcomputes

4. Sum up all the gradients of all the ranks and make the resulting gradient available to all of them to proceed further.

The last point is exactly the all-reduce algorithm. So, all-reduce must be executed every time all ranks have computed one gradient (on a mini-batch of size B) on their own portion of the dataset. A subtle point to note here is that summing up the gradients (on mini-batches of size B) from all R ranks leads

to an effective batch size of

The following are the crucial parts of the implementation (the boilerplate codes are not shown)

model = LeNet()

first synchronization of initial weights

sync_initial_weights(model, rank, world_size)

optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.85)

for epoch in range(1, epochs + 1):
for data, target in train_loader:
output = model(data)
loss = F.nll_loss(output, target)

    # The all-reduce on gradients
    sync_gradients(model, rank, world_size)

def sync_initial_weights(model, rank, world_size):
for param in model.parameters():
if rank == 0:
# Rank 0 is sending it's own weight
# to all it's siblings (1 to world_size)
for sibling in range(1, world_size):
dist.send(param.data, dst=sibling)
# Siblings must recieve the parameters
dist.recv(param.data, src=0)

def sync_gradients(model, rank, world_size):
for param in model.parameters():
dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)

  1. All R ranks create their own copy/replica of the model with random weights.
  2. Individual replicas with random weights may lead to initial de-synchronization. It is preferable to synchronize the initial weights among all the replicas. The sync_initial_weights(..) routine does exactly that. Let any one of the ranks send its weights to its siblings and the siblings must receive them to initialize themselves with it.
  3. Fetch a mini-batch (of size B) from the respective portion of a rank and compute forward and backward pass (gradient). Important point to note here as a part of the setup, is all processes/ranks should have its own portion of data visible (usually on its own hard-disk OR on a shared Filesystem).
  4. Execute all-reduce collective on the gradients of each replica with summation as the reduction operation. The sync_gradients(..) routine does the gradient synchronization.
  5. After gradients have been synchronized, every replica can execute a standard SGD update on its own weights independently. The optimizer.step() does the job as usual.

Now a question might arise, “How do we ensure that independent updates will remain in sync?”.

If we take a look at the update equation for the first update

Point 2 & 4 above ensure that the initial weights and the gradients are synchronized individually. For obvious reason, a linear combination of them will also be in sync (λ is a constant). A similar logic holds for all consecutive updates.

Performance comparison:

The biggest bottleneck for any distributed algorithm is the synchronization. Distributed algorithms are beneficial only if the synchronization time is significantly less than computation time. Let’s have a simple comparison between the standard and synchronous SGD to see when is the later one beneficial.

Definitions. Let’s assume the size of the entire dataset is N. Mini-batches of size B are processed by the network which takes time Tcomp. In the distributed case, time taken for all-reduce synchronization is Tsync. If there are Rreplicas, time taken for one epoch

For non-distributed (standard) SGD:

For Synchronous SGD:

So, for the distributed setting to be significantly beneficial over non-distributed one, we need to have

OR, equivalently

The three factors contributing to the above inequality can be tweaked to extract more and more benefit out of the distributed algorithm.

  1. Tsync can be reduced by connecting the nodes over a high bandwidth (fast) network.
  2. Tcomp can be increased by increasing batch size B.
  3. R can be increased by connecting more nodes over the network and having more replicas.

Hopefully, the article was clear enough to convey the central idea of Distributed Computing in the context of Deep Learning. Although, Synchronous SGD is quite popular, there are other distributed algorithms which are also used quite frequently (like Asynchronous SGD and its variants). But, what is more important is to be able to think about deep learning methods in a parallel manner. Please realize that not all algorithms can be parallelized out-of-the-box; some require approximations to be made which break theoretical guarantees given by the original algorithms. It is up to the algorithm designer/implementer to tackle these approximations in an efficient way.

Originally published by Ayan Das at  medium.com


Thanks for reading :heart: If you liked this post, share it with all of your programming buddies! Follow me on Facebook | Twitter

Learn More

☞ Data Science, Deep Learning, & Machine Learning with Python

☞ Deep Learning A-Z™: Hands-On Artificial Neural Networks

☞ Machine Learning A-Z™: Hands-On Python & R In Data Science

☞ Deep Learning A-Z™: Hands-On Artificial Neural Networks

☞ Complete Guide to TensorFlow for Deep Learning with Python

☞ Data Science, Deep Learning, & Machine Learning with Python

☞ Deep Learning: Recurrent Neural Networks in Python

☞ Machine Learning & Tensorflow - Google Cloud Approach

☞ Python for Data Science and Machine Learning Bootcamp

☞ Machine Learning, Data Science and Deep Learning with Python

☞ [2019] Machine Learning Classification Bootcamp in Python

Implementing Deep Learning Papers - Deep Deterministic Policy Gradients (using Python)

Implementing Deep Learning Papers - Deep Deterministic Policy Gradients (using Python)

Implementing Deep Learning Papers - Deep Deterministic Policy Gradients (using Python)

In this intermediate deep learning tutorial, you will learn how to go from reading a paper on deep deterministic policy gradients to implementing the concepts in Tensorflow. This process can be applied to any deep learning paper, not just deep reinforcement learning.

In the second part, you will learn how to code a deep deterministic policy gradient (DDPG) agent using Python and PyTorch, to beat the continuous lunar lander environment (a classic machine learning problem).

DDPG combines the best of Deep Q Learning and Actor Critic Methods into an algorithm that can solve environments with continuous action spaces. We will have an actor network that learns the (deterministic) policy, coupled with a critic network to learn the action-value functions. We will make use of a replay buffer to maximize sample efficiency, as well as target networks to assist in algorithm convergence and stability.

Thanks for watching

If you liked this post, share it with all of your programming buddies!

Follow us on Facebook | Twitter

Further reading about Deep Learning and Python

Machine Learning A-Z™: Hands-On Python & R In Data Science

Python for Data Science and Machine Learning Bootcamp

Machine Learning, Data Science and Deep Learning with Python

Deep Learning A-Z™: Hands-On Artificial Neural Networks

Artificial Intelligence A-Z™: Learn How To Build An AI

A Complete Machine Learning Project Walk-Through in Python

Machine Learning: how to go from Zero to Hero

Top 18 Machine Learning Platforms For Developers

10 Amazing Articles On Python Programming And Machine Learning

100+ Basic Machine Learning Interview Questions and Answers

Machine Learning for Front-End Developers

Top 30 Python Libraries for Machine Learning

Learn Python Tutorial from Basic to Advance

Learn Python Tutorial from Basic to Advance

Basic programming concept in any language will help but not require to attend this tutorial

Become a Python Programmer and learn one of employer's most requested skills of 21st century!

This is the most comprehensive, yet straight-forward, course for the Python programming language on Simpliv! Whether you have never programmed before, already know basic syntax, or want to learn about the advanced features of Python, this course is for you! In this course we will teach you Python 3. (Note, we also provide older Python 2 notes in case you need them)

With over 40 lectures and more than 3 hours of video this comprehensive course leaves no stone unturned! This course includes tests, and homework assignments as well as 3 major projects to create a Python project portfolio!

This course will teach you Python in a practical manner, with every lecture comes a full coding screencast and a corresponding code notebook! Learn in whatever manner is best for you!

We will start by helping you get Python installed on your computer, regardless of your operating system, whether its Linux, MacOS, or Windows, we've got you covered!

We cover a wide variety of topics, including:

Command Line Basics
Installing Python
Running Python Code
Number Data Types
Print Formatting
Built-in Functions
Debugging and Error Handling
External Modules
Object Oriented Programming
File I/O
Web scrapping
Database Connection
Email sending
and much more!
Project that we will complete:

Guess the number
Guess the word using speech recognition
Love Calculator
google search in python
Image download from a link
Click and save image using openCV
Ludo game dice simulator
open wikipedia on command prompt
Password generator
QR code reader and generator
You will get lifetime access to over 40 lectures.

So what are you waiting for? Learn Python in a way that will advance your career and increase your knowledge, all in a fun and practical way!

Basic knowledge
Basic programming concept in any language will help but not require to attend this tutorial
What will you learn
Learn to use Python professionally, learning both Python 2 and Python 3!
Create games with Python, like Tic Tac Toe and Blackjack!
Learn advanced Python features, like the collections module and how to work with timestamps!
Learn to use Object Oriented Programming with classes!
Understand complex topics, like decorators.
Understand how to use both the pycharm and create .py files
Get an understanding of how to create GUIs in the pycharm!
Build a complete understanding of Python from the ground up!

Machine Learning, Data Science and Deep Learning with Python

Machine Learning, Data Science and Deep Learning with Python

Complete hands-on Machine Learning tutorial with Data Science, Tensorflow, Artificial Intelligence, and Neural Networks. Introducing Tensorflow, Using Tensorflow, Introducing Keras, Using Keras, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Learning Deep Learning, Machine Learning with Neural Networks, Deep Learning Tutorial with Python

Machine Learning, Data Science and Deep Learning with Python

Complete hands-on Machine Learning tutorial with Data Science, Tensorflow, Artificial Intelligence, and Neural Networks

Explore the full course on Udemy (special discount included in the link): http://learnstartup.net/p/BkS5nEmZg

In less than 3 hours, you can understand the theory behind modern artificial intelligence, and apply it with several hands-on examples. This is machine learning on steroids! Find out why everyone’s so excited about it and how it really works – and what modern AI can and cannot really do.

In this course, we will cover:
• Deep Learning Pre-requistes (gradient descent, autodiff, softmax)
• The History of Artificial Neural Networks
• Deep Learning in the Tensorflow Playground
• Deep Learning Details
• Introducing Tensorflow
• Using Tensorflow
• Introducing Keras
• Using Keras to Predict Political Parties
• Convolutional Neural Networks (CNNs)
• Using CNNs for Handwriting Recognition
• Recurrent Neural Networks (RNNs)
• Using a RNN for Sentiment Analysis
• The Ethics of Deep Learning
• Learning More about Deep Learning

At the end, you will have a final challenge to create your own deep learning / machine learning system to predict whether real mammogram results are benign or malignant, using your own artificial neural network you have learned to code from scratch with Python.

Separate the reality of modern AI from the hype – by learning about deep learning, well, deeply. You will need some familiarity with Python and linear algebra to follow along, but if you have that experience, you will find that neural networks are not as complicated as they sound. And how they actually work is quite elegant!

This is hands-on tutorial with real code you can download, study, and run yourself.