Distributed training of Deep Learning models with PyTorch

The motive of this article is to demonstrate the idea of distributed computing in the context of training large-scale Deep Learning (DL) models.

*Originally published by **Ayan Das** at **medium.com***

In particular, the article first presents the basic concepts of distributed computing and how it fits into the idea of Deep learning. Then it moves on to listing the standard requirements (hardware and software) for setting up an environment capable of handling distributed applications. Finally, to provide a hands-on experience, it demonstrates a specific distributed algorithm (namely **Synchronous SGD**) for training DL models from a theoretical as well as implementation perspective.

**Distributed computing** refers to the way of writing a program that makes use of several distinct components connected over a network. Typically, large-scale computation is achieved by such an arrangement of computers capable of handling high-density numeric computations in parallel. In distributed computing terminology, these computers are often referred to as **nodes**, and a collection of such nodes forms a **cluster** over the network. These nodes are usually connected via **Ethernet**, but other **high-bandwidth networks** are also used to take full advantage of the distributed architecture.

Although neural networks, the main workhorse of DL, have been in the literature for quite a while, nobody could utilise their full potential until recently. One of the primary reasons for the sudden boost in their popularity has something to do with massive computational power, the very idea we are trying to address in this article. Deep learning requires training **deep neural networks (DNNs)** with a massive number of parameters on a huge amount of data. Distributed computing is a perfect tool to take advantage of modern hardware to its fullest. Here is the core idea:

A properly crafted distributed algorithm can:

- “**Distribute**” computation (the forward and backward pass of a DL model) along with data across multiple nodes for coherent processing.
- Establish an effective “**Synchronization**” among the nodes to achieve consistency.

One more term you have to get used to is the **Message Passing Interface (MPI)**. MPI is the workhorse of almost all distributed computing. It is an open standard that defines a set of rules for how the nodes talk to each other over a network, along with a programming model/API. MPI is not a piece of software or a tool; it is a *specification*. A group of individuals and organizations from academia and industry came forward in the summer of 1991, which eventually led to the creation of the **MPI Forum**. The forum, by consensus, crafted a *syntactic and semantic specification* of a library that serves as a guideline for different vendors to come up with portable, flexible and optimized implementations. Several implementations of MPI exist, including “**OpenMPI**”, “**MPICH**”, “**MVAPICH**”, “**Intel MPI**” and many more.

In this tutorial, we are going to use **Intel MPI**, as it is very performant and also optimized for Intel platforms. The original Intel MPI is a *C library* and is very low-level in nature.

Proper setup of a distributed system is very important. Without proper hardware and network arrangements, a conceptual understanding of its programming model is of little practical use. Below are the key arrangements that need to be made:

- A set of nodes connected in a common network forming a cluster is typically required. It is recommended to have high-end servers as nodes and a high-bandwidth network like InfiniBand.
- Linux systems with user accounts of the *exact same name* are required on all the nodes in the cluster.
- Nodes must have *password-less SSH* connectivity among them. This is crucial for seamless connectivity.
- An MPI implementation must be installed. This tutorial focuses on Intel MPI only.
- A *common filesystem* is required which is visible from all the nodes, and the distributed applications must reside on it. **Network Filesystem (NFS)** is one way to achieve this.

There are two popular ways of parallelizing Deep learning models:

- Model parallelism
- Data parallelism

Model parallelism refers to a model being logically split into several parts (i.e., some layers in one part and some in another), which are then placed on different hardware/devices. Although placing the parts on different devices does have benefits in terms of execution time (asynchronous processing of data), it is usually employed to avoid memory constraints. Models with a very large number of parameters, which are difficult to fit into a single system due to their high memory footprint, benefit from this type of strategy.

Data parallelism, on the other hand, refers to processing multiple pieces (technically batches) of data through multiple replicas of the same network located on different hardware/devices. Unlike model parallelism, each replica may be an entire network and not just a part of it. This strategy, as you might have guessed, scales well with an increasing amount of data. But, as the entire network has to reside on a single device, it cannot help models with high memory footprints. The illustration below should make it clear.

Practically, Data parallelism is more popular and frequently employed in large organizations for executing production quality DL training algorithms. So, in this tutorial, we will fix our focus on data parallelism.
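To make the data-parallel split concrete, here is a minimal, framework-free sketch of how sample indices might be divided among replicas by rank. The helper name `shard_indices` and the strided assignment are illustrative assumptions; the article does not prescribe a particular way of partitioning the data.

```python
def shard_indices(num_samples, rank, world_size):
    """Return the sample indices handled by `rank` (hypothetical helper).

    Uses a strided split: rank r gets r, r + world_size, r + 2 * world_size, ...
    so the shards are disjoint and together cover every sample.
    """
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    # Ten samples shared among three replicas.
    for r in range(3):
        print(r, shard_indices(10, r, 3))
    # 0 [0, 3, 6, 9]
    # 1 [1, 4, 7]
    # 2 [2, 5, 8]
```

Every replica holds the full model and only sees its own shard, which is the property the rest of the tutorial relies on.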

`torch.distributed` API

**PyTorch** offers a very elegant and easy-to-use API as an interface to the underlying MPI library written in C. PyTorch needs to be compiled from source and must be linked against the Intel MPI installed on the system. We will now see the basic usage of `torch.distributed` and how to execute it.

    # filename 'ptdist.py'
    import torch
    import torch.distributed as dist

    def main(rank, world):
        if rank == 0:
            x = torch.tensor([1., -1.])  # Tensor of interest
            dist.send(x, dst=1)
            print('Rank-0 has sent the following tensor to Rank-1')
            print(x)
        else:
            z = torch.tensor([0., 0.])  # A holder for receiving the tensor
            dist.recv(z, src=0)
            print('Rank-1 has received the following tensor from Rank-0')
            print(z)

    if __name__ == '__main__':
        dist.init_process_group(backend='mpi')
        main(dist.get_rank(), dist.get_world_size())

Executing the above code using `mpiexec`, a distributed process scheduler that comes with any standard MPI implementation, results in:

    ~/nfs$ mpiexec -n 2 -ppn 1 -hosts miriad2a,miriad2b python ptdist.py
    Rank-0 has sent the following tensor to Rank-1
    tensor([ 1., -1.])
    Rank-1 has received the following tensor from Rank-0
    tensor([ 1., -1.])

- The first line to be executed is `dist.init_process_group(backend)`, which sets up the internal communication channel among the participating nodes. It takes an argument to specify which *backend* to use. As we are using MPI throughout, it is `backend='mpi'` in our case. There are other backends as well (like “**TCP**”, “**Gloo**” and “**NCCL**”).
- Two parameters need to be retrieved: the **world size** and the **rank**.
- “**World**” refers to the collection of all nodes that have been specified in a particular context of `mpiexec` invocation (see the `-hosts` flag in `mpiexec`).
- “**Rank**” is a unique integer assigned by the MPI runtime to each of the processes. It starts from 0. The order in which the nodes are specified in the argument of `-hosts` is used to assign the numbers. So, in this case, the process on node “*miriad2a*” will be assigned Rank 0 and the one on “*miriad2b*” will be Rank 1.
- `x` is a tensor that Rank 0 intends to send to Rank 1. It does so by `dist.send(x, dst=1)`.
- `z` is something that Rank 1 created before receiving the tensor. We need an already created tensor of the same shape as a holder for receiving the incoming tensor. The values of `z` will eventually be replaced by the values of `x`.
- Just like `dist.send(..)`, the receiving counterpart is `dist.recv(z, src=0)`, which receives the tensor into `z`.
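The point about the pre-allocated holder can be tried without a cluster. The following is only a rough analogy, assuming two threads stand in for the two ranks and a `queue.Queue` stands in for the network link; none of these names come from `torch.distributed`, and real MPI transfers work quite differently under the hood.

```python
import queue
import threading


def run_two_ranks():
    """Mimic a blocking send/recv pair between two simulated ranks."""
    channel = queue.Queue()  # stand-in for the link between the two ranks
    result = {}

    def rank0():
        x = [1.0, -1.0]       # tensor of interest
        channel.put(list(x))  # "send": hand a copy of the payload to the link

    def rank1():
        z = [0.0, 0.0]                 # holder created before receiving
        z[:] = channel.get(timeout=5)  # "recv": blocks, then fills z in place
        result["z"] = z

    threads = [threading.Thread(target=f) for f in (rank0, rank1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["z"]


if __name__ == "__main__":
    print(run_two_ranks())  # [1.0, -1.0]
```

As in the PyTorch example, the receiver's placeholder values are overwritten by whatever the sender transmitted.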

What we saw in the last section is an example of “**peer-to-peer**” communication, where rank(s) send data to specific rank(s) in a given context. Although this is useful, as it provides the user with granular control over the communication, there exist other standard and frequently used *patterns of communication* called **collectives**. Below is the description of one particular collective (known as **all-reduce**) which is of interest to us in the context of the **Synchronous SGD** algorithm.

**All-reduce** is a way of synchronized communication where *a given reduction operation is applied across all the ranks and the reduced result is made available to all of them*. The figure below illustrates the idea (using summation as the reduction operation).

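    import torch
    import torch.distributed as dist

    def main(rank, world):
        # Every rank starts with a different tensor
        if rank == 0:
            x = torch.tensor([1.])
        elif rank == 1:
            x = torch.tensor([2.])
        elif rank == 2:
            x = torch.tensor([-3.])

        # All ranks take part in the reduction; newer PyTorch releases
        # name this constant dist.ReduceOp.SUM
        dist.all_reduce(x, op=dist.reduce_op.SUM)
        print('Rank {} has {}'.format(rank, x))

    if __name__ == '__main__':
        dist.init_process_group(backend='mpi')
        main(dist.get_rank(), dist.get_world_size())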

When launched in a *world of 3*, this results in:

    ~/nfs$ mpiexec -n 3 -ppn 1 -hosts miriad2a,miriad2b,miriad2c python ptdist.py
    Rank 1 has tensor([0.])
    Rank 0 has tensor([0.])
    Rank 2 has tensor([0.])

- The `if rank == <some rank> … elif` construct is a pattern we encounter again and again in distributed computing. In this case, it is used to create different tensors on different ranks.
- They all execute an **all-reduce** together (see that `dist.all_reduce(..)` is outside the `if … elif` block) with summation (`dist.reduce_op.SUM`) as the reduction operation.
- `x` from every rank is summed up and the summation is placed inside the same `x` of every rank.
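The semantics of a sum all-reduce can also be checked in plain Python. This is a single-process simulation, assuming each inner list plays the role of one rank's tensor; `all_reduce_sum` is a hypothetical helper, not part of `torch.distributed`.

```python
def all_reduce_sum(tensors_per_rank):
    """Simulate a sum all-reduce over equally shaped per-rank lists.

    Returns one list per rank; every rank ends up with the same
    element-wise sum of all inputs.
    """
    reduced = [sum(column) for column in zip(*tensors_per_rank)]
    return [list(reduced) for _ in tensors_per_rank]


if __name__ == "__main__":
    # Same starting values as the three-rank example above.
    print(all_reduce_sum([[1.0], [2.0], [-3.0]]))  # [[0.0], [0.0], [0.0]]
```

Each simulated rank receives its own copy of the result, mirroring how the real collective overwrites `x` on every rank.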

It is assumed that the reader is familiar with the standard **Stochastic Gradient Descent (SGD)** algorithm which is often used to train deep learning models. We will now see a variant of SGD (called **Synchronous SGD**) that makes use of the All-reduce collective to scale up. To lay the foundation, let’s start with the mathematical formulation of standard SGD.

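$$\theta \leftarrow \theta - \lambda \, \nabla_\theta \sum_{(X, y) \in D} \mathrm{Loss}(X, y)$$

where **D** is a set (mini-batch) of samples, θ is the set of all parameters, λ is the learning rate and `Loss(X, y)` is some loss function accumulated over all samples in **D** (a constant averaging factor such as 1/|**D**| can be absorbed into λ).

The core trick that **Synchronous SGD** relies on is splitting the summation in the update rule over smaller subsets of the (mini-)batch. **D** is split into **R** subsets D₁, D₂, …, D_R (preferably with the same number of samples in each) such that

$$D = D_1 \cup D_2 \cup \dots \cup D_R, \qquad D_i \cap D_j = \emptyset \ \text{ for } i \neq j$$

Splitting the summation of the standard SGD update formula leads to

$$\theta \leftarrow \theta - \lambda \, \nabla_\theta \left[ \sum_{(X, y) \in D_1} \mathrm{Loss}(X, y) + \dots + \sum_{(X, y) \in D_R} \mathrm{Loss}(X, y) \right]$$

Now, as the gradient operator is distributive over the summation operator, we get

$$\theta \leftarrow \theta - \lambda \left( \left[ \nabla_\theta \sum_{(X, y) \in D_1} \mathrm{Loss}(X, y) \right] + \dots + \left[ \nabla_\theta \sum_{(X, y) \in D_R} \mathrm{Loss}(X, y) \right] \right)$$

Have a look at the individual gradient terms (inside square brackets) in the above equation. They can now be computed independently and summed up to get the original gradient without any loss or approximation. This is where data parallelism comes into the picture. Here is the whole story: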
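The claim that the gradient of a summed loss equals the sum of the per-subset gradients can be verified numerically. The sketch below assumes a toy one-parameter model with squared-error loss `(theta * x - y) ** 2`; the model, the data and the helper `grad_sum` are all illustrative choices, not taken from the article.

```python
def grad_sum(theta, samples):
    """Gradient w.r.t. theta of the summed loss (theta * x - y) ** 2."""
    return sum(2.0 * (theta * x - y) * x for x, y in samples)


# A toy "mini-batch" D, split into two disjoint subsets D_1 and D_2.
D = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
chunks = [D[0:2], D[2:4]]
theta = 0.5

full_gradient = grad_sum(theta, D)
summed_partial_gradients = sum(grad_sum(theta, chunk) for chunk in chunks)

if __name__ == "__main__":
    print(full_gradient, summed_partial_gradients)  # -100.0 -100.0
```

Each partial gradient only needs its own subset of the data, which is what allows the subsets to live on different ranks.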

1. Split the entire dataset into **R** equal chunks. The letter **R** refers to *Replica*.
2. Launch **R** processes/ranks using MPI and bind each process to one chunk of the dataset.
3. Let each rank compute the gradient using a mini-batch (dᵣ) of size **B** from its own portion of data, i.e., rank **r** computes

   $$\nabla_\theta \sum_{(X, y) \in d_r} \mathrm{Loss}(X, y)$$

4. Sum up all the gradients of all the ranks and make the resulting gradient available to all of them to proceed further.

The last point is exactly the **all-reduce** algorithm. So, all-reduce must be executed every time all ranks have computed one gradient (on a mini-batch of size **B**) on their own portion of the dataset. A subtle point to note here is that summing up the gradients (on mini-batches of size **B**) from all **R** ranks leads to an effective batch size of **R × B**.

The following are the crucial parts of the implementation (the boilerplate code is not shown):

    model = LeNet()

    # First synchronization of initial weights
    sync_initial_weights(model, rank, world_size)

    optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.85)

    model.train()
    for epoch in range(1, epochs + 1):
        for data, target in train_loader:
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()

            # The all-reduce on gradients
            sync_gradients(model, rank, world_size)

            optimizer.step()


    def sync_initial_weights(model, rank, world_size):
        for param in model.parameters():
            if rank == 0:
                # Rank 0 is sending its own weights
                # to all its siblings (1 to world_size - 1)
                for sibling in range(1, world_size):
                    dist.send(param.data, dst=sibling)
            else:
                # Siblings must receive the parameters
                dist.recv(param.data, src=0)


    def sync_gradients(model, rank, world_size):
        for param in model.parameters():
            # Newer PyTorch releases name this constant dist.ReduceOp.SUM
            dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)

1. All **R** ranks create their own copy/replica of the model with random weights.
2. Individual replicas with random weights may lead to initial de-synchronization. It is preferable to synchronize the initial weights among all the replicas. The `sync_initial_weights(..)` routine does exactly that. Any one of the ranks sends its weights to its siblings, and the siblings must receive them to initialize themselves.
3. Fetch a mini-batch (of size **B**) from the respective portion of a rank and compute the forward and backward pass (gradient). An important point to note here, as a part of the setup, is that every process/rank should have its own portion of data visible (usually on its own *hard disk* or on a *shared filesystem*).
4. Execute the all-reduce collective on the gradients of each replica with summation as the reduction operation. The `sync_gradients(..)` routine does the gradient synchronization.
5. After the gradients have been synchronized, every replica can execute a standard SGD update on its own weights independently. The `optimizer.step()` call does the job as usual.

Now a question might arise, “*How do we ensure that independent updates will remain in sync?*”.

If we take a look at the update equation for the first update,

$$\theta_1 = \theta_0 - \lambda \, \nabla_\theta \sum_{(X, y) \in D} \mathrm{Loss}(X, y)$$

points 2 and 4 above ensure that the *initial weights* (θ₀) and the *gradients* are synchronized individually. For obvious reasons, a linear combination of them will also be in sync (λ is a constant). A similar logic holds for all consecutive updates.
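The argument that replicas stay in sync can be simulated in a few lines of plain Python. This sketch assumes a toy one-parameter model with squared-error loss, treats each rank's whole chunk as its mini-batch, and replaces MPI with an ordinary `sum`; all names are hypothetical and plain SGD without momentum is used.

```python
def grad_sum(theta, samples):
    """Gradient w.r.t. theta of the summed loss (theta * x - y) ** 2."""
    return sum(2.0 * (theta * x - y) * x for x, y in samples)


def synchronous_sgd(chunks, theta0, lr, steps):
    """Simulate R replicas; chunk r plays the role of rank r's mini-batch."""
    thetas = [theta0] * len(chunks)  # identical initial weights on every rank
    for _ in range(steps):
        local_grads = [grad_sum(t, c) for t, c in zip(thetas, chunks)]
        total = sum(local_grads)     # sum all-reduce: same value on all ranks
        thetas = [t - lr * total for t in thetas]  # independent local updates
    return thetas


def single_process_sgd(samples, theta0, lr, steps):
    """Reference: ordinary SGD on the full batch in one process."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad_sum(theta, samples)
    return theta


D = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
replica_weights = synchronous_sgd([D[0:2], D[2:4]], theta0=0.5, lr=0.01, steps=5)
reference_weight = single_process_sgd(D, theta0=0.5, lr=0.01, steps=5)
```

After any number of steps, all entries of `replica_weights` are identical and agree with `reference_weight` up to floating-point rounding, because every replica applies the same update to the same starting weights.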

The biggest bottleneck for any distributed algorithm is synchronization. Distributed algorithms are beneficial only if the synchronization time is significantly less than the computation time. Let’s do a simple comparison between standard and synchronous SGD to see when the latter is beneficial.

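**Definitions.** Let’s assume the size of the entire dataset is **N**. Mini-batches of size **B** are processed by the network, which takes time $T_{comp}$ per mini-batch. In the distributed case, the time taken for all-reduce synchronization is $T_{sync}$. If there are **R** replicas, the time taken for one epoch is

$$T_{\text{standard}} = \frac{N}{B} \, T_{comp}, \qquad T_{\text{distributed}} = \frac{N}{R \, B} \left( T_{comp} + T_{sync} \right)$$

So, for the distributed setting to be significantly beneficial over the non-distributed one, we need to have

$$\frac{N}{R \, B} \left( T_{comp} + T_{sync} \right) \ll \frac{N}{B} \, T_{comp}$$

or, equivalently,

$$1 + \frac{T_{sync}}{T_{comp}} \ll R$$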

The three factors contributing to the above inequality can be tweaked to extract more and more benefit out of the distributed algorithm.

- $T_{sync}$ can be reduced by connecting the nodes over a high-bandwidth (fast) network.
- $T_{comp}$ can be increased by increasing the batch size **B**.
- **R** can be increased by connecting more nodes over the network and having more replicas.
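The trade-off between these factors can be explored with a small cost model. The sketch assumes that a non-distributed epoch consists of N/B iterations costing `t_comp` each, and that a distributed epoch consists of N/(R·B) iterations costing `t_comp + t_sync` each; the function names and the sample timings are made up for illustration.

```python
def epoch_time_standard(n, b, t_comp):
    """Epoch time on one node: N/B mini-batches, each taking t_comp."""
    return (n / b) * t_comp


def epoch_time_sync(n, b, r, t_comp, t_sync):
    """Epoch time with R replicas: N/(R*B) steps of compute plus all-reduce."""
    return (n / (r * b)) * (t_comp + t_sync)


if __name__ == "__main__":
    n, b, r, t_comp = 64000, 64, 4, 0.1
    base = epoch_time_standard(n, b, t_comp)
    for t_sync in (0.02, 0.5):  # fast network vs. slow network
        dist_time = epoch_time_sync(n, b, r, t_comp, t_sync)
        print(t_sync, base / dist_time)  # speedup factor; below 1 means slower
```

With a cheap all-reduce the four replicas give a speedup of roughly 3.3, while a synchronization cost five times the compute cost makes the distributed run slower than a single node.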

Hopefully, the article was clear enough to convey the central idea of **Distributed Computing** in the context of Deep Learning. Although **Synchronous SGD** is quite popular, there are other distributed algorithms that are also used quite frequently (like **Asynchronous SGD and its variants**). What is more important, though, is to be able to think about deep learning methods in a *parallel manner*. Please realize that not all algorithms can be parallelized out-of-the-box; some require approximations that break the theoretical guarantees given by the original algorithms. It is up to the algorithm designer/implementer to tackle these approximations in an efficient way.
