Originally published by Ayan Das at medium.com
In particular, the article first presents the basic concepts of distributed computing and how it fits into the idea of Deep learning. Then it moves on to listing the standard requirements (hardware and software) for setting up an environment capable of handling distributed applications. Finally, to provide a hands-on experience, it demonstrates a specific distributed algorithm (namely Synchronous SGD) for training DL models from a theoretical as well as implementation perspective.
Distributed computing refers to the way of writing a program that makes use several distinct components connected over network. Typically, large scale computation is achieved by such an arrangement of computers capable of handling high density numeric computations in parallel. In distributed computing terminology, these computers are often referred to as nodes and a collection of such nodes form a cluster over the network. These nodes are usually connected via Ethernet, but other high-bandwidth networks are also used to take full advantage of the distributed architecture.
Although Neural Networks, the main workhorse of DL, has been in the literature from quite a while, nobody could utilise its full potential until recently. One of the primary reasons for the sudden boost in its popularity has something to with massive computational power, the very idea we are trying to address in this article. Deep learning requires training Deep neural networks (DNN) with massive number of parameters on a huge amount of data. Distributed computing is a perfect tool to take advantage of the modern hardware to its fullest. Here is the core idea:
A properly crafted distributed algorithm can:
One more terminology you have to get used to — Message Passing Interface (MPI). MPI is the workhorse of almost all of distributed computing. MPI is an open standard that defines a set of rules on how the nodes will talk to each other over network and also a programming model/API. MPI is not a software or tool, it’s a specification. A group of individuals, organizations from academia and industry came forward in the summer of 1991 which eventually led to the creation of MPI Forum. The forum, with a consensus, crafted a syntactic and semantic specification of a library that is to be served as a guideline for different hardware vendors to come up with portable/ flexible/optimized implementations. Several hardware vendors have their own implementation of MPI — “OpenMPI”, “MPICH”, “MVAPICH”, “Intel MPI” and lot more.
In this tutorial, we are going to use Intel MPI as it is very performant and also optimized for Intel platforms. Original Intel MPI is a C library and very low level in nature.
Proper setup of a distributed system is very important. Without proper hardware and network arrangements, it’s pretty much useless even if one has conceptual understanding of it’s programming model. Below are the key arrangements need to be made:
There are two popular ways of parallelizing Deep learning models:
Model parallelism refers to a model being logically split into several parts (i.e., some layers in one part and some in other), then placing them on different hardware/devices. Although placing the parts on different devices does have benefits in terms of execution time (asynchronous processing of data), it is usually employed to avoid memory constraints. Models with very large number of parameters, which are difficult fit into a single system due to high memory footprint, benefits from this type of strategy.
Data parallelism, on the other hand, refers to processing multiple pieces (technically batches) of data through multiple replicas of the same network located on different hardware/devices. Unlike model parallelism, each replica may be an entire network and not just a part of it. This strategy, as you might
have guessed, can scale up well with increasing amount of data. But, as the entire network has to reside on a single device, it cannot help models with high memory footprints. The illustration below should make it clear.
Practically, Data parallelism is more popular and frequently employed in large organizations for executing production quality DL training algorithms. So, in this tutorial, we will fix our focus on data parallelism.
torch.distributed"
APIPyTorch offers a very elegant and easy-to-use API as an interface to the underlying MPI library written in C. PyTorch needs to be compiled from source and must be linked against the Intel MPI installed on the system. We will now see the basic usage of torch.distributed
and how to execute it.
# filename 'ptdist.py' import torch import torch.distributed as dist def main(rank, world): if rank == 0: x = torch.tensor([1., -1.]) # Tensor of interest dist.send(x, dst=1) print('Rank-0 has sent the following tensor to Rank-1') print(x) else: z = torch.tensor([0., 0.]) # A holder for recieving the tensor dist.recv(z, src=0) print('Rank-1 has recieved the following tensor from Rank-0') print(z)if name == ‘main’:
dist.init_process_group(backend=‘mpi’)
main(dist.get_rank(), dist.get_world_size())
Executing the above code using mpiexec
, a distributed process scheduler comes with any standard MPI implementation, results in:
cluster@miriad2a:~/nfs$ mpiexec -n 2 -ppn 1 -hosts miriad2a,miriad2b python ptdist.py
Rank-0 has sent the following tensor to Rank-1
tensor([ 1., -1.])
Rank-1 has recieved the following tensor from Rank-0
tensor([ 1., -1.])
dist.init_process_group(backend)
which basically sets up the internal communication channel among the participating nodes. It takes an argument to specify which backend to use. As we are using MPI throughout, its backend=’mpi’
in our case. There are other backends as well (like “TCP”, “Gloo” and “NCCL”).mpiexec
invocation (see the -hosts
flag in mpiexec
).-hosts
is used to assign the numbers. So, in this case, the process on node “miriad2a” will be assigned Rank 0 and “miriad2b” will be Rank 1.x
is a tensor that Rank 0 intends to send to Rank 1. It does so by dist.send(x, dst=1)
.z
is something that Rank 1 created before receiving the tensor. We need an already created tensor of same shape as a holder for receiving the incoming tensor. The values of z
will eventually be replaced by the value of x
.dist.send(…)
, the receiving counterpart is dist.recv(z, src=0)
which receives the tensor into z
.What we saw in the last section is an example of “peer-to-peer” communication where rank(s) send data to specific rank(s) in a given context. Although this is useful as it provides user with granular control over the communication, there exist other standard and frequently used patterns of communication called collectives. Below is the description of one particular collective (known as all-reduce) which is of interest to us in the context of Synchronous SGDalgorithm.
All-reduce is a way of synchronized communication where a given reduction operation is operated on all the ranks and the reduced result is made available to all of them. The below figure illustrates the idea (uses summation as the reduction operation).
def main(rank, world):
if rank == 0:
x = torch.tensor([1.])
elif rank == 1:
x = torch.tensor([2.])
elif rank == 2:
x = torch.tensor([-3.])dist.all_reduce(x, op=dist.reduce_op.SUM) print('Rank {} has {}'.format(rank, x))
if name == ‘main’:
dist.init_process_group(backend=‘mpi’)
main(dist.get_rank(), dist.get_world_size())
When launched in a world of 3, results in
cluster@miriad2a:~/nfs$ mpiexec -n 3 -ppn 1 -hosts miriad2a,miriad2b,miriad2c python ptdist.py
Rank 1 has tensor([0.])
Rank 0 has tensor([0.])
Rank 2 has tensor([0.])
if rank == <some rank> … elif
is a pattern we encounter again and again in distributed computing. In this case, it is used to create different tensors on different ranks.dist.all_reduce(…)
is outside if … elif
block) with summation (dist.reduce_op.SUM
) as reduction operation.x
from every rank is summed up and the summation is placed inside the same x
of every rank.It is assumed that the reader is familiar with the standard Stochastic Gradient Descent (SGD) algorithm which is often used to train deep learning models. We will now see a variant of SGD (called Synchronous SGD) that makes use of the All-reduce collective to scale up. To lay the foundation, let’s start with the mathematical formulation of standard SGD.
where D is a set (mini-batch) of samples, θ is the set of all parameters, λ is the learning rate and Loss(X, y)
is some loss function averaged over all samples in D.
The core trick that Synchronous SGD relies on is splitting the summation in the update rule over smaller subsets of (mini)batches. D is split into R number of subsets D₁, D₂, . . (preferably with same number of samples in each) such that
Splitting the summation of standard SGD update formula leads to
Now, as the gradient operator is distributive over summation operator, we get
Have a look at those individual gradient terms (inside square brackets) in the above equation. They can now be computed independently and summed up to get the original gradient without any loss/approximation. This is where the data parallelism comes into picture. Here is the whole story:
4. Sum up all the gradients of all the ranks and make the resulting gradient available to all of them to proceed further.
The last point is exactly the all-reduce algorithm. So, all-reduce must be executed every time all ranks have computed one gradient (on a mini-batch of size B) on their own portion of the dataset. A subtle point to note here is that summing up the gradients (on mini-batches of size B) from all R ranks leads
to an effective batch size of
The following are the crucial parts of the implementation (the boilerplate codes are not shown)
model = LeNet()first synchronization of initial weights
sync_initial_weights(model, rank, world_size)
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.85)
model.train()
for epoch in range(1, epochs + 1):
for data, target in train_loader:
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()# The all-reduce on gradients sync_gradients(model, rank, world_size) optimizer.step()
def sync_initial_weights(model, rank, world_size):
for param in model.parameters():
if rank == 0:
# Rank 0 is sending it’s own weight
# to all it’s siblings (1 to world_size)
for sibling in range(1, world_size):
dist.send(param.data, dst=sibling)
else:
# Siblings must recieve the parameters
dist.recv(param.data, src=0)def sync_gradients(model, rank, world_size):
for param in model.parameters():
dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)
sync_initial_weights(…)
routine does exactly that. Let any one of the ranks send its weights to its siblings and the siblings must receive them to initialize themselves with it.sync_gradients(…)
routine does the gradient synchronization.optimizer.step()
does the job as usual.Now a question might arise, “How do we ensure that independent updates will remain in sync?”.
If we take a look at the update equation for the first update
Point 2 & 4 above ensure that the initial weights and the gradients are synchronized individually. For obvious reason, a linear combination of them will also be in sync (λ is a constant). A similar logic holds for all consecutive updates.
The biggest bottleneck for any distributed algorithm is the synchronization. Distributed algorithms are beneficial only if the synchronization time is significantly less than computation time. Let’s have a simple comparison between the standard and synchronous SGD to see when is the later one beneficial.
Definitions. Let’s assume the size of the entire dataset is N. Mini-batches of size B are processed by the network which takes time Tcomp. In the distributed case, time taken for all-reduce synchronization is Tsync. If there are Rreplicas, time taken for one epoch
So, for the distributed setting to be significantly beneficial over non-distributed one, we need to have
OR, equivalently
The three factors contributing to the above inequality can be tweaked to extract more and more benefit out of the distributed algorithm.
Hopefully, the article was clear enough to convey the central idea of Distributed Computing in the context of Deep Learning. Although, Synchronous SGD is quite popular, there are other distributed algorithms which are also used quite frequently (like Asynchronous SGD and its variants). But, what is more important is to be able to think about deep learning methods in a parallel manner. Please realize that not all algorithms can be parallelized out-of-the-box; some require approximations to be made which break theoretical guarantees given by the original algorithms. It is up to the algorithm designer/implementer to tackle these approximations in an efficient way.
Originally published by Ayan Das at medium.com
=========================
Thanks for reading :heart: If you liked this post, share it with all of your programming buddies! Follow me on Facebook | Twitter
☞ Data Science, Deep Learning, & Machine Learning with Python
☞ Deep Learning A-Z™: Hands-On Artificial Neural Networks
☞ Machine Learning A-Z™: Hands-On Python & R In Data Science
☞ Deep Learning A-Z™: Hands-On Artificial Neural Networks
☞ Complete Guide to TensorFlow for Deep Learning with Python
☞ Data Science, Deep Learning, & Machine Learning with Python
☞ Deep Learning: Recurrent Neural Networks in Python
☞ Machine Learning & Tensorflow - Google Cloud Approach
☞ Python for Data Science and Machine Learning Bootcamp
☞ Machine Learning, Data Science and Deep Learning with Python
☞ [2019] Machine Learning Classification Bootcamp in Python
#python #deep-learning