Cutting-edge deep learning models are growing at an exponential rate: where last year’s GPT-2 had ~1.5 billion parameters, this year’s GPT-3 has 175 billion. GPT is a somewhat extreme example; nevertheless, the “enbiggening” of the SOTA is driving larger and larger models into production applications, challenging the ability of even the most powerful GPU cards to finish model training jobs in a reasonable amount of time.
To deal with these problems, practitioners are increasingly turning to distributed training. Distributed training is the set of techniques for training a deep learning model using multiple GPUs and/or multiple machines. Distributing training jobs allows you to push past the single-GPU memory bottleneck and develop ever larger and more powerful models by leveraging many GPUs simultaneously.
This blog post is an introduction to distributed training in pure PyTorch using the
torch.nn.parallel.DistributedDataParallel API. We will discuss how
DistributedDataParallel works and show how it is used by example.
Before we can dive into
DistributedDataParallel, we first need to acquire some background knowledge about distributed training in general.
There are basically two different forms of distributed training in common use today: data parallelization and model parallelization.
In data parallelization, the model training job is split on the data. Each GPU in the job receives its own independent slice of the data batch, i.e. its own “batch slice”. Each GPU uses this data to independently calculate a gradient update. For example, if you were to use two GPUs and a batch size of 32, one GPU would handle forward and backpropagation on the first 16 records, and the second GPU the last 16. These gradient updates are then synchronized among the GPUs, averaged together, and finally applied to the model.
(The synchronization step is technically optional, but theoretically faster asynchronous update strategies are still an active area of research.)
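To make the split-compute-average loop concrete, here is a minimal CPU-only sketch of one data-parallel update step, using a toy one-parameter linear model `y = w * x` with MSE loss. Each “GPU” here is just a function call over its batch slice; the `grad_on_slice` and `data_parallel_step` names are hypothetical helpers for illustration, and real DistributedDataParallel performs the synchronization step with an all-reduce across devices rather than a Python loop.

```python
def grad_on_slice(w, xs, ys):
    """Mean gradient of the MSE loss, d/dw mean((w*x - y)^2), over one batch slice."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, xs, ys, num_gpus=2, lr=0.01):
    """One data-parallel update: split the batch, compute per-slice
    gradients independently, average them, then apply the update."""
    k = len(xs) // num_gpus  # size of each batch slice
    slices = [(xs[i * k:(i + 1) * k], ys[i * k:(i + 1) * k])
              for i in range(num_gpus)]
    grads = [grad_on_slice(w, sx, sy) for sx, sy in slices]  # one grad per "GPU"
    avg_grad = sum(grads) / num_gpus  # the synchronization ("all-reduce") step
    return w - lr * avg_grad
```

Note that when the slices are equal in size, the averaged per-slice gradients equal the gradient of the full batch, which is why data parallelism produces the same update a single larger GPU would.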
#distributed-systems #pytorch #neural-networks