These days, data distributed training is all the rage. In data distributed training, learning is performed on multiple workers in parallel. The workers can reside on one or more training machines. Each worker starts off with its own identical copy of the full model and performs each training step on a different subset (local batch) of the training data. After each training step, it publishes its resulting gradients and updates its own model, taking into account the combined knowledge learned by all of the workers. Denoting the number of workers by k and the local batch size by b, the result of performing distributed training on k workers is that, at each training step, the model trains on a global batch of k*b samples.

It is easy to see the allure of data distributed training: more samples per training step means faster training, faster training means faster convergence, and faster convergence means faster deployment. Why train ImageNet for 29 hours if we can train it in one? Why train BERT for 3 days if we can train it in just 76 minutes? For especially large or complex networks, distributed training is all but essential for the model to train in a time frame short enough for the model to be usable.
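To make the mechanics concrete, here is a minimal sketch of a single data distributed training step, simulated in plain NumPy. The model, data, learning rate, and the `local_gradient` helper are all hypothetical and chosen only for illustration; the gradient averaging stands in for the allreduce-style exchange that frameworks such as Horovod perform for you.

```python
# Minimal sketch of one data-distributed training step (illustrative only).
# Hypothetical setup: k workers fit a linear model y = X @ w with MSE loss;
# each worker holds its own local batch of b samples.
import numpy as np

k, b, d = 4, 8, 3            # workers, local batch size, feature dimension
rng = np.random.default_rng(0)
w = rng.normal(size=d)       # every worker starts from this identical copy

def local_gradient(w, X, y):
    """MSE gradient computed on one worker's local batch."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

# Each worker draws a different local batch and computes its own gradient.
batches = [(rng.normal(size=(b, d)), rng.normal(size=b)) for _ in range(k)]
grads = [local_gradient(w, X, y) for X, y in batches]

# "Publishing" the gradients: average them (what an allreduce does), then
# every worker applies the same update, so each step effectively trains on
# a global batch of k * b samples.
avg_grad = np.mean(grads, axis=0)
w -= 0.1 * avg_grad
print(w)
```

In a real setup each worker would run this loop as a separate process, and the averaging step would go over the network rather than over a Python list, but the arithmetic of the update is the same.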

