There are two main approaches to training models across multiple devices: model parallelism, where the model is split across the devices, and data parallelism, where the model is replicated on every device and each replica is trained on a different subset of the data. Let’s look at these two options more closely to understand how training models across multiple devices works.
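To make the data-parallel side concrete, here is a minimal sketch of replicating a model across the available GPUs so that each replica processes a different slice of every batch. It assumes PyTorch and a machine with more than one GPU; the model and batch are placeholders, not from the original article.

```python
import torch
import torch.nn as nn

# A small placeholder model (hypothetical, for illustration only).
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# nn.DataParallel replicates the model on every visible GPU; each forward
# pass splits the batch across the replicas and gathers the outputs back.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda")

x = torch.randn(64, 784, device="cuda")  # one batch, split across replicas
logits = model(x)                         # shape: (64, 10)
```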

Training Models Using Model Parallelism

So far we have trained each neural network on a single device. What if we want to train a single neural network across multiple devices? This requires chopping the model into separate chunks and running each chunk on a different device. Unfortunately, such model parallelism turns out to be pretty tricky, and it depends on the architecture of your neural network. For fully connected networks, there is generally not much to be gained from this approach. Intuitively, it may seem that an easy way to split the model is to place each layer on a different device, but this does not work because each layer needs to wait for the output of the previous layer before it can do anything.
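As a rough illustration of why per-layer placement does not buy much, here is a sketch (assuming PyTorch and two GPUs; the layer sizes are made up) that puts the first chunk of a small fully connected network on one GPU and the rest on another. The second chunk cannot start until the first chunk’s output has been computed and copied over, so the devices mostly take turns rather than working in parallel.

```python
import torch
import torch.nn as nn

class TwoDeviceMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # First chunk on GPU 0, second chunk on GPU 1 (assumes two GPUs).
        self.part1 = nn.Sequential(nn.Linear(784, 256), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(256, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # part1's output must be copied to GPU 1 before part2 can run,
        # so the two devices work mostly sequentially, not in parallel.
        return self.part2(x.to("cuda:1"))

model = TwoDeviceMLP()
y = model(torch.randn(32, 784))  # output lives on cuda:1
```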

So perhaps you can slice it vertically instead, for example with the left half of each layer on one device and the right half on another device? This is slightly better, since both halves of each layer can indeed work in parallel, but the problem is that each half of the next layer requires the output of both halves, so there will be a lot of cross-device communication. Since cross-device communication is slow, this is likely to completely cancel out the benefit of the parallel computation.
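Below is a hedged sketch of the vertical split just described, again assuming PyTorch and two GPUs, with made-up layer sizes. Each device computes its half of the layer’s units, but the full input must be sent to both devices and the two half-outputs must be brought back together before the next layer can use them, which is where the cross-device communication cost comes from.

```python
import torch
import torch.nn as nn

class VerticallySplitLayer(nn.Module):
    """One fully connected layer split vertically across two GPUs."""

    def __init__(self, in_features, out_features):
        super().__init__()
        half = out_features // 2
        self.left = nn.Linear(in_features, half).to("cuda:0")
        self.right = nn.Linear(in_features, out_features - half).to("cuda:1")

    def forward(self, x):
        # The full input must be sent to both devices...
        left_out = self.left(x.to("cuda:0"))
        right_out = self.right(x.to("cuda:1"))
        # ...and both half-outputs must be gathered back together before the
        # next layer can run; this cross-device copy happens at every layer.
        return torch.cat([left_out, right_out.to("cuda:0")], dim=1)

layer = VerticallySplitLayer(784, 256)
out = layer(torch.randn(32, 784))  # shape: (32, 256), on cuda:0
```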
