GPUs have proven extremely useful for highly parallelizable data-processing use cases. The computational paradigms found in machine learning and deep learning, for example, map extremely well onto the processing architecture that graphics cards provide.

One might assume that GPUs can process any submitted tasks concurrently. The internal steps within a single workload are indeed run in parallel; however, separate workloads are actually processed sequentially. Recent improvements in graphics card architectures enable hardware parallelization across multiple workloads, which can be achieved by submitting the workloads to different underlying physical GPU queues. Practical techniques in machine learning that would benefit from this include model parallelism and data parallelism.
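
To make this concrete, the sketch below shows how multiple queue families can be requested up front so that separate workloads land on separate hardware queues. It uses the Kompute C++ API; the `Manager(deviceIndex, familyQueueIndices)` constructor and `sequence(queueIndex)` call reflect a recent version of the library and may differ in older releases, and the family indices 0 and 2 are purely illustrative.

```cpp
#include <kompute/Kompute.hpp>

int main()
{
    // Assumption: Manager(physicalDeviceIndex, familyQueueIndices) is the
    // constructor exposed by your Kompute version. Requesting queues from
    // two different families (0 and 2 here, illustrative values) gives us
    // two queues that can map to separate hardware queues on capable GPUs.
    kp::Manager mgr(0, { 0, 2 });

    // Each sequence is bound to one of the queues requested above, so
    // work recorded on sq1 and sq2 can execute in parallel on the GPU.
    auto sq1 = mgr.sequence(0);
    auto sq2 = mgr.sequence(1);
}
```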

In this example we will show how to achieve a 2x speed improvement over a synchronous baseline simply by submitting the workload across two queue families. This will be an increasingly important optimization technique, as NVIDIA’s recently announced Ampere GA10x architecture will enable 3x speed improvements, making it clear that this trend will only continue to bring further opportunities in this area.
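
As a preview of the pattern we will build towards, the sketch below contrasts the two submission strategies: blocking evaluation of two operations one after the other, versus asynchronous submission to two sequences backed by different queue families. The templated `eval`/`evalAsync`/`evalAwait` calls are assumptions based on a recent Kompute API, and the simple data transfers stand in for the compute workloads used later in the article.

```cpp
#include <chrono>
#include <iostream>
#include <vector>
#include <kompute/Kompute.hpp>

int main()
{
    // Illustrative queue family indices; query your device for real ones.
    kp::Manager mgr(0, { 0, 2 });

    auto tA = mgr.tensor(std::vector<float>(1 << 20, 1.0f));
    auto tB = mgr.tensor(std::vector<float>(1 << 20, 2.0f));

    auto sq1 = mgr.sequence(0); // bound to the first requested queue
    auto sq2 = mgr.sequence(1); // bound to the second requested queue

    // Synchronous baseline: the second operation starts only after the
    // first has fully completed.
    auto t0 = std::chrono::steady_clock::now();
    sq1->eval<kp::OpTensorSyncDevice>({ tA });
    sq1->eval<kp::OpTensorSyncDevice>({ tB });
    auto t1 = std::chrono::steady_clock::now();

    // Parallel variant: both operations are submitted to separate queues
    // before either is awaited, leaving the GPU free to overlap them.
    sq1->evalAsync<kp::OpTensorSyncDevice>({ tA });
    sq2->evalAsync<kp::OpTensorSyncDevice>({ tB });
    sq1->evalAwait();
    sq2->evalAwait();
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "sequential: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
              << " us, parallel: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count()
              << " us\n";
}
```

Note that on hardware exposing only one usable queue family, or where the driver serializes the queues, the two timings will be similar; the speedup appears only when the workloads genuinely land on distinct hardware queues.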

We will be implementing this using Vulkan and the Vulkan Kompute framework. More specifically, we will cover:

  • Disambiguation of “asynchronous” and “parallel” in GPU processing
  • A base synchronous example that we will build upon
  • Steps to extend the example for asynchronous workload submission
  • Steps to extend the example for parallel multi-queue GPU processing

You can find the full code in this file; instructions on how to run the full suite with CMake can be found in the build section of the main Kompute repository.

