GPUs have proven extremely useful for highly parallelizable data-processing use cases. The computational paradigms found in machine learning and deep learning, for example, map extremely well to the processing architecture that graphics cards provide.
One might assume that GPUs can process any submitted tasks concurrently. In practice, while the internal steps within a single workload do run in parallel, separate workloads are processed sequentially. Recent improvements in graphics card architectures now enable hardware parallelization across multiple workloads, which can be achieved by submitting the workloads to different underlying physical GPU queues. Practical techniques in machine learning that would benefit from this include model parallelism and data parallelism.
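As an illustrative sketch of this idea (assuming the Kompute C++ API; the exact constructor and sequence signatures vary across Kompute versions, the queue family indices `0` and `2` are hypothetical values that must be queried from the physical device, and the `kp::Algorithm` setup is omitted — this also requires a Vulkan-capable GPU to run):

```cpp
#include <kompute/Kompute.hpp>  // Vulkan Kompute framework

int main() {
    // Initialize the manager on GPU device 0, requesting two underlying
    // queues from queue family indices 0 and 2 (hypothetical values --
    // the available families must be queried from the physical device).
    kp::Manager mgr(0, {0, 2});

    // Create one sequence per underlying queue; the argument indexes
    // into the list of queues requested above.
    auto sqOne = mgr.sequence(0);
    auto sqTwo = mgr.sequence(1);

    // algoOne / algoTwo: previously created kp::Algorithm instances
    // (tensor and shader setup omitted from this sketch).

    // Submit both workloads asynchronously; because they target different
    // physical queue families, the GPU may process them in parallel.
    sqOne->evalAsync<kp::OpAlgoDispatch>(algoOne);
    sqTwo->evalAsync<kp::OpAlgoDispatch>(algoTwo);

    // Wait for both submissions to finish.
    sqOne->evalAwait();
    sqTwo->evalAwait();
}
```

The key point is that submitting to two sequences backed by the *same* queue family would still serialize on the hardware; the parallelism comes from the two underlying physical queues.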
In this example we will show how to achieve a 2x speed improvement over a synchronous baseline simply by submitting the workload across two queue families. This is an important optimization technique: recent announcements around NVIDIA's Ampere GA10x architecture point toward 3x speed improvements, making it clear that this trend will continue to open up further opportunities in this area.
We will implement this using Vulkan and the Vulkan Kompute framework. More specifically, we will cover:
You can find the full code in this file; instructions on how to run the full suite using CMake can be found in the build section of the main Kompute repository.
#parallel-computing #deep-learning #machine-learning #vulkan #gpu