Deep learning requires a lot of compute, so to train models effectively most people either rent GPUs in the cloud from providers like AWS or train on hardware they own.

I decided to build my own machine because I was spending a lot of money in the cloud and wanted to cut costs. Introducing… Warmachine! Warmachine is a very capable machine built to take on advanced A.I. tasks, from deep learning to reinforcement learning, and it was built entirely from consumer off-the-shelf hardware.

Check out Warmachine’s parts on PCPartPicker with the link below.

[pcpartpicker.com]

In this article, I’m going to give tips on how to properly build an A.I. training machine. Towards the end, I’ll talk about the advantages and disadvantages of building your own rig vs using the cloud.

If you want to see the video version of this instead, check this out…

The Parts

So Warmachine was primarily built to tackle deep learning and reinforcement learning problems. I wanted a machine with a healthy number of cores and 4 GPUs so I could iterate quickly on training my machine learning models. My end goal is to eventually have something similar to a Lambda Quad machine, but without paying Lambda Quad prices.

Lambda Labs Quad pricing for a 4-GPU RTX 2080 Ti machine

When Warmachine is complete I’ll have spent about $7k, which is $4k cheaper than Lambda Labs!

Deep learning rigs require particular components, so it was harder than usual to find reliable resources online on how to build one of these things. Let’s walk through everything you need to know to build your own deep learning machine.

GPU

At the heart of training deep learning models is the GPU. GPUs are super fast at running deep learning algorithms because, unlike CPUs with a small number of complex cores, GPUs have hundreds or thousands of simple cores that are extremely efficient at matrix multiplication. The most reliable GPU brand for deep learning is Nvidia: most deep learning frameworks fully support Nvidia’s CUDA SDK, the software library used to interface with their GPUs.
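To make that concrete, here is a minimal PyTorch sketch of the operation GPUs excel at: a large matrix multiply, the workhorse behind most neural network layers. It falls back to the CPU if no CUDA device is present.

```python
import torch

# Two large random matrices -- matrix multiplies like this dominate
# the cost of most deep learning layers.
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

# Run on the GPU if one is available; otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
c = a.to(device) @ b.to(device)
print(c.shape)  # torch.Size([1024, 1024])
```

On a GPU, the thousands of simple cores each handle a slice of this multiply in parallel, which is why the same line of code runs orders of magnitude faster there than on a CPU.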

When picking out a GPU, to get the most bang for your buck you want something with tensor cores. Tensor cores are a type of processing core that performs specialized matrix math, enabling you to train models using half precision or mixed precision.

This allows more efficient use of GPU memory, which opens the door to bigger batch sizes, faster training, and bigger models. Tensor cores can be found in Nvidia’s RTX GPU models. How much GPU memory you need depends on the type of models you plan on training.
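As a rough sketch of how you’d actually use those tensor cores, here is one mixed-precision training step with PyTorch’s automatic mixed precision (AMP). The tiny linear model and SGD optimizer are stand-ins for illustration only; on a CPU-only machine the AMP machinery is simply disabled and the step runs in full precision.

```python
import torch
import torch.nn as nn

# Hypothetical tiny model, optimizer, and batch -- stand-ins for a real setup.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
inputs = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))

use_cuda = torch.cuda.is_available()
if use_cuda:
    model, inputs, targets = model.cuda(), inputs.cuda(), targets.cuda()

# GradScaler scales the loss so small fp16 gradients don't underflow to zero.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

optimizer.zero_grad()
# autocast runs eligible ops (matmuls, convolutions) in half precision
# on the tensor cores, while keeping precision-sensitive ops in fp32.
with torch.cuda.amp.autocast(enabled=use_cuda):
    loss = loss_fn(model(inputs), targets)

scaler.scale(loss).backward()
scaler.step(optimizer)  # unscales gradients, then calls optimizer.step()
scaler.update()
```

Because activations are stored in half precision under autocast, the same amount of VRAM fits roughly twice the batch size, which is where the speedups come from.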

If you only plan on training small models for embedded devices, you can get away with a GPU with less memory. If you plan on training bigger models, like GPT from the NLP domain, I would get as much memory as possible.

Having more GPU memory opens the door to… you guessed it… bigger batch sizes, faster training, and bigger models. If you plan on doing a multi-GPU setup, you need to go with either blower-style fans or the more expensive option, liquid cooling.

You need blower-style fans because they are built to expel heat out of the case, which is necessary when you have multiple GPUs running. Without them, your system can overheat and potentially damage your hardware.

For Warmachine, I went with an Nvidia RTX 2080 Ti Turbo from ASUS. It has 11GB of VRAM and a blower-style fan for better thermal management in a multi-GPU setup. I plan on buying 3 more GPUs down the road to complete my setup.
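Once a card is installed, a quick sanity check like the sketch below (assuming PyTorch is installed) confirms what CUDA actually sees: each device’s name and total VRAM. This is handy after adding each new GPU to the rig.

```python
import torch

# List every CUDA device PyTorch can see, with its name and VRAM.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA-capable GPU detected")
```

On a machine like Warmachine this should report an RTX 2080 Ti with roughly 11 GB per card; if a card is missing from the list, it usually points to a driver or seating problem.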

CPU

CPUs are mainly used for data loading in deep learning. More threads on a CPU means you can load more data in parallel to feed into your models for training. This is useful if you train on big batch sizes, so the GPU doesn’t have to wait too long for the CPU to load data.
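In PyTorch, that parallel loading is just the `num_workers` argument on a `DataLoader`. Here is a minimal sketch using a hypothetical in-memory dataset in place of real training data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for a real dataset: 1000 fake 32x32 RGB images.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

# num_workers spawns that many CPU worker processes to prepare batches in
# parallel, so the GPU isn't left idle waiting on data. pin_memory uses
# page-locked host memory for faster host-to-GPU copies.
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([64, 3, 32, 32])
```

A common rule of thumb is to set `num_workers` to somewhere near your CPU thread count and increase it until the GPU stays busy; more threads on the CPU directly raise that ceiling.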

CPUs are especially important if you plan on doing reinforcement learning, because most of the computation happens in the learning environment, which usually runs on the CPU. If you use large neural networks with reinforcement learning, then a GPU will still help speed up training.

If you only plan on doing deep learning, then make sure your CPU is compatible with however many GPUs you plan on having (it needs enough PCIe lanes to feed each card).


Build Your Own Deep Learning Machine — What you need to know