No matter what research project I’m working on, having the right infrastructure in place is critical to its success. The logic is simple: assuming all other “parameters” are equal, the faster and cheaper one can iterate, the better the results that can be achieved within the same time and cost budget. Since both are always limited, good infrastructure is often what makes or breaks a breakthrough.

When it comes to data science, and deep learning specifically, the ability to easily distribute the training process across strong hardware is key to achieving fast iterations. Even the brightest idea requires a few iterations to be polished and validated, e.g. to check different pre-processing options, network architectures, and standard hyperparameters such as batch size and learning rate.

In this post, I’d like to show you how easy (and cheap, if you want) it is to distribute existing distribution-ready PyTorch training code on AWS SageMaker using simple-sagemaker, assuming the distribution is built on top of the supported gloo or nccl backends. Moreover, you’ll see how to easily monitor and analyze resource utilization and key training metrics in real time, while the training is running.
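To make that assumption concrete, here is a minimal sketch of what “distribution-ready” means here: a training script that initializes torch.distributed with a gloo or nccl backend from the usual environment variables and wraps its model in DistributedDataParallel. The script name (train.py), the toy model, and the random dataset are hypothetical placeholders for illustration only; how the environment variables get set is up to the launcher.

```python
# train.py - a minimal, hypothetical sketch of a "distribution-ready" PyTorch script.
# It relies only on standard torch.distributed APIs (gloo/nccl backends) and the
# usual environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # Pick nccl when GPUs are available, otherwise fall back to gloo.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads RANK/WORLD_SIZE/MASTER_* from env
    rank, world_size = dist.get_rank(), dist.get_world_size()
    device = (torch.device("cuda", rank % torch.cuda.device_count())
              if torch.cuda.is_available() else torch.device("cpu"))

    # Toy data and model, just to keep the sketch self-contained.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)  # each rank trains on its own shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(10, 1).to(device))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # keep shuffling consistent across ranks
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()  # DDP averages gradients across ranks here
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch}: loss={loss.item():.4f} (world_size={world_size})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Any script following this pattern can be scaled from one process to many without code changes; the infrastructure only has to start the right number of processes and point them at each other.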

Simple-sagemaker

Simple-sagemaker is a thin wrapper around AWS SageMaker that makes distributing work on any supported instance type **very simple** and cheap. A quick introduction can be found in this blog post.
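As a rough sketch of what the “single line” could look like, the snippet below shells out to simple-sagemaker’s ssm CLI from Python. The project/task names, paths, and flag spellings here are assumptions based on the project’s introductory examples, not a verified command; check `ssm run --help` and the simple-sagemaker README for the exact options in your installed version.

```python
# A hypothetical sketch of launching the training task by shelling out to the
# `ssm` CLI of simple-sagemaker. Flag names are assumptions taken from the
# project's introductory examples; verify them with `ssm run --help`.
import subprocess

cmd = [
    "ssm", "run",
    "-p", "ssm-pt-example",        # project name (hypothetical)
    "-t", "pt-dist-task",          # task name (hypothetical)
    "-e", "train.py",              # entry point: the distribution-ready script above
    "-o", "./output",              # local directory for downloaded logs/outputs
    # Instance type/count options are assumptions as well; confirm the exact
    # spelling against the CLI help before running.
    "--instance_type", "ml.p3.2xlarge",
    "--instance_count", "2",
]
subprocess.run(cmd, check=True)    # equivalent to typing the single command line in a shell
```

The same command can of course be typed directly in a terminal; wrapping it in Python is only a convenience for keeping the launch configuration in code.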

