How to Increase Your Efficiency and Reduce Cost When Training in the Cloud

This blog post accompanies a talk I gave at AWS re:Invent 2020, in which I described some of the ways my team at Mobileye (officially, Mobileye, an Intel Company) uses Amazon SageMaker Debugger in its daily DNN development.

Monitoring the Learning Process

A critical part of training machine learning models, and particularly deep neural networks (DNNs), is monitoring the learning process (sometimes called babysitting the learning process). Monitoring the learning process refers to the art of tracking different metrics during training in order to evaluate how the training is proceeding and to determine which hyperparameters to tune to improve it.

On our team, we track a wide range of metrics, which can be broadly divided into three categories:

  • Training metrics are used to measure the rate at which the model training is converging. These include the losses and the distributions of the gradients and activations.
  • Prediction metrics measure the model’s ability to make predictions. Common metrics include model accuracy, precision, recall, etc. If you are working on a computer vision problem, then a visualization of your model’s prediction might also serve as a metric.
  • System utilization metrics measure the degree to which the training system resources are being utilized, draw attention to bottlenecks in the training pipeline, and indicate potential ways in which the training throughput can be accelerated.
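To make the second category concrete, here is a minimal sketch of how the prediction metrics named above (accuracy, precision, and recall) are computed for a binary classifier. The function name and the label lists are illustrative placeholders, not code or data from the post; in practice a library such as scikit-learn would typically provide these.

```python
# Hypothetical helper illustrating standard binary-classification metrics.

def prediction_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    # Guard against division by zero when there are no positive predictions
    # (precision) or no positive labels (recall).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Illustrative labels and predictions (placeholder data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, prec, rec = prediction_metrics(y_true, y_pred)
```

Tracking these values per epoch, alongside the training losses and system utilization, gives a fuller picture of the run than the loss curve alone.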
