Training machine learning and deep learning models can take a long time, and understanding what is happening while your model trains is crucial.

Typically, you can monitor the following (a minimal logging sketch follows the list):

  • Metrics and losses
  • Hardware resource consumption
  • Errors, warnings, and other log output (stderr and stdout)
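As a concrete, minimal illustration, here is a sketch of logging a loss value together with basic hardware resource usage. The `log_training_status` helper and its arguments are hypothetical, and `psutil` is assumed to be installed:

```python
import logging
import psutil

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("training")

def log_training_status(step, loss):
    """Hypothetical helper: log a metric together with hardware resource consumption."""
    cpu = psutil.cpu_percent()             # CPU utilization in percent
    ram = psutil.virtual_memory().percent  # RAM usage in percent
    logger.info("step=%d loss=%.4f cpu=%.1f%% ram=%.1f%%", step, loss, cpu, ram)

# Example call from inside a training loop:
log_training_status(step=100, loss=0.3421)
```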

Depending on the library or framework, this can be easier or harder, but it is almost always doable.

Most libraries allow you to monitor your model training in one of the following ways:

  • You can add a monitoring function at the end of the training loop (see the first sketch below).
  • You can add a monitoring callback on iteration (batch) or epoch end (see the second sketch below).
  • Some monitoring tools can hook into the training loop automatically by parsing logs or monkey patching.
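For the first approach, here is a minimal, self-contained sketch in PyTorch; the tiny model, the random data, and the `monitor` helper are all illustrative stand-ins for a real training setup:

```python
import torch
from torch import nn

# Dummy model and data just to make the loop runnable end to end.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
train_loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(5)]

def monitor(epoch, batch_idx, loss_value):
    # Replace the print with file logging or a call to an experiment tracker.
    print(f"epoch={epoch} batch={batch_idx} loss={loss_value:.4f}")

for epoch in range(2):
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        monitor(epoch, batch_idx, loss.item())  # monitoring call at the end of each step
```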

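For the second approach, a callback fired on epoch end can look roughly like this in Keras; the `MonitoringCallback` class and the toy model and data are illustrative, not part of any library:

```python
import numpy as np
from tensorflow import keras

class MonitoringCallback(keras.callbacks.Callback):
    # Hypothetical callback that reports all available metrics at the end of every epoch.
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        print(f"epoch={epoch} " + " ".join(f"{k}={v:.4f}" for k, v in logs.items()))

# Tiny model and random data, just so the example runs end to end.
model = keras.Sequential([keras.layers.Input(shape=(10,)), keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

x, y = np.random.rand(64, 10), np.random.rand(64, 1)
model.fit(x, y, epochs=2, batch_size=16, callbacks=[MonitoringCallback()], verbose=0)
```
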
Let me show how to monitor machine learning models in each case.

