Debugging in TensorFlow

How to Debug a TensorFlow Training Program Without Losing Your Mind

In some of my previous posts, I told you a bit about how my team at Mobileye (officially known as Mobileye, an Intel Company) uses TensorFlow, Amazon SageMaker, and Amazon S3 to train our deep neural networks on large quantities of data. In this post, I want to talk about debugging in TensorFlow.

It is well known that debugging is an integral part of software development, and that the time spent debugging often eclipses the time it takes to write the original program.

Debugging is hard, and much has been written about how to design and implement a program so that bugs are easier to reproduce and their root causes are easier to analyze.

In machine learning, the task of debugging is complicated by the stochasticity that is inherent to machine learning algorithms, and by the fact that the algorithms run on dedicated hardware accelerators, often on remote machines.

Debugging in TensorFlow is further complicated by the use of symbolic execution (a.k.a. graph mode), which boosts the runtime performance of the training session but, at the same time, limits the ability to freely read arbitrary tensors in the graph, a capability that is important for debugging.
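To make the graph mode limitation a bit more concrete, here is a minimal sketch, assuming TensorFlow 2.x, where graph mode is entered by wrapping a function in tf.function. The names scale, graph_scale, and observations are hypothetical and exist only for this illustration. The sketch records whether the Python body of the function runs eagerly, and shows that once the graph has been traced, the Python body (and any debugging code placed in it) is no longer executed.

```python
import tensorflow as tf

# Hypothetical helper list: records, on each run of the Python body,
# whether TensorFlow was executing eagerly at that moment.
observations = []

def scale(x):
    y = x * 2.0
    # In eager mode, y holds concrete values that can be read freely.
    # While tf.function traces the graph, y is only a symbolic tensor.
    observations.append(tf.executing_eagerly())
    return y

x = tf.constant([1.0, 2.0])

# Eager call: the Python body runs and y can be inspected directly.
eager_result = scale(x)

# Graph mode: the first call traces the Python body once to build the graph.
graph_scale = tf.function(scale)
graph_result = graph_scale(x)

# Second call with the same input signature reuses the traced graph,
# so the Python body (including the append above) does not run again.
graph_result = graph_scale(x)

print(observations)  # one entry for the eager call, one for the trace
```

The point of the sketch is only that ordinary Python inspection code inside a traced function sees symbolic tensors at trace time and is skipped on subsequent calls, which is why reading arbitrary tensors in graph mode requires other means.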

In this post, I will expand on the difficulties of debugging TensorFlow training programs, and provide some suggestions for how to address those difficulties.

For legal purposes, I want to clarify that despite my carefully chosen subtitle, I provide no guarantees that anything I write here will prevent you from losing your mind. On the contrary, I think that I can all but guarantee that you probably will lose your mind when debugging your TensorFlow program, despite anything I write. But, perhaps, you will lose your mind just a little bit less.

Before we begin, let's clarify the scope of our discussion.

