Neural networks are often treated as black-box predictors: the data scientist usually does not know which input feature influenced a prediction the most. This can be rather limiting if we want to understand what the model actually learned. That kind of understanding can help us find bugs or weaknesses in our learning algorithm or in our data-processing pipeline, and thus improve them.

The approach that we will implement in this project is called Integrated Gradients, introduced in the paper Axiomatic Attribution for Deep Networks (Sundararajan et al., 2017).

In this paper, the authors list some desirable axioms that a good attribution method should follow and prove that their method, **Integrated Gradients**, satisfies them. Some of those axioms are:

  • Sensitivity: if two samples differ in only one feature and the network produces different outputs for them, then the attribution of that feature should be non-zero. Conversely, if a feature does not influence the output at all, then its attribution should be zero.
  • Implementation Invariance: if two networks produce the same outputs for all inputs, then their attributions should be the same, regardless of how they are implemented.

The remaining axioms are discussed in detail in the paper cited above.
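
For reference, the definition from the paper: given an input $x$, a baseline input $x'$ and a network output $F$, the integrated gradient along the $i$-th feature is the path integral of the gradients along the straight line from $x'$ to $x$:

$$\mathrm{IG}_i(x) \;=\; (x_i - x'_i)\int_0^1 \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\,d\alpha$$

In practice, the integral is approximated with a Riemann sum: we evaluate the gradient at a few dozen points interpolated between $x'$ and $x$ and average them.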

Integrated Gradients is very easy to implement and use: it only requires the ability to compute the gradient of the network's output with respect to its inputs. This is straightforward in PyTorch, and we will detail how it can be done in what follows.
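
As a preview, here is a minimal sketch of that computation in PyTorch, using a straight-line path and a Riemann-sum approximation of the integral. The function name, the number of steps, the all-zeros default baseline, and the toy model in the usage example are illustrative choices, not taken from the original article:

```python
import torch

def integrated_gradients(model, x, baseline=None, target=0, steps=50):
    """Approximate integrated gradients for a single input tensor.

    model    -- callable returning a (batch, num_classes) tensor
    x        -- input tensor of shape (1, num_features)
    baseline -- reference input (all zeros by default, an illustrative choice)
    target   -- index of the output whose attribution we want
    steps    -- number of points in the Riemann approximation of the integral
    """
    if baseline is None:
        baseline = torch.zeros_like(x)

    # Points interpolated on the straight line between the baseline and the input.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    interpolated = baseline + alphas * (x - baseline)   # shape: (steps, num_features)
    interpolated.requires_grad_(True)

    # Gradient of the chosen output with respect to each interpolated point.
    outputs = model(interpolated)[:, target]
    grads = torch.autograd.grad(outputs.sum(), interpolated)[0]

    # Average the gradients and scale by the input difference (Riemann sum).
    avg_grads = grads.mean(dim=0, keepdim=True)
    return (x - baseline) * avg_grads


# Hypothetical usage with a small toy model.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 3))
x = torch.rand(1, 4)
attributions = integrated_gradients(model, x, target=1)
print(attributions)
```

The attributions have the same shape as the input, so each value can be read directly as the contribution of the corresponding feature to the chosen output.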

#explainable-ai #deep-learning #python #pytorch #attribution
