Weights and biases are the adjustable parameters of a neural network. During the training phase, they are updated by the gradient descent algorithm to minimize the network's cost function. However, they must be initialized before training can start, and this initialization step has an important effect on how the network trains. In this article, I will first explain the importance of weight initialization and then discuss the different methods that can be used for it.

**Notation**

Currently, Medium supports superscripts only for numbers and has no support for subscripts. So to write the names of the variables, I use this notation: every character after ^ is a superscript character, and every character after _ (and before ^, if it is present) is a subscript character. For example, a weight *w* with the subscript *ij* and the superscript *[l]* is written as *w_ij^[l]* in this notation.

Before we discuss the weight initialization methods, we briefly review the equations that govern feedforward neural networks. For a detailed discussion of these equations, you can refer to reference [1]. Suppose that you have a feedforward neural network as shown in Figure 1. Here

*x* = [*x_1*, *x_2*, . . ., *x_m*]^T          (Eq. 1)

is the network's *input vector*. Each *x_i* is an *input feature*. The network has *L* layers, and the number of neurons in layer *l* is *n^[l]*. The input layer is considered as layer zero, so the number of input features is *m = n^[0]*. The output or *activation* of neuron *i* in layer *l* is *a_i^[l]*.

Figure 1 (Image by Author)

The weights for neuron *i* in layer *l* can be represented by the vector

*w_i^[l]* = [*w_i1^[l]*, *w_i2^[l]*, . . ., *w_im^[l]*]^T,  with *m = n^[l-1]*          (Eq. 2)

where *w_ij^[l]* represents the weight for the input *j* (coming from neuron *j* in layer *l-1*) going into neuron *i* in layer *l* (Figure 2).

Figure 2 (Image by Author)

In layer *l*, each neuron receives the output of all the neurons in the previous layer, each multiplied by its corresponding weight, *w_i1^[l]*, *w_i2^[l]*, . . ., *w_im^[l]*. The weighted inputs are summed together, and a constant value called the *bias* (*b_i^[l]*) is added to them to produce the net input of the neuron

*z_i^[l]* = ∑_j *w_ij^[l]* *a_j^[l-1]* + *b_i^[l]*          (Eq. 3)
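As a quick sketch of this weighted sum in code (the variable names and example values here are my own, not from the article), the net input of a single neuron can be computed with NumPy:

```python
import numpy as np

# Hypothetical example: one neuron in layer l receiving 3 inputs.
weights = np.array([0.2, -0.5, 0.1])      # w_i1, w_i2, w_i3
activations = np.array([1.0, 2.0, 3.0])   # outputs of the neurons in layer l-1
bias = 0.4                                # b_i

# Net input: the weighted sum of the previous layer's outputs plus the bias.
z = np.dot(weights, activations) + bias
print(z)  # ≈ -0.1
```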

The net input of the neurons in layer *l* can be represented by the vector

*z^[l]* = [*z_1^[l]*, *z_2^[l]*, . . ., *z_k^[l]*]^T,  with *k = n^[l]*          (Eq. 4)

Similarly, the activation of the neurons in layer *l* can be represented by the activation vector

*a^[l]* = [*a_1^[l]*, *a_2^[l]*, . . ., *a_k^[l]*]^T,  with *k = n^[l]*          (Eq. 5)

So Eq. 3 can be written as

*z_i^[l]* = *w_i^[l]* · *a^[l-1]* + *b_i^[l]*          (Eq. 6)

where the summation has been replaced by the inner product of the weight and activation vectors. The net input is then passed through the activation function *g* to produce the output or activation of neuron *i*

*a_i^[l]* = *g*(*z_i^[l]*) = *g*(*w_i^[l]* · *a^[l-1]* + *b_i^[l]*)          (Eq. 7)
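To make the inner-product form of the net input and its activation concrete, here is a small sketch (the variable names and the choice of a sigmoid for *g* are my own illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    """A common choice for the activation function g."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight vector and previous-layer activations for one neuron.
w_i = np.array([0.5, -0.25])   # w_i^[l]
a_prev = np.array([0.8, 0.4])  # a^[l-1]
b_i = 0.1                      # b_i^[l]

z_i = np.dot(w_i, a_prev) + b_i  # net input: inner product plus bias
a_i = sigmoid(z_i)               # activation of neuron i in layer l
print(z_i, a_i)
```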

We usually assume that the input layer is layer zero, so

*a^[0]* = *x*          (Eq. 8)

So for the first layer, Eq. 7 is written as

*a_i^[1]* = *g*(*w_i^[1]* · *x* + *b_i^[1]*)          (Eq. 9)

We can combine all the weights of a layer into a weight matrix for that layer, whose (*i*, *j*) element is the weight *w_ij^[l]*:

(*W^[l]*)*_ij* = *w_ij^[l]*          (Eq. 10)

So *W^[l]* is an *n^[l] × n^[l-1]* matrix, and the (*i*, *j*) element of this matrix gives the weight of the connection that goes from neuron *j* in layer *l-1* to neuron *i* in layer *l*. We can also have a bias vector for each layer

*b^[l]* = [*b_1^[l]*, *b_2^[l]*, . . ., *b_k^[l]*]^T,  with *k = n^[l]*          (Eq. 11)
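Putting the weight matrices and bias vectors together, a full forward pass can be sketched as follows. The layer sizes, the random initialization, and the `tanh` activation below are illustrative assumptions of mine, not the article's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: n^[0]=3 input features, n^[1]=4, n^[2]=2.
layer_sizes = [3, 4, 2]

# W^[l] is an n^[l] x n^[l-1] matrix; b^[l] has n^[l] entries.
W = [rng.standard_normal((n, m)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
b = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    a = x  # a^[0] = x
    for W_l, b_l in zip(W, b):
        z = W_l @ a + b_l   # net input of the whole layer at once
        a = np.tanh(z)      # elementwise activation g
    return a

x = np.array([1.0, -1.0, 0.5])  # input vector
out = forward(x)
print(out.shape)  # one activation per neuron in the last layer
```

Writing the layer as a matrix-vector product computes the net inputs of all *n^[l]* neurons in one step instead of looping over them.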

