If your model is not performing at the level you believe it should, you’ve come to the right place! This reference article details common issues in model building and their solutions.


Neural Networks

Problem: Your neural network performs very well in training but poorly on validation (testing) sets.

Issue: Your neural network is probably overfitting to the data. This means that instead of actually searching for deeper connections within the training examples, it is ‘taking the easy way out’ and simply memorizing all of the data points, which is possible given its large architecture.

Solutions:

  • Simplify the model architecture. When your neural network has too many layers and nodes, it has the capacity to memorize the data instead of learning generalized patterns. By reducing the storage capacity of a neural network, you take away its ability to ‘cheat’ its way to high performance.
  • Early stopping is a form of regularization: you halt training at the point where the validation error is smallest, before the network begins to memorize the training set (see the early-stopping sketch after this list).
  • Data augmentation, which applies to images, is a good way to drastically increase the dataset size and hence make it impossible for the neural network to memorize everything. It also helps you get the most out of each image, but it’s important to be careful with how augmentations are performed. For instance, if you allow vertical flips as an augmentation on the MNIST digits dataset, the model will have difficulty differentiating 6 and 9, and the augmentation will do more harm than good (see the augmentation sketch after this list).
  • Use regularization, which aims to reduce the complexity of the model by penalizing large weights. L1 regularization penalizes weights by their absolute value, whereas L2 regularization penalizes them by their square. L2 therefore puts disproportionately heavy penalties on large weights: shrinking a weight from 5 to 4 saves 25 - 16 = 9 in penalty, but shrinking it from 1 to 0 saves only 1 - 0 = 1. With no strong incentive to close that last gap, L2 regularization tends to yield coefficients that are very close to zero but not exactly zero. L1, on the other hand, rewards every unit of decrease equally (a drop from 1000 to 999 saves as much penalty as a drop from 1 to 0), so it keeps pushing coefficients all the way to exactly zero whenever it is profitable to do so. Both penalties are sketched in code after this list.
  • Generally speaking, L2 regularization may be better for more complex tasks and L1 for simpler ones, but which to use ultimately depends on the nature of the task at hand.
  • Adding dropout as a layer can help reduce a model’s ability to simply memorize information and hence overfit. The dropout layer takes the outputs of the previous layer and randomly blocks a prespecified fraction of them on each training pass, forcing the network to adapt. With this handicap in place, the reasoning goes, the network must learn to select and compress the most important information into each node, in anticipation that some of it is bound to be blocked (see the dropout sketch after this list).

Figure: an illustration of dropout. Source: the Dropout paper (Srivastava et al., 2014). Image free to share.

  • Try using a dimensionality reduction algorithm such as PCA (or LDA, which is preferred when class labels are available, since it is supervised) to reduce the dimensionality of the data. These methods collapse highly correlated variables and features that add unnecessary noise, making it easier for the neural network to identify underlying patterns (see the PCA sketch below).
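
To make these concrete, here is a minimal early-stopping sketch, assuming a Keras/TensorFlow setup; `model`, `x_train`, and `y_train` are placeholder names, not part of any prescribed pipeline:

```python
# Hypothetical setup: `model` is a compiled Keras model and
# x_train / y_train are NumPy arrays.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",         # watch the validation loss
    patience=5,                 # tolerate 5 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch on stop
)

model.fit(
    x_train, y_train,
    validation_split=0.2,  # hold out 20% of the data for validation
    epochs=100,            # an upper bound; training may stop far sooner
    callbacks=[early_stop],
)
```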
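
Next, a sketch of data augmentation using Keras’ `ImageDataGenerator` (one of several possible tools; torchvision transforms work similarly), restricted to transformations that are safe for digit images:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Mild, label-preserving augmentations only. Flips are deliberately
# omitted: on MNIST they would turn a 6 into a 9.
datagen = ImageDataGenerator(
    rotation_range=10,       # rotate up to +/- 10 degrees
    width_shift_range=0.1,   # shift up to 10% horizontally
    height_shift_range=0.1,  # shift up to 10% vertically
)

# Streams a fresh, randomly augmented batch every step; assumes x_train
# has shape (n, 28, 28, 1) and y_train is one-hot encoded.
model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=20)
```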
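
For L1/L2 regularization, Keras attaches the penalty per layer; the 0.01 strength below is an arbitrary example value, not a recommendation:

```python
from tensorflow.keras import layers, regularizers

# L2: penalty proportional to sum(w**2) -- shrinks weights toward zero
# but rarely all the way to exactly zero.
dense_l2 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.01))

# L1: penalty proportional to sum(|w|) -- happily drives weights to
# exactly zero, producing sparse solutions.
dense_l1 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.01))
```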
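
Dropout is just another layer in the stack. A sketch, with an illustrative (not tuned) architecture and rates:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dropout(0.5),  # randomly zero out half of the activations
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),  # a lighter rate deeper in the network
    layers.Dense(10, activation="softmax"),
])
# Dropout is active only during training; Keras disables it
# automatically at inference time.
```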
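
Finally, a dimensionality-reduction sketch with scikit-learn’s PCA; `x_train` and `x_val` are placeholder arrays of flattened features:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
x_train_reduced = pca.fit_transform(x_train)  # fit on training data only...
x_val_reduced = pca.transform(x_val)          # ...then apply the same projection
```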

Problem: Your neural network seems to be training normally, but for some reason cannot reach a performance you know is possible, for example on a task that others have replicated with much higher success. You see that the performance of the neural network improves for the first few epochs, but plateaus much earlier than it should.

Issue: You are probably experiencing the Vanishing Gradient Problem. This happens when your neural network is so deep that the backpropagation signal that updates the weights gradually diminishes in strength the farther back it travels, such that the layers closest to the input are barely updated at all. This leaves much of the network’s capacity unused, and the tiny gradients make the error landscape look very flat to the optimizer, which then has no idea which direction will yield an improvement.

Solutions:

  • The Vanishing Gradient Problem usually appears when the sigmoid or a sigmoid-like activation function is used abundantly throughout the network. These functions’ derivatives are bell-shaped, peaking at just 0.25 for the sigmoid, so any input distribution that is not centered near 0 yields very small derivatives; and because the input distributions shift around wildly at the beginning of training, they never settle where useful information can propagate. Since backpropagation multiplies one of these derivatives per layer, the calculated gradients shrink toward zero and give the optimizer no helpful information.
  • By using an unbounded activation function like ReLU, the Vanishing Gradient Problem largely disappears, since its derivative is a constant 1 for every positive input rather than a value that decays toward zero. (Variants such as Leaky ReLU and PReLU go further; PReLU even adds a trainable slope parameter that adapts to fluctuations in the data.) A quick numeric comparison follows below. You can read more about why ReLU has become the revered activation function it is known as today here.
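
This sketch, in plain NumPy, compares the two gradients; the 20-layer figure is just an illustrative depth:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)  # bell-shaped: peaks at 0.25, decays for large |x|

def relu_grad(x):
    return (x > 0).astype(float)  # a constant 1 for every positive input

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(x))  # [~4.5e-05, 0.105, 0.25, 0.105, ~4.5e-05]
print(relu_grad(x))     # [0. 0. 0. 1. 1.]

# Backprop multiplies one such factor per layer. Even in the sigmoid's
# best case, 20 layers give 0.25**20 ~= 9.1e-13 -- the signal is gone.
print(0.25 ** 20)
```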
