Ask a beginner why ML is so difficult and you will most likely get an answer along the lines of ‘the math behind it is really complicated’ or ‘I don’t fully understand what all those layers do’. While that is partly true, and interpreting ML models is certainly a muddy subject, the truth is that ML is difficult because, more often than not, the data we have cannot live up to the complexity of our models. This is a very common issue in practice, and since your models are only as good as your data, I have gathered some of the most relevant guidelines for when you face a shortage of data.

How much data do you actually need?

This is something everyone working with data has wondered at some point. Unfortunately, there is no fixed set of rules that will give you a direct answer; you can only resort to guidelines and experience.
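One practical guideline is to plot a learning curve: train on increasing fractions of the data and watch how validation performance evolves. The sketch below uses scikit-learn with an illustrative dataset and model (neither is from the article):

```python
# Sketch: empirically gauge whether more data would help, via a learning curve.
# load_digits and LogisticRegression are illustrative choices, not the
# article's own setup.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Evaluate the model at 10%, 32.5%, ..., 100% of the training fold.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} samples -> mean CV accuracy {score:.3f}")
```

If the curve is still rising at its right edge, collecting more data is likely to help; if it has flattened out, your bottleneck is probably elsewhere.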

First of all, you should consider how complex your problem is. Predicting customer behaviour is not the same as differentiating between cats and dogs. After all, some people may be unreadable even to a fellow human, but you certainly wouldn’t find yourself sweating over telling a dog apart from a cat.

Besides, your choice of algorithm also determines the adequate size of your data set. More complex models, such as deep neural networks, can capture much greater detail than their linear counterparts, at the expense of larger data requirements.
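This trade-off is easy to see on a deliberately small training set, where a flexible model often cannot exploit its extra capacity. A minimal sketch, with the dataset, models, and the 100-sample budget all being illustrative assumptions rather than the article's:

```python
# Sketch: compare a linear model with a small neural network when only
# 100 labelled samples are available. All choices here are illustrative.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# Keep only 100 training samples to simulate data scarcity.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=100, random_state=0, stratify=y
)

linear = LogisticRegression(max_iter=2000).fit(X_train, y_train)
mlp = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=2000,
                    random_state=0).fit(X_train, y_train)

print(f"linear model test accuracy: {linear.score(X_test, y_test):.3f}")
print(f"neural net test accuracy:   {mlp.score(X_test, y_test):.3f}")
```

With so few samples, the simpler model often matches or beats the network; the gap only opens up in the network's favour as the training set grows.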

If these guidelines fall short (which could perfectly well be the case), then, just as you would turn to the existing literature for inspiration when choosing one model or another, take a look at what input data was used. After all, not many folks out there are reinventing the wheel, so there is a good chance someone has faced a similar problem.

#machine-learning #data-augmentation #transfer-learning #tensorflow #keras

So you think you don’t have enough data to do Machine Learning