Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power. It is a lever you can almost always pull: collecting more data generally yields more predictive power. For sophisticated models such as gradient boosted trees and random forests, quality data and careful feature engineering reduce errors drastically.
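To make that relationship concrete, here is a minimal sketch, assuming scikit-learn is available, of a learning curve: the same model scored at growing training set sizes, where validation accuracy typically climbs as more data is added. The synthetic dataset and the specific sizes are illustrative assumptions, not from this article.

```python
# A minimal sketch (assuming scikit-learn) of how predictive power
# tends to grow with training set size, via a learning curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Synthetic classification data stands in for real business data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Score the model at five evenly spaced fractions (10% to 100%) of the data.
sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Mean cross-validated accuracy generally rises with the sample count.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} training samples -> mean CV accuracy {score:.3f}")
```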

But simply having more data is not always useful, and the saying that every business needs a lot of data is a myth. What large amounts of data actually buy you is that simple models become much more powerful: if you have 1 trillion data points, outliers are easy to classify and the underlying distribution of the data is clear. If you have 10 data points, neither is true, and you’ll have to perform more sophisticated normalization and transformation routines on the data before it is useful, as in the sketch below.
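As a rough illustration of that small-data caveat, here is a hedged sketch, again assuming scikit-learn, of the kind of normalization and transformation pipeline a tiny, skewed sample needs before a simple model can use it. The ten-point dataset is fabricated purely for illustration.

```python
# A minimal sketch of preprocessing a tiny, skewed sample before modeling.
# The data below is synthetic and illustrative, not from the article.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Ten heavily skewed observations: far too few to reveal the true distribution.
X_small = rng.lognormal(mean=0.0, sigma=1.5, size=(10, 3))
y_small = np.array([0, 1, 0, 1, 1, 0, 1, 0, 0, 1])

# PowerTransformer applies a Yeo-Johnson transform to tame the skew and
# standardizes the result, so the simple model downstream gets usable input.
model = make_pipeline(PowerTransformer(), LogisticRegression())
model.fit(X_small, y_small)
print(model.predict(X_small))
```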

The big data paradigm is the assumption that big data can substitute for conventional data collection and analysis. In other words, it’s the belief (and overconfidence) that huge amounts of data are the answer to everything and that we can simply train machines to solve problems automatically. Data by itself is not a panacea, and we cannot ignore traditional analysis.

#opinions #data #data-science

Is More Data Always Better For Building Analytics Models?