5 Things I’ve Learned from High-Impact Machine Learning Projects. The majority of machine learning projects fail. How can you ensure the success of your high-impact project?

I’ve been working on computer vision and machine learning projects for about fifteen years now — most recently on pathology and remote sensing applications. Here are just a few of the things I’ve learned.

1. Data matters — and it can be messy

Machine learning depends on training data. In the case of supervised learning, that means both the data and an associated set of labels, which the model learns to predict on novel examples.

The number of training examples is a critical factor in being able to train a good model. When too few training examples are available to train a complex model, the model simply overfits: it fits the training examples closely but does not generalize to unseen data. But with medical imaging applications, a few hundred images are often all we get. A couple thousand images might be considered a large dataset. This makes training a good model challenging and may require specialized techniques.
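A quick way to see the overfitting described above is to compare accuracy on the training set with accuracy on held-out data. The sketch below is a toy illustration, not anything from my projects: the synthetic 1-D task, the 20% label noise, and the 1-nearest-neighbor "model" (a stand-in for any model flexible enough to memorize its training set) are all assumptions chosen to keep the example small.

```python
import random

def make_data(n, rng, noise=0.2):
    """Synthetic task (assumed): label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # simulate a mislabeled example
        data.append((x, y))
    return data

def predict_1nn(train, x):
    """Return the label of the closest training point (a model that memorizes)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, data):
    """Fraction of examples in `data` that the 1-NN model built on `train` gets right."""
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(30, rng)   # a small training set
test = make_data(500, rng)   # held-out examples the model never saw

train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The model scores perfectly on the examples it memorized, noise included, and noticeably worse on the held-out set. That gap is the signal to watch for when the dataset is small.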

But quantity isn’t the only factor. I’m currently working on a project to predict power plant emissions from satellite images, and the quality of our ground truth data really matters. We need to be sure that the geolocation of each power plant is correct, that this location maps correctly to a database describing the type of fuel the plant burns, and that it also maps to a separate database that provides a time series of emissions readings. If any of these mappings is incorrect, then garbage in translates into garbage out.
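Some of these mapping errors can be caught automatically before any training happens. The following is a minimal sketch of such a consistency check, not our actual pipeline: the function name, the record layout (plant ID to coordinates, plant ID to fuel type, plant ID to a list of readings), and the example records are all hypothetical.

```python
def find_mapping_problems(plants, fuel_types, emissions):
    """Return (plant_id, problem) pairs for records that fail basic consistency checks.

    All three arguments are hypothetical dicts keyed by plant ID:
    plants -> (lat, lon), fuel_types -> fuel name, emissions -> list of readings.
    """
    problems = []
    for plant_id, (lat, lon) in plants.items():
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append((plant_id, "coordinates out of range"))
        if plant_id not in fuel_types:
            problems.append((plant_id, "no fuel-type record"))
        if not emissions.get(plant_id):
            problems.append((plant_id, "no emissions readings"))
    # Records in the other databases that point at no known plant.
    for plant_id in sorted((set(fuel_types) | set(emissions)) - set(plants)):
        problems.append((plant_id, "not in plant list"))
    return problems

# Made-up example records, for illustration only.
plants = {"A": (51.5, -0.1), "B": (95.0, 10.0), "C": (40.7, -74.0)}
fuel_types = {"A": "coal", "B": "gas", "D": "oil"}
emissions = {"A": [1.2, 1.3], "B": [], "C": [0.8]}

problems = find_mapping_problems(plants, fuel_types, emissions)
for plant_id, problem in problems:
    print(plant_id, problem)
```

Checks like these only catch missing or implausible records. A plant whose coordinates are plausible but wrong will pass, so some manual review of the ground truth is usually still needed.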

Quantity and quality of data are both critical to a successful machine learning solution. And, in many cases, neither is easy to achieve.

