An example of a false positive caused by missing ground truth on the Open Images dataset
As the performance of deep learning models trained on massive datasets continues to advance, large-scale dataset competitions have become the proving ground for the latest and greatest computer vision models. We’ve come a long way as a community from the days when MNIST, a dataset of only 70,000 28x28-pixel images, was the de facto standard. New, larger datasets have arisen out of a desire to train more complex models to solve more challenging tasks: ImageNet, COCO, and Google’s Open Images are among the most popular.
But even on these huge datasets, the performance gap between top models is narrowing. In the 2019 Open Images Detection Challenge, the top five teams were separated by less than 0.06 in mean average precision (mAP); on COCO, the margin is even smaller.
There’s no doubt that our research community is delivering when it comes to developing innovative new techniques to improve model performance, but the model is only half of the picture. Recent findings have made it increasingly clear that the other half, the data, plays at least as critical a role, perhaps an even greater one.
Just this year…
And here’s what two leaders in the field have to say about it:
How many times have you found yourself spending hours, days, or weeks poring over samples in your data? Have you been surprised by how much manual inspection was necessary? Or can you think of a time when you trusted macro statistics more than you should have?
The computer vision community is starting to wake up to the idea that we need to be close to the data. If we want accurate models that behave as expected, it’s not enough to have a large dataset; it needs to contain the right data, and that data needs to be accurately labeled.
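Getting close to the data can be as simple as scrolling through samples and their labels by eye. Here’s a minimal sketch of doing that for Open Images with the open-source FiftyOne library; the zoo dataset name and parameters follow FiftyOne’s dataset zoo conventions, so adjust them for your own environment:

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Download a small slice of the Open Images validation split
# with detection labels only
dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="validation",
    label_types=["detections"],
    max_samples=100,
)

# Launch the FiftyOne App to scroll through the samples and
# inspect the bounding boxes on each image directly
session = fo.launch_app(dataset)
```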
Every year, researchers battle it out to climb to the top of a leaderboard, with razor-thin margins determining fates. But do we really know what’s going on inside these datasets? Is a 0.01 margin in mAP even meaningful?
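To make that concrete, here’s a minimal sketch, in plain Python with hypothetical detections, of how a single missing ground-truth box can move AP far more than the margins separating leaderboard entries. The `average_precision` helper below uses a simplified, uninterpolated form of AP rather than any challenge’s exact protocol:

```python
def average_precision(tp_flags, num_gt):
    """Simplified, uninterpolated AP from detections sorted by confidence.

    tp_flags: list of bools, True if the detection matched a ground-truth box
    num_gt:   number of ground-truth boxes for the class
    """
    tps, fps, points = 0, 0, []
    for is_tp in tp_flags:
        tps += is_tp
        fps += not is_tp
        points.append((tps / num_gt, tps / (tps + fps)))  # (recall, precision)

    # Accumulate area under the precision-recall points
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# Five detections, sorted by confidence, against five ground-truth
# boxes: all five detections are correct
print(average_precision([True] * 5, num_gt=5))  # 1.0

# Now suppose the annotators missed one of those five objects. The same
# correct detection is now scored as a false positive, and AP for this
# class drops from 1.0 to about 0.89
print(average_precision([True, True, False, True, True], num_gt=4))
```

In this toy example, one missing annotation costs roughly 0.11 AP on a single class, nearly twice the 0.06 that separated the top five teams in the 2019 challenge.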
#open-images #fiftyone #visualization #evaluation #machine-learning