Last year, Facebook AI introduced Dynabench, a platform for dynamic data collection and benchmarking that uses humans and NLP models to create challenging test datasets. The humans are tasked with finding adversarial examples that fool current state-of-the-art models.


Facebook has recently updated Dynabench with Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic NLP model comparison. With Dynaboard, you can perform apples-to-apples comparisons dynamically without problems from bugs in evaluation code, backward compatibility, inconsistencies in filtering test data, accessibility, and other reproducibility issues.

Facebook is looking to push the industry towards a more rigorous, real-world evaluation of NLP models. Dynaboard enables researchers to customise a new ‘Dynascore’ metric based on accuracy, memory, compute, robustness and fairness.

Why Dynaboard?

Every person who uses leaderboards has a different set of preferences and goals. Dynascore evaluates model performance in a nuanced, comprehensive way that can reflect those preferences.

For instance, even a 10x more accurate NLP model may be useless to an embedded systems engineer if it’s untenably large and slow. At the same time, a very fast, accurate model shouldn’t be considered high-performing if it doesn’t work smoothly for everyone. “AI researchers need to be able to make informed decisions about the tradeoffs of using a particular model,” said Facebook.

So far, benchmarks such as MNIST, ImageNet, SQuAD, SNLI and GLUE have played a crucial role in driving progress in AI research. But benchmarks are now being replaced rapidly. Every time a new one is introduced, researchers chase it instead of solving a persistent problem, which, in a way, hinders the progress of research.

In the last few years, benchmarks have been saturating rapidly, especially in NLP. For instance, it took the research community 18 years to achieve human-level performance on MNIST and about six years to surpass humans on ImageNet. In contrast, it took about a year to beat humans on the GLUE benchmark for language understanding.

“Our journey is just getting started,” said Facebook, adding that since the launch of Dynabench, it has collected over 400,000 examples and released two new, challenging datasets. “Now, we have adversarial benchmarks for all four of our initial official tasks within Dynabench, which initially focus on language understanding.”

As part of the initial experiment, Facebook has used Dynaboard to rank current SOTA NLP models, including BERT, RoBERTa, ALBERT, T5, and DeBERTa, on the four core Dynabench tasks.

How does it work?

Dynascore allows researchers to tailor an evaluation by placing greater or less emphasis on a collection of tests.
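To make the idea of adjustable emphasis concrete, here is a minimal sketch in Python. It is not the actual Dynascore formula, which the article does not spell out. It assumes every metric has already been normalised to the range 0 to 1 with higher meaning better, and it simply takes a weighted average. The function names, model names and numbers are all hypothetical.

```python
from typing import Dict, List

# Axis names follow the article's list of metrics.
AXES = ("accuracy", "compute", "memory", "robustness", "fairness")


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of metrics assumed to be normalised to [0, 1]."""
    total = sum(weights[axis] for axis in AXES)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[axis] * metrics[axis] for axis in AXES) / total


def rank(models: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Return model names ordered from best to worst under the given weights."""
    return sorted(models, key=lambda name: weighted_score(models[name], weights), reverse=True)


# Hypothetical, made-up numbers for illustration only.
models = {
    "large_model": {"accuracy": 0.9, "compute": 0.2, "memory": 0.2,
                    "robustness": 0.8, "fairness": 0.8},
    "small_model": {"accuracy": 0.8, "compute": 0.9, "memory": 0.9,
                    "robustness": 0.7, "fairness": 0.8},
}

# One user cares mostly about accuracy; another weighs every axis equally.
accuracy_heavy = {"accuracy": 8.0, "compute": 0.5, "memory": 0.5,
                  "robustness": 0.5, "fairness": 0.5}
equal = {axis: 1.0 for axis in AXES}

print(rank(models, accuracy_heavy))
print(rank(models, equal))
```

With these made-up numbers, the accuracy-heavy weighting puts the larger model first, while equal weighting favours the smaller, cheaper one. That is the kind of trade-off the embedded systems engineer in the earlier example would care about.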

As models are evaluated, Dynabench tracks which examples fool them and lead to incorrect predictions across the four core tasks: natural language inference, question answering, hate speech detection and sentiment analysis.

Facebook said these examples are used to further improve the systems and become part of more challenging datasets that train new models, which can in turn be benchmarked, creating a virtuous cycle of research progress. In practice, crowdsourced annotators connect to Dynabench and receive feedback on a model’s response. If annotators disagree with the original label, the example is discarded from the test set.
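The collection loop described above can be sketched roughly as follows. This is an illustration, not Dynabench's code: the `Example` record, the rule that a strict majority of validators must agree with the original label, and the choice to keep only model-fooling examples are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    text: str
    label: str                   # label given by the annotator who wrote the example
    model_prediction: str        # what the model in the loop predicted
    validator_labels: List[str]  # labels from annotators who checked the example


def fooled_model(ex: Example) -> bool:
    """The model was fooled if its prediction differs from the annotator's label."""
    return ex.model_prediction != ex.label


def validated(ex: Example) -> bool:
    """Assumed rule: a strict majority of validators must agree with the label."""
    agree = sum(1 for v in ex.validator_labels if v == ex.label)
    return agree * 2 > len(ex.validator_labels)


def build_test_set(examples: List[Example]) -> List[Example]:
    """Keep examples that fooled the model and survived validation."""
    return [ex for ex in examples if fooled_model(ex) and validated(ex)]


# Hypothetical sentiment examples.
candidates = [
    Example("Not bad at all.", "positive", "negative", ["positive", "positive", "negative"]),
    Example("What a film.", "negative", "positive", ["positive", "positive", "negative"]),
    Example("I loved it.", "positive", "positive", ["positive", "positive", "positive"]),
]

kept = build_test_set(candidates)
print([ex.text for ex in kept])
```

Only the first example is kept here. The second is discarded because the validators disagree with its label, and the third is dropped because it did not fool the model.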


Facebook Launches Evaluation-As-A-Service Framework For ML Models