Yes, but: In recent years, studies have shown that these datasets can contain serious flaws. ImageNet, for example, contains racist and sexist labels as well as photos of people’s faces obtained without consent. The latest study examines another problem: many of the labels are simply wrong. A mushroom is labeled a spoon, a frog is labeled a cat, and a high note from Ariana Grande is labeled a whistle. The ImageNet test set has an estimated label error rate of 5.8%. Meanwhile, the test set for QuickDraw, a compilation of hand drawings, has an estimated error rate of 10.1%.
How was it measured? Each of the 10 datasets used to evaluate models has a corresponding dataset used to train them. MIT graduate students Curtis G. Northcutt and Anish Athalye and alum Jonas Mueller used the training datasets to develop a machine-learning model, then used it to predict the labels in the test data. If the model disagreed with the original label, the data point was flagged for manual review. Five human reviewers on Amazon Mechanical Turk were asked to vote on which label, the model’s or the original, they thought was correct. If a majority of the reviewers agreed with the model, the original label was counted as an error and then corrected.
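The flag-and-vote procedure described above can be sketched roughly as follows. This is a minimal illustration of the steps as the article summarizes them, not the researchers' actual code: the function names, the toy labels, and the reviewer votes are all hypothetical, and the model's predictions are simply given as a list rather than produced by a trained model.

```python
def flag_disagreements(given_labels, predicted_labels):
    """Return the indices where the model's prediction differs from the given label."""
    return [
        i
        for i, (given, predicted) in enumerate(zip(given_labels, predicted_labels))
        if given != predicted
    ]


def majority_sides_with_model(votes, predicted_label):
    """True if a strict majority of reviewer votes picked the model's label."""
    model_votes = sum(1 for vote in votes if vote == predicted_label)
    return model_votes > len(votes) / 2


def correct_labels(given_labels, predicted_labels, reviewer_votes):
    """Apply the review step to every flagged example.

    reviewer_votes maps a flagged index to the list of labels the reviewers chose.
    Returns the corrected label list and the indices counted as label errors.
    """
    corrected = list(given_labels)
    error_indices = []
    for i in flag_disagreements(given_labels, predicted_labels):
        if majority_sides_with_model(reviewer_votes[i], predicted_labels[i]):
            corrected[i] = predicted_labels[i]
            error_indices.append(i)
    return corrected, error_indices


# Toy data (hypothetical): four test examples, two of them flagged.
given = ["spoon", "cat", "dog", "whistle"]
predicted = ["mushroom", "frog", "dog", "whistle"]
votes = {
    0: ["mushroom", "mushroom", "mushroom", "mushroom", "spoon"],  # 4 of 5 side with the model
    1: ["frog", "frog", "cat", "cat", "cat"],                      # only 2 of 5 side with the model
}

corrected, errors = correct_labels(given, predicted, votes)
error_rate = len(errors) / len(given)
print(corrected, errors, error_rate)
```

In this toy run only the first example is counted as an error, because the second flagged example did not win a majority of reviewers. Requiring human agreement before changing a label keeps the model's own mistakes from being written back into the test set.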
Is it important? Yes. The researchers looked at 34 models whose performance had previously been measured against the ImageNet test set. They then re-measured each model against the roughly 1,500 examples where the labels turned out to be wrong. They found that the models that had not performed so well on the original, incorrect labels were among the best performers once the labels were corrected. In particular, the simpler models seemed to fare better on the corrected data than the more complicated models used by tech giants like Google for image recognition and assumed to be the best in the business. In other words, faulty test data may be giving us an inflated sense of how good these complicated models are.
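To see how a ranking can flip when labels are corrected, consider the small sketch below. The model names, predictions, and labels are invented purely for illustration and are not results from the study; the point is only that a model which reproduces the mislabeled examples scores well against the noisy labels and poorly against the corrected ones.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


def rank_models(model_predictions, labels):
    """Return model names sorted from most to least accurate on the given labels."""
    return sorted(
        model_predictions,
        key=lambda name: accuracy(model_predictions[name], labels),
        reverse=True,
    )


# Hypothetical test labels before and after correction.
original_labels = ["spoon", "cat", "dog", "dog"]
corrected_labels = ["mushroom", "frog", "dog", "dog"]

# Hypothetical predictions from two models.
model_predictions = {
    "complex_model": ["spoon", "cat", "dog", "dog"],     # matches the noisy labels exactly
    "simple_model": ["mushroom", "frog", "dog", "cat"],  # right on the mislabeled examples
}

ranking_on_original = rank_models(model_predictions, original_labels)
ranking_on_corrected = rank_models(model_predictions, corrected_labels)
print(ranking_on_original)
print(ranking_on_corrected)
```

Here the hypothetical complex model ranks first on the original labels and second on the corrected ones, which is the kind of reversal the researchers report.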
Now what? Northcutt encourages the AI field to create cleaner datasets for evaluating models and tracking the field’s progress. He also recommends that researchers practice better data hygiene when working with their own data. “If you have a noisy dataset and a bunch of models that you try out and are going to deploy them in the real world,” he says, you could end up selecting the wrong model unless you clean up the test data first. To this end, he has open-sourced the code he used in the study to correct label errors, which he says is already in use at a few major tech companies.