Deborah Raji, a member of the nonprofit Mozilla, and Genevieve Fried, who advises members of the US Congress on algorithmic liability, examined more than 130 facial recognition datasets compiled over 43 years. They found that researchers, driven by deep learning's exploding need for data, gradually abandoned the practice of asking for people's consent. This has led to more and more personal photos being incorporated into surveillance systems without their subjects' knowledge.
It has also led to much messier datasets: they may unintentionally include photos of minors, use racist and sexist labels, or have inconsistent quality and lighting. This trend could help explain the growing number of cases in which facial recognition systems have failed with troubling consequences, such as the false arrests of two Black men in the Detroit area last year.
People were extremely careful about collecting, documenting, and verifying face data at first, Raji says. “Now we don’t care anymore. It was all abandoned,” she says. “You just can’t keep track of a million faces. After a while, you can’t even pretend to be in control.”
A history of facial recognition data
Researchers have identified four major eras of facial recognition, each driven by a growing desire to improve the technology. The first phase, which lasted until the 1990s, was largely characterized by labor-intensive manual methods and slow computation.
But then, driven by the realization that facial recognition could track and identify individuals more effectively than fingerprints, the U.S. Department of Defense invested $6.5 million to create the first large-scale face dataset. Over 15 photography sessions in three years, the project captured 14,126 images of 1,199 people. The Facial Recognition Technology Database (FERET) was released in 1996.
The following decade saw an increase in academic and commercial research on facial recognition, and many more datasets were created. The vast majority were obtained through photo shoots like FERET's and had the full consent of the participants. Many also included meticulous metadata, Raji says, such as the age and ethnicity of the subjects, or lighting information. But these early systems struggled in real-world settings, prompting researchers to seek out larger and more diverse datasets.
In 2007, the publication of the Labeled Faces in the Wild (LFW) dataset opened the doors to data collection through web search. Researchers began downloading images directly from Google, Flickr, and Yahoo without worrying about consent. LFW also relaxed standards around the inclusion of minors, using photos found with search terms such as “baby,” “juvenile,” and “adolescent” to increase diversity. This process made it possible to create much larger datasets in a short time, but facial recognition still faced the same challenges as before. That pushed researchers to seek yet more methods and data to overcome the technology’s poor performance.
Then, in 2014, Facebook used its user photos to train a deep learning model called DeepFace. Although the company never released the dataset, the system’s superhuman performance elevated deep learning to the de facto method for analyzing faces. It was at this point that manual verification and labeling became almost impossible, as datasets grew to tens of millions of photos, Raji says. It’s also when some really strange phenomena began to appear, like auto-generated labels that included offensive terminology.