The Best Algorithms Struggle to Recognize Black Faces Equally

US government tests find even top-performing facial recognition systems misidentify blacks at rates five to 10 times higher than they do whites.

French company Idemia’s algorithms scan faces by the million. The company’s facial recognition software serves police in the US, Australia, and France. Idemia software checks the faces of some cruise ship passengers landing in the US against Customs and Border Protection records. In 2017, a top FBI official told Congress that a facial recognition system that scours 30 million mugshots using Idemia technology helps “safeguard the American people.”

But Idemia’s algorithms don’t always see all faces equally clearly. July test results from the National Institute of Standards and Technology indicated that two of Idemia’s latest algorithms were significantly more likely to mix up black women’s faces than those of white women, or black or white men.

The NIST test challenged algorithms to verify that two photos showed the same face, similar to how a border agent would check passports. At sensitivity settings where Idemia’s algorithms falsely matched different white women’s faces at a rate of one in 10,000, it falsely matched black women’s faces about once in 1,000—10 times more frequently. A one in 10,000 false match rate is often used to evaluate facial recognition systems.
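
The arithmetic behind those figures is straightforward: at a chosen similarity threshold, the false match rate is the share of different-person comparisons an algorithm scores above that threshold. The Python sketch below uses synthetic scores and hypothetical function names (it is not NIST’s methodology or data) to show how a threshold calibrated to a one-in-10,000 false match rate on one group can produce a much higher rate on another.

```python
import numpy as np

def false_match_rate(impostor_scores, threshold):
    """Fraction of different-person comparisons scored at or above the
    threshold, i.e. false matches."""
    return float(np.mean(np.asarray(impostor_scores) >= threshold))

def threshold_for_fmr(impostor_scores, target_fmr=1e-4):
    """Similarity threshold at which these impostor comparisons produce
    the target false match rate (e.g. 1 in 10,000)."""
    return float(np.quantile(np.asarray(impostor_scores), 1.0 - target_fmr))

# Synthetic impostor-score distributions for two hypothetical groups
# (made up for illustration, not real test data).
rng = np.random.default_rng(0)
scores_group_a = rng.normal(0.30, 0.10, 1_000_000)
scores_group_b = rng.normal(0.36, 0.10, 1_000_000)

# Calibrate the threshold so group A's false match rate is 1 in 10,000 ...
t = threshold_for_fmr(scores_group_a, target_fmr=1e-4)

# ... then measure what the same setting yields for group B.
print(false_match_rate(scores_group_a, t))  # ~0.0001
print(false_match_rate(scores_group_b, t))  # noticeably higher
```

In practice the operating threshold is typically set on a pooled or reference population, which is one reason a single setting can behave very differently across demographic groups.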

Donnie Scott, who leads the US public security division at Idemia, previously known as Morpho, says the algorithms tested by NIST have not been released commercially, and that the company checks for demographic differences during product development. He says the differing results likely came from engineers pushing their technology to get the best overall accuracy on NIST’s closely watched tests. “There are physical differences in people and the algorithms are going to improve on different people at different rates,” he says.

Computer vision algorithms have never been so good at distinguishing human faces. NIST said last year that the best algorithms got 25 times better at finding a person in a large database between 2010 and 2018, and miss a true match just 0.2 percent of the time. That’s helped drive widespread use in government, commerce, and gadgets like the iPhone.

But NIST’s tests and other studies have repeatedly found that the algorithms have a harder time recognizing people with darker skin. The agency’s July report covered tests on code from more than 50 companies. Many top performers in that report show performance gaps similar to Idemia’s 10-fold difference in error rate between black and white women. NIST has published results of demographic tests of facial recognition algorithms since early 2017. It has also consistently found that they perform less well for women than men, an effect believed to be driven at least in part by the use of makeup.

“White males ... is the demographic that usually gives the lowest FMR,” or false match rate, the report states. “Black females ... is the demographic that usually gives the highest FMR.” NIST plans a detailed report this fall on how the technology works on different demographic groups.

NIST’s studies are considered the gold standard for evaluating facial recognition algorithms. Companies that do well use the results for marketing. Chinese and Russian companies have tended to dominate the rankings for overall accuracy, and tout their NIST results to win business at home. Idemia issued a press release in March boasting that it performed better than competitors for US federal contracts.

Chart: Many facial recognition algorithms are more likely to mix up black faces than white faces. Each chart represents a different algorithm tested by the National Institute of Standards and Technology; those with a solid red line uppermost incorrectly match black women’s faces more often than other groups. Source: NIST

The Department of Homeland Security has also found that darker skin challenges commercial facial recognition. In February, DHS staff published results from testing 11 commercial systems designed to check a person’s identity, as at an airport security checkpoint. Test subjects had their skin pigment measured. The systems that were tested generally took longer to process people with darker skin and were less accurate at identifying them—although some vendors performed better than others. The agency’s internal privacy watchdog has said DHS should publicly report the performance of its deployed facial recognition systems, like those in trials at airports, on different racial and ethnic groups.

The government reports echo critical 2018 studies from the ACLU and from MIT researchers openly wary of the technology. Those studies reported that algorithms from Amazon, Microsoft, and IBM were less accurate on darker skin.

Those findings have stoked a growing national debate about the proper, and improper, uses of facial recognition. Some civil liberties advocates, lawmakers, and policy experts want government use of the technology to be restricted or banned, as it was recently in San Francisco and two other cities. Their concerns include privacy risks, the balance of power between citizens and the state—and racial disparities in results. Even if facial recognition worked equally well for all faces, there would still be reasons to restrict the technology, some critics say.

Despite the swelling debate, facial recognition is already embedded in many federal, state, and local government agencies, and it’s spreading. The US government uses facial recognition for tasks like border checks and finding undocumented immigrants.

Earlier this year, the Los Angeles Police Department responded to a home invasion that escalated into a fatal shooting. One suspect was arrested but another escaped. Detectives identified the fugitive by using an online photo to search a mugshot facial recognition system maintained by the Los Angeles County Sheriff’s Office.

Lieutenant Derek Sabatini of the Sheriff’s Office says the case shows the value of the system, which is used by more than 50 county agencies and searches a database of more than 12 million mugshots. Detectives might not have found the suspect as quickly without facial recognition, Sabatini says. “Who knows how long it would have taken, and maybe that guy would not have been there to scoop up,” he says.

The LA County system was built around a face-matching algorithm from Cognitec, a German company that, like Idemia, supplies facial recognition to governments around the world. As with Idemia, NIST testing of Cognitec’s algorithms shows they can be less accurate for women and people of color. At sensitivity thresholds that resulted in white women being falsely matched once in 10,000, two Cognitec algorithms NIST tested were about five times as likely to misidentify black women.

Thorsten Thies, Cognitec’s director of algorithm development, acknowledged the difference but says it is hard to explain. One factor could be that it is “harder to take a good picture of a person with dark skin than it is for a white person,” he says.

Sabatini dismisses concerns that—whatever the underlying cause—skewed algorithms could lead to racial disparities in policing. Officers check suggested matches carefully and seek corroborating evidence before taking action, he says. “We’ve been using it here since 2009 and haven’t had any issues: no lawsuits, no cases, no complaints,” he says.

Concerns about the intersection of facial recognition and race are not new. In 2012, the FBI’s top facial recognition expert coauthored a research paper that found commercial facial recognition systems were less accurate for black people and women. Georgetown researchers warned of the problem in an influential 2016 report that said the FBI can search the faces of roughly half the US population.

The issue has gained a fresh audience as facial recognition has become more common and policymakers and policy experts have grown more interested in the technology’s limitations. The work of MIT researcher and activist Joy Buolamwini has been particularly influential.

Early in 2018 Buolamwini and fellow AI researcher Timnit Gebru showed that Microsoft and IBM services that try to detect the gender of faces in photos were near perfect for men with pale skin but failed more than 20 percent of the time on women with dark skin; a subsequent study found similar patterns for an Amazon service. The studies didn’t test algorithms that attempt to identify people—something Amazon called “misleading” in an aggressive blog post.

Buolamwini was a star witness at a May hearing of the House Oversight and Reform Committee, where lawmakers showed bipartisan interest in regulating facial recognition. Chairman Elijah Cummings (D-Maryland) said racial disparities in test results heightened his concern at how police had used facial recognition during 2015 protests in Baltimore over the death in police custody of Freddie Gray, a black man. Later, Jim Jordan (R-Ohio) declared that Congress needs to “do something” about government use of the technology. “[If] a facial recognition system makes mistakes and those mistakes disproportionately affect African Americans and persons of color, [it] appears to me to be a direct violation of Americans’ First Amendment and Fourth Amendment liberties,” he said.

Why facial recognition systems perform differently for darker skin tones is unclear. Buolamwini told Congress that many datasets used by companies to test or train facial analysis systems are not properly representative. The easiest place to gather huge collections of faces is from the web, where content skews white, male, and western. Three face-image collections most widely cited in academic studies are 81 percent or more people with lighter skin, according to an IBM review.

Patrick Grother, a widely respected figure in facial recognition who leads NIST’s testing, says there may be other causes for lower accuracy on darker skin. One is photo quality. Photographic technology and techniques have been optimized for lighter skin from the beginnings of color film into the digital era. He also posed a more provocative hypothesis at a conference in November: that black faces are statistically more similar to one another than white faces are. “You might conjecture that human nature has got something to do with it,” he says. “Different demographic groups might have differences in the phenotypic expression of our genes.”

Michael King, an associate professor at Florida Institute of Technology who previously managed research programs for US intelligence agencies that included facial recognition, is less sure. “That’s one that I am not prepared to discuss at this point. We have just not got far enough in our research,” he says.

King’s latest results, with colleagues from FIT and the University of Notre Dame, illustrate the challenge of explaining demographic inconsistency in facial recognition algorithms and deciding what to do about it.

Their study tested four facial recognition algorithms—two commercial and two open source—on 53,000 mugshots. Mistakes that incorrectly matched two different people were more common for black faces, but errors in which matching faces went undetected were more common for white faces. A greater proportion of the mugshots of black people didn’t meet standards for ID photos, but that alone could not explain the skewed performance.

The researchers did find they could get the algorithms to perform equally for blacks and whites—but only by using different sensitivity settings for the two groups. That’s unlikely to be practical outside the lab because asking detectives or border agents to choose a different setting for different groups of people would create its own discrimination risks, and could draw lawsuits alleging racial profiling.
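
A minimal Python sketch of that trade-off, using synthetic score distributions and hypothetical names rather than the study’s mugshot data: it computes both error types for each group at one shared threshold, then shows that equalizing the false match rate forces a different threshold per group.

```python
import numpy as np

def error_rates(genuine_scores, impostor_scores, threshold):
    """False match rate (different people wrongly matched) and false
    non-match rate (true matches that go undetected) at a threshold."""
    fmr = float(np.mean(np.asarray(impostor_scores) >= threshold))
    fnmr = float(np.mean(np.asarray(genuine_scores) < threshold))
    return fmr, fnmr

# Synthetic per-group distributions (illustrative only): "impostor" scores
# come from different-person pairs, "genuine" scores from same-person pairs.
rng = np.random.default_rng(1)
groups = {
    "black": {"impostor": rng.normal(0.36, 0.10, 500_000),
              "genuine":  rng.normal(0.78, 0.10, 50_000)},
    "white": {"impostor": rng.normal(0.30, 0.10, 500_000),
              "genuine":  rng.normal(0.74, 0.10, 50_000)},
}

# One shared threshold: unequal false match and false non-match rates.
shared_threshold = 0.65
for name, s in groups.items():
    print(name, error_rates(s["genuine"], s["impostor"], shared_threshold))

# Equal false match rates require a different threshold for each group,
# the per-group sensitivity setting the researchers found impractical.
for name, s in groups.items():
    t = np.quantile(s["impostor"], 1.0 - 1e-4)
    print(name, "threshold for FMR of 1 in 10,000:", round(float(t), 3))
```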

While King and others carefully probe algorithms in the lab, political fights over facial recognition are moving fast. Members of Congress on both sides of the aisle have promised action to rein in the technology, citing worries about accuracy for minorities. Tuesday, Oakland became the third US city to ban its agencies from using the technology since May, following Somerville, Massachusetts, and San Francisco.

King says that the science of figuring out how to make algorithms work the same on all faces will continue at its own pace. “Having these systems work equally well for different demographics or even understanding whether or why this might be possible is really a long term goal,” he says.

