Face recognition algorithms produce different rates of accuracy based on sex, age, and race or country of birth, although algorithms that are more accurate generally produce fewer errors, says a new report by the National Institute of Standards and Technology (NIST).

NIST says the study on demographic effects is the “first of its kind” and is the third report so far under its Face Recognition Vendor Test effort. It concludes that “We found empirical evidence for the existence of demographic differentials in the majority of contemporary face algorithms that we evaluated.”

The report makes two broad conclusions about the types of false matches algorithms generate, saying there are more false positives than false negatives. False positives refer to getting a positive match between samples of two different persons, while false negatives refer to a failure to get a match when two images of the same person are used.

NIST says false negatives “occur when the similarity between two photos is low, reflecting either some change in the person’s appearance or in the image properties,” while false positives “occur when the digitized faces of two people are similar.”
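To make the distinction concrete, the following is a minimal sketch, assuming a verification system that scores the similarity of two photos and declares a match when the score clears a threshold; the function name, scores, and threshold value are illustrative assumptions, not drawn from the NIST report.

```python
# A minimal sketch (not from the NIST report) of how verification errors are tallied.
# The threshold and sample scores below are illustrative assumptions.

def tally_errors(comparisons, threshold=0.6):
    """Count false positives and false negatives from (score, same_person) pairs.

    comparisons: iterable of (similarity_score, same_person) tuples, where
    same_person is True when both photos show the same individual.
    """
    false_positives = 0  # different people, but the score clears the threshold
    false_negatives = 0  # same person, but the score falls below the threshold
    for score, same_person in comparisons:
        declared_match = score >= threshold
        if declared_match and not same_person:
            false_positives += 1
        elif not declared_match and same_person:
            false_negatives += 1
    return false_positives, false_negatives

# Example: two genuine pairs and two impostor pairs.
sample = [(0.82, True), (0.41, True), (0.73, False), (0.12, False)]
print(tally_errors(sample))  # -> (1, 1)
```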

The report had been anticipated and is of interest to lawmakers, privacy and civil liberties advocates, algorithm developers, and users of face recognition technology, which is being rolled out in government identity programs such as Department of Homeland Security efforts to identify and verify people entering and exiting the U.S. and traveling through security checkpoints at airports.

The study points to recent reports and media stories about biases in face recognition technology but cautions that such reporting should specify which algorithm was evaluated.

For false positives, the report says rates were two to five times higher for women than for men, a finding that held “across algorithms and datasets,” NIST says. For false positives related to race, the study finds the rates are highest with people from West and East Africa and East Asia, slightly lower with South Asians and Central Americans, and generally lowest with East Europeans.

Using U.S. law enforcement images, the study says the highest false positive rates occur with American Indians, while rates are elevated with African Americans and Asians, noting that “the relative ordering depends on sex and varies with algorithm.”

The report says that “a number of algorithms developed in China” show better results for false positives for East Asians.

When it comes to age, NIST says false positives are higher with the elderly and children and lowest with middle-aged adults.

NIST also notes that databases of facial image enrollment records can have a mitigating impact on demographic differentials.

“The presence of an enrollment database affords one-to-many algorithms a resource for mitigation of demographic effects that purely one-to-one verification systems do not have,” the report says. “We note that demographic differentials present in one-to-one verification algorithms are usually, but not always, present in one-to-many search algorithms.”
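As a rough illustration of that architectural distinction, here is a minimal sketch, assuming a system that exposes a one-to-one verification call and a one-to-many search over an enrollment gallery; the function names, the toy similarity function, and the threshold are assumptions for illustration, not NIST’s evaluation code.

```python
# A minimal sketch contrasting one-to-one verification with one-to-many identification.
# All names, scores, and the threshold are illustrative assumptions.

def verify_one_to_one(score_fn, probe, reference, threshold=0.6):
    """One-to-one: compare a probe image to a single claimed identity."""
    return score_fn(probe, reference) >= threshold

def identify_one_to_many(score_fn, probe, gallery, threshold=0.6):
    """One-to-many: search a probe against an enrollment database (gallery).

    Returns the best-scoring enrolled identity, or None if nothing clears the
    threshold. The enrollment database itself gives developers room to calibrate
    or normalize scores, which is the mitigation resource the report describes.
    """
    best_id, best_score = None, float("-inf")
    for identity, enrolled_image in gallery.items():
        score = score_fn(probe, enrolled_image)
        if score > best_score:
            best_id, best_score = identity, score
    return best_id if best_score >= threshold else None

# Toy usage with a stand-in similarity function over feature vectors.
def toy_score(a, b):
    return sum(x * y for x, y in zip(a, b))

gallery = {"alice": [0.9, 0.1], "bob": [0.2, 0.8]}
print(verify_one_to_one(toy_score, [0.85, 0.2], gallery["alice"]))  # -> True
print(identify_one_to_many(toy_score, [0.85, 0.2], gallery))        # -> 'alice'
```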

Algorithms Make a Difference

The report highlights that quality algorithms can make a difference.

“One important exception is that some developers supplied identification algorithms for which false positive differentials are undetectable,” NIST says. “Among those is IDEMIA, who publicly described how this was achieved. A further algorithm, NEC-3, is on many measures, the most accurate we have evaluated. Other developers producing algorithms with stable false positive rates are Aware [AWRE], Toshiba, Tevian and Real Networks. These algorithms also give false positive identification rates that are approximately independent of the size of enrollment database.”

In a statement, NEC said it is “pleased at the finding that our algorithm is among the most accurate ever tested and is among the small grouping of vendors where false positives based on demographic differentials were ‘undetectable.’”

With respect to false negative results, the report says the particular algorithm used can also make a big difference, from error rates below a half-percent to greater than 10 percent.

“For the more accurate algorithms, false negative rates are usually low with average demographic differentials being, necessarily, smaller still,” NIST says. “This is an important result: use of inaccurate algorithms will increase the magnitude of false negative differentials.”

The report also adds that in real-time cases, those in which someone is cooperatively allowing their image to be captured, a second image capture can rectify an initial false negative.

Customs and Border Protection, a DHS component, and its airport and airline partners are rolling out face recognition matching for people departing from and arriving in the U.S. on international flights. The Transportation Security Administration, another DHS component, is also evaluating the technology at some aviation security checkpoints.

“While it is usually incorrect to make statements across algorithms, we found empirical evidence for the existence of demographic differentials in the majority of the face recognition algorithms we studied,” says Patrick Grother, a NIST scientist and the report’s primary author. “While we do not explore what might cause these differentials, this data will be valuable to policymakers, developers and end users in thinking about the limitations and appropriate use of these algorithms.”

CBP maintains that its accuracy rates for matching travelers exiting the country are around 98 percent.

Report Elicits Concern

Rep. Bennie Thompson (D-Miss.), chairman of the House Homeland Security Committee, was highly critical of the implications of the report’s findings.

“In recent years, the Department of Homeland Security has significantly increased its use of facial recognition technology on Americans and visitors alike, despite serious privacy and civil liberties concerns,” he said in a statement following release of the report. “This report not only confirms these concerns, but shows facial recognition systems are even more unreliable and racially biased than we feared. It is clear these systems have systemic design flaws that have not been fixed and may well negate their effectiveness. This Administration must reassess its plans for facial recognition technology in light of these shocking results.”

False negative rates found with images captured for border crossings are “generally higher in individuals born in Africa and the Caribbean, the effect being strong in older individuals,” the report says.

Using U.S. mugshots, the report says false negative rates are higher with Asians and American Indians than with white and black faces, adding that the lowest false negative rates occur in black faces.

Error rates are also usually higher in women and younger children, especially with mugshots, but the report says “there are many exceptions to this, so universal statements pertaining to algorithms’ false negative rates across sex and age are not supported.”

NIST says the quality of a photo is important in whether false negative rates are higher or lower. Photos acquired when someone applies for a credential or benefit, compared against other “application” photos, produce “very low” error rates, so low that measuring demographic differences becomes challenging, it says.

“This implies that better image quality reduces false negative rates and differentials,” NIST says.

NIST also says that when higher-quality application photos are compared with lower-quality border crossing photos, false negative rates are higher, particularly with women, although “the differentials are smaller and not consistent.”

That said, NIST says its study didn’t consider the effect of cameras, and points out that for its research it didn’t have data on human-camera interaction or on failures to enroll.

“In fact, we note demographic effects even in high-quality images, notably elevated false positives,” the report says.

For the report, NIST evaluated 189 algorithms from 99 developers and used nearly 18.3 million images of 8.5 million people.