For The U.S. Census, Keeping Your Data Anonymous And Useful Is A Tricky Balance
As the country waits for more results from last year's national head count, the U.S. Census Bureau is facing an increasingly tricky balancing act.
How will the largest public data source in the United States continue to protect people's privacy while also sharing the detailed demographic information used for redrawing voting districts, guiding federal funding, and informing policymaking and research for the next decade?
Concerns have been brewing among census watchers about how the bureau will strike that balance, beginning with the redistricting data it's on track to put out by mid-August.
That release is expected to be the first set of 2020 census statistics to come with controversial new safeguards that bureau officials say are needed to keep people anonymous in publicly available data and prevent the exploitation of their personal information. But based on early tests, many data users are alarmed that the new privacy protections could render some of the new census statistics useless.
The state of Alabama has filed a federal lawsuit to try to block the bureau from putting these new protections in place. The case is currently before a three-judge court that is expected to rule soon on a request for an emergency court order. Whichever way it goes, the case is likely to reach the U.S. Supreme Court. The legal challenge could ultimately derail the bureau's schedule for releasing the data many state and local redistricting officials need to prepare for upcoming elections.
Here's what else you need to know:
Why does the Census Bureau have to protect people's privacy?
Under current law, the federal government is not allowed to release personally identifiable information from the census until 72 years after it's gathered for the constitutionally mandated tally. The bureau has relied on that promise of confidentiality to persuade many of the country's residents to volunteer their information once a decade, especially people of color, immigrants and other historically undercounted groups who may be unsure about how their responses could be used against them.
But it is becoming harder for the bureau to uphold that pledge and continue releasing statistics from the census. Advances in computing and access to voter registration lists and commercial data sets that can be cross-referenced have made it easier to trace purportedly anonymized information back to an individual person.
For a way out of this conundrum, the bureau has been building a new privacy protection system based on a mathematical concept known as differential privacy. Invented at Microsoft's research arm, it has served as a framework for privacy measures in smaller Census Bureau projects, as well as at some tech companies.
"Differential privacy is in every iPhone and every iPad," says Cynthia Dwork, a computer scientist at Microsoft Research and Harvard University who co-invented differential privacy. "That may have a larger scale than the number of respondents to the U.S. decennial census, but there's a totality and commitment to privacy that's different here" with the bureau's plans for 2020 census data, Dwork adds.
How has the bureau protected people's privacy in past census data?
For decades, the bureau has stripped away names and addresses from census records before turning them into anonymized data. That information is broken down by race, ethnicity, age and sex to levels as detailed as a neighborhood.
But even in a sea of statistics, certain households — particularly those in the minority of a community — can stick out because they live in isolated areas or have other distinctive characteristics that could make it easier to reveal who they are.
As additional privacy protections over the years, the agency has withheld some data tables, and sometimes particular cells within tables, from the public. The bureau has also added "noise," or data for fuzzing the census results, to certain tables before releasing them. Beginning with data from the 1990 count, it has used a technique called "swapping" to switch out data about certain households with those from different neighborhoods.
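To make the idea of swapping concrete, here is a toy sketch in Python. It is not the bureau's procedure, whose details are not described here. The record fields, the `swap_rate` parameter and the rule of pairing households of the same size are all assumptions made for illustration.

```python
import random
from collections import Counter


def swap_households(records, swap_rate, rng):
    """Return a copy in which some households trade block labels with a
    same-size household from a different block (a toy pairing rule)."""
    swapped = [dict(r) for r in records]  # copy so the input is untouched
    order = list(range(len(swapped)))
    rng.shuffle(order)
    used = set()
    for i in order:
        if i in used or rng.random() >= swap_rate:
            continue
        partners = [
            j for j in order
            if j not in used and j != i
            and swapped[j]["size"] == swapped[i]["size"]
            and swapped[j]["block"] != swapped[i]["block"]
        ]
        if not partners:
            continue  # no eligible partner, so this household stays put
        j = rng.choice(partners)
        swapped[i]["block"], swapped[j]["block"] = (
            swapped[j]["block"], swapped[i]["block"])
        used.update((i, j))
    return swapped


def block_population(records):
    """Total number of people per block."""
    totals = Counter()
    for r in records:
        totals[r["block"]] += r["size"]
    return totals


# Hypothetical households, invented for this example.
households = [
    {"block": "A", "size": 2, "race": "White"},
    {"block": "A", "size": 4, "race": "Black"},
    {"block": "B", "size": 2, "race": "Asian"},
    {"block": "B", "size": 4, "race": "White"},
    {"block": "C", "size": 1, "race": "Black"},
]
result = swap_households(households, swap_rate=1.0, rng=random.Random(1))
print(block_population(households) == block_population(result))
```

Because each swap pairs households of the same size, every block keeps its total population while the demographic details inside a block can change. That is the general trade this kind of technique makes.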
What prompted the bureau to choose differential privacy to protect 2020 census data?
In 2016, researchers at the bureau began conducting internal experiments to test the strength of the privacy protections used for 2010 census data, and based on the results, agency officials concluded they can no longer rely on data swapping.
Using a fraction of the census data the bureau released a decade ago, the researchers were able to reconstruct a complete set of records for every person included in the 2010 census numbers. Then, after cross-referencing that reconstructed data with records bought from commercial databases, they were able to re-identify 52 million people by name, according to a court filing by John Abowd, the bureau's chief scientist. In a worst-case scenario, the bureau's researchers estimated, attackers with access to more commercial data could unmask the identities of as many as 179 million people, or 58% of the population included in the 2010 census.
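For scale, the figures in that filing can be put side by side with a little arithmetic. The short Python sketch below uses only the numbers quoted above. The population total is inferred from the 58% figure rather than taken from an official count, so the results are approximate.

```python
# Figures quoted from the court filing described above.
worst_case_people = 179_000_000   # worst-case re-identifications
worst_case_share = 0.58           # share of the 2010 census population
reidentified_people = 52_000_000  # people re-identified in the experiment

# Inferred, not official: the population implied by 179 million being 58%.
implied_population = worst_case_people / worst_case_share

# Share of that population re-identified by name in the experiment.
reidentified_share = reidentified_people / implied_population

print(f"Implied population: {implied_population / 1e6:.1f} million")
print(f"Re-identified share: {reidentified_share:.1%}")
```

On these numbers, the 52 million people re-identified in the experiment work out to roughly one in six of those counted in 2010.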
To try to better protect people's privacy for the 2020 census, the bureau announced plans in 2017 to create a new system based on differential privacy. Officials say it allows them to add the least amount of noise needed to preserve privacy in most of the released data while balancing confidentiality and usability.
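For readers curious about what adding noise looks like in practice, here is a minimal Python sketch of the Laplace mechanism, a textbook way to release a count under differential privacy. It is not the bureau's actual system. The function names and the privacy parameter `epsilon` are illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two independent
    exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1. A smaller epsilon means more noise and more privacy.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(2020)  # fixed seed so the example is repeatable
true_population = 120      # hypothetical population of one small area
for epsilon in (0.1, 1.0, 10.0):
    released = noisy_count(true_population, epsilon, rng)
    print(f"epsilon={epsilon}: released {released:.1f}")
```

In this textbook version, the single parameter `epsilon` sets the trade-off the article describes. Lower values blur the published number more, and higher values keep it closer to the true count.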
"Obviously, you know, it's not the easiest thing to do," the bureau's acting director, Ron Jarmin, said this month at the Population Association of America's annual meeting, adding that the bureau decided against data swapping and withholding certain tables as alternative safeguards. "To achieve a similar level of privacy protection with those sort of traditional methods, I think, would have produced a product that was even ... less useful for data users than what we're contemplating right now."
How will differential privacy affect 2020 census data?
The bureau says no noise was added to protect people's privacy in the new state population numbers, including those used to reallocate congressional seats and Electoral College votes, as well as numbers for Washington, D.C., and Puerto Rico. The bureau is also planning to release the total number of housing units in each census block, as well as the number of prisons, college dorms and other group-living quarters in each block, without privacy protections.
But it remains unclear how the bureau's differential privacy plans will affect other new redistricting data that is expected out by Aug. 16, including population numbers and demographic details about counties, cities and other smaller areas.
It will depend on the amount of noise the bureau chooses to add and how it tries to smooth out the effects of adding noise. Bureau officials plan to make their decisions for the new redistricting data in early June. Separate privacy protection decisions for other 2020 data sets are expected to be made later after gathering more public feedback.
Why have the bureau's differential privacy plans been controversial?
Preliminary tests of the bureau's new privacy protections have left many data users worried that their ability to use 2020 census statistics could be severely limited, particularly data about small geographic areas and minority groups within communities that many governments rely on for planning.
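A rough calculation shows why small areas and small groups are the main worry. The Python sketch below assumes the same textbook Laplace noise is added to every count, which is a simplification and not a description of the bureau's design. The `EPSILON` value and the example populations are hypothetical.

```python
def expected_abs_error(epsilon):
    """Average size of Laplace noise with scale 1/epsilon, in people."""
    return 1.0 / epsilon


def relative_error(true_count, epsilon):
    """Typical noise as a fraction of the true count."""
    return expected_abs_error(epsilon) / true_count


EPSILON = 0.5  # hypothetical privacy setting
areas = {"census block": 40, "small town": 4_000, "county": 400_000}
for name, population in areas.items():
    share = relative_error(population, EPSILON)
    print(f"{name} ({population:,} people): typical error {share:.4%}")
```

Under these assumptions the typical error is two people everywhere. That is 5% of a 40-person block but a negligible fraction of a county, so the smaller the group being counted, the more the noise matters.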
Bureau officials have stressed, however, that their differential privacy plans are still a work in progress. For now, they are gathering feedback from the public through May 28 before finalizing plans for new redistricting data next month.
In the meantime, Alabama filed a federal lawsuit in March and is trying to block the bureau from using differential privacy, which the state claims will make the data unusable for redrawing voting maps. Sixteen states, most of which also have Republican-controlled legislatures, are supporting Alabama's claims in an amicus brief.
And more lawsuits over differential privacy may be coming later, including from civil rights groups that have been monitoring the bureau's test data to see whether the new protections make it harder to ensure fair representation of people of color during redistricting.
"At this point, it seems not at all clear that anything the bureau releases will eliminate the possibility that the Voting Rights Act and its enforcement could be adversely affected by differential privacy," says Thomas Saenz, the president and general counsel of the Mexican American Legal Defense and Educational Fund who also serves on one of the bureau's committees of outside advisers.
What happens if the courts block the bureau from using differential privacy?
The release of 2020 census redistricting data — which is already late because of the coronavirus pandemic and the Trump administration's interference with the census schedule — could be further delayed by "multiple months" past August, Abowd, the bureau's chief scientist, has warned in a court filing.
"This delay is unavoidable because the Census Bureau would need to develop and test new systems and software," Abowd added, later estimating that the work could last for at least six to seven months.
Editor's note: Apple and Microsoft are among NPR's financial supporters.
Copyright 2021 NPR. To see more, visit https://www.npr.org.