Privacy researchers in Europe believe they have the first proof that a long-theorised vulnerability in systems designed to protect privacy, by aggregating and adding noise to data to mask individual identities, is more than just a theory.
The research has implications for the immediate field of differential privacy and beyond — raising wide-ranging questions about how privacy is regulated if anonymization only works until a determined attacker figures out how to reverse the method that’s being used to dynamically fuzz the data.
Current EU law doesn’t recognise anonymous data as personal data, although it does treat pseudonymized data as personal data because of the risk of re-identification.
Yet a growing body of research suggests the risk of de-anonymization on high-dimensional data sets is persistent. Even — per this latest research — when a database system has been very carefully designed with privacy protection in mind.
It suggests the entire business of protecting privacy needs to get a whole lot more dynamic to respond to the risk of perpetually evolving attacks.
Academics from Imperial College London and Université Catholique de Louvain are behind the new research.
This week, at the 28th USENIX Security Symposium, they presented a paper detailing a new class of noise-exploitation attacks on a query-based database that uses aggregation and noise injection to dynamically mask personal data.
The product they were looking at is a database querying framework called Diffix — jointly developed by a German startup called Aircloak and the Max Planck Institute for Software Systems.
On its website Aircloak bills the technology as “the first GDPR-grade anonymization” — aka Europe’s General Data Protection Regulation, which began being applied last year, raising the bar for privacy compliance by introducing a data protection regime that includes fines that can scale up to 4% of a data processor’s global annual turnover.
What Aircloak is essentially offering is to manage GDPR risk by providing anonymity as a commercial service — allowing queries to be run on a data-set so that analysts can gain valuable insights without accessing the data itself. The promise being that it’s privacy (and GDPR) ‘safe’ because it’s designed to mask individual identities by returning anonymized results.
The problem is personal data that’s re-identifiable isn’t anonymous data. And the researchers were able to craft attacks that undo Diffix’s dynamic anonymity.
“What we did here is we studied the system and we showed that actually there is a vulnerability that exists in their system that allows us to use their system and to send carefully created queries that allow us to extract — to exfiltrate — information from the data-set that the system is supposed to protect,” explains Imperial College’s Yves-Alexandre de Montjoye, one of five co-authors of the paper.
“Differential privacy really shows that every time you answer one of my questions you’re giving me information and at some point — to the extreme — if you keep answering every single one of my questions I will ask you so many questions that at some point I will have figured out every single thing that exists in the database because every time you give me a bit more information,” he says of the premise behind the attack. “Something didn’t feel right… It was a bit too good to be true. That’s where we started.”
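To illustrate the intuition de Montjoye describes, that every answer leaks a little more, here is a minimal Python sketch (a toy illustration, not Diffix or the paper's attack): a single noisy answer hides a count, but averaging many independently noised answers to the same question recovers it, which is why query-based systems have to limit or budget repeated questions.

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_COUNT = 37  # the value the system is trying to hide

def noisy_answer():
    # Each answer to the same question gets fresh, data-independent noise.
    return TRUE_COUNT + rng.normal(0, 5)

one_shot = noisy_answer()
averaged = np.mean([noisy_answer() for _ in range(10_000)])

print(f"single answer: {one_shot:.1f}")
print(f"average of 10,000 answers: {averaged:.2f}")  # converges on 37
```

Differential privacy formalises this by assigning each query a privacy cost and capping the total budget spent per analyst.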
The researchers chose to focus on Diffix because they were responding to a bug bounty challenge put out by Aircloak.
“We start from one query and then we do a variation of it and by studying the differences between the queries we know that some of the noise will disappear, some of the noise will not disappear and by studying noise that does not disappear basically we figure out the sensitive information,” he explains.
“What a lot of people will do is try to cancel out the noise and recover the piece of information. What we’re doing with this attack is we’re taking it the other way round and we’re studying the noise… and by studying the noise we manage to infer the information that the noise was meant to protect.
“So instead of removing the noise we study statistically the noise sent back that we receive when we send carefully crafted queries — that’s how we attack the system.”
A vulnerability exists because the dynamically injected noise is data-dependent, meaning it remains linked to the underlying information — and the researchers were able to show that carefully crafted queries can be devised to cross-reference responses, enabling an attacker to reveal information the noise is intended to protect.
Or, to put it another way, a well designed attack can accurately infer personal data from fuzzy (‘anonymized’) responses.
This despite the system in question being “quite good,” as de Montjoye puts it of Diffix. “It’s well designed — they really put a lot of thought into this and what they do is they add quite a bit of noise to every answer that they send back to you to prevent attacks”.
“It’s what’s supposed to be protecting the system but it does leak information because the noise depends on the data that they’re trying to protect. And that’s really the property that we use to attack the system.”
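That data dependence is the crux. The following toy sketch (my own illustration under stated assumptions, not Diffix's real noise mechanism or the attack from the paper) models an engine whose noise is deterministically seeded by the set of users matching a query. A pair of queries crafted to differ only by whether a target user can match then returns identical answers when the target does not contribute, and different answers when they do, so the noise itself betrays the target's attribute.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: each user has one boolean "sensitive" attribute.
users = {uid: bool(rng.integers(0, 2)) for uid in range(1000)}
TARGET = 42  # the user we want to learn about

def noisy_count(predicate):
    """Toy 'protected' engine: a count plus noise seeded from the set of
    matching users, i.e. data-dependent noise. Illustrative only; this is
    NOT how Diffix actually generates its noise layers."""
    matching = sorted(uid for uid, s in users.items() if predicate(uid, s))
    seed = int(hashlib.sha256(repr(matching).encode()).hexdigest(), 16) % 2**32
    noise = np.random.default_rng(seed).normal(0, 2)
    return len(matching) + noise

# Two crafted queries that differ only in whether TARGET can ever match.
q_with = noisy_count(lambda uid, s: s)                       # sensitive == True
q_without = noisy_count(lambda uid, s: s and uid != TARGET)  # same, minus TARGET

# If TARGET's bit is False, both queries match the same set, so the
# data-dependent noise (and the answer) is identical. Any difference
# therefore reveals that TARGET's sensitive bit is True.
print("inferred bit:", q_with != q_without, "| true bit:", users[TARGET])
```

With purely data-independent noise of this magnitude, the one-record difference in counts would be drowned out; it is the link between the noise and the matched set that the comparison exploits.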
The researchers were able to demonstrate the attack working with very high accuracy across four real-world data-sets. “We tried US census data, we tried credit card data, we tried location,” he says. “What we showed for different data-sets is that this attack works very well.
“What we showed is our attack identified 93% of the people in the data-set to be at risk. And I think more importantly the method actually is very high accuracy — between 93% and 97% accuracy on a binary variable. So if it’s a true or false we would guess correctly between 93-97% of the time.”
They were also able to optimise the attack method so they could exfiltrate information with a relatively low number of queries per user — as few as 32.
“Our goal was how low can we get that number so it would not look like abnormal behaviour,” he says. “We managed to decrease it in some cases up to 32 queries — which is very very little compared to what an analyst would do.”
After the researchers disclosed the attack to Aircloak, de Montjoye says the company developed a patch — and is describing the vulnerability as very low risk — but he points out it has yet to publish details of the patch, so it’s not been possible to independently assess its effectiveness.
“It’s a bit unfortunate,” he adds. “Basically they acknowledge the vulnerability [but] they don’t say it’s an issue. On the website they classify it as low risk. It’s a bit disappointing on that front. I think they felt attacked and that was really not our goal.”
For the researchers the key takeaway from the work is that a change of mindset is needed around privacy protection akin to the shift the security industry underwent in moving from sitting behind a firewall waiting to be attacked to adopting a pro-active, adversarial approach that’s intended to out-smart hackers.
“As a community to really move to something closer to adversarial privacy,” he tells TechCrunch. “We need to start adopting the red team, blue team penetration testing that have become standard in security.
“At this point it’s unlikely that we’ll ever find like a perfect system so I think what we need to do is how do we find ways to see those vulnerabilities, patch those systems and really try to test those systems that are being deployed — and how do we ensure that those systems are truly secure?”
“What we take from this is really — it’s on the one hand we need the security, what can we learn from security including open systems, verification mechanism, we need a lot of pen testing that happens in security — how do we bring some of that to privacy?”
“If your system releases aggregated data and you added some noise this is not sufficient to make it anonymous and attacks probably exist,” he adds.
“This is much better than what people are doing when you take the dataset and you try to add noise directly to the data. You can see why intuitively it’s already much better. But even these systems are still likely to have vulnerabilities. So the question is how do we find a balance, what is the role of the regulator, how do we move forward, and how do we really learn from the security community?
“We need more than some ad hoc solutions and only limiting queries. Again limiting queries would be what differential privacy would do — but then in a practical setting it’s quite difficult.
“The last bit — again in security — is defence in depth. It’s basically a layered approach — it’s like we know the system is not perfect so on top of this we will add other protection.”
The research raises questions about the role of data protection authorities too.
During Diffix’s development, Aircloak writes on its website that it worked with France’s DPA, the CNIL, and a private company that certifies data protection products and services — saying: “In both cases we were successful in so far as we received essentially the strongest endorsement that each organization offers.”
Although it also says that experience “convinced us that no certification organization or DPA is really in a position to assert with high confidence that Diffix, or for that matter any complex anonymization technology, is anonymous”, adding: “These organizations either don’t have the expertise, or they don’t have the time and resources to devote to the problem.”
The researchers’ noise exploitation attack demonstrates how even that level of regulatory “endorsement” can look problematic: even well designed, complex privacy systems can contain vulnerabilities and cannot offer perfect protection.
“It raises a tonne of questions,” says de Montjoye. “It is difficult. It fundamentally asks even the question of what is the role of the regulator here?
“When you look at security my feeling is it’s kind of the regulator is setting standards and then really the role of the company is to ensure that you meet those standards. That’s kind of what happens in data breaches.
“At some point it’s really a question of — when something [bad] happens — whether or not this was sufficient or not as a [privacy] defence, what is the industry standard? It is a very difficult one.”
“Anonymization is baked in the law — it is not personal data anymore so there are really a lot of implications,” he adds. “Again from security we learn a lot of things on transparency. Good security and good encryption relies on open protocol and mechanisms that everyone can go and look and try to attack so there’s really a lot at this moment we need to learn from security.
“There’s not going to be any perfect system. Vulnerabilities will keep being discovered so the question is how do we make sure things are still ok moving forward and really learning from security — how do we quickly patch them, how do we make sure there is a lot of research around the system to limit the risk, to make sure vulnerabilities are discovered by the good guys, these are patched and really [what is] the role of the regulator?
“Data can have bad applications and a lot of really good applications so I think to me it’s really about how to try to get as much of the good while limiting as much as possible the privacy risk.”