Cantech Letter recently had the opportunity to speak to Khaled El Emam, founder and CEO of Privacy Analytics, an Ottawa-based provider of data anonymization software. The software is designed to let companies that handle large data sets anonymize that data so they can analyze and share it effectively while minimizing the risk that it will be used to re-identify individuals.
While Privacy Analytics currently specializes in healthcare data, and is gradually moving into financial services, it’s developed a solution that ought to concern any company dealing in data, which these days is basically every company.
People are frequently told that there is a choice between freedom and security in various aspects of life. In marketing, that choice is presented as a bargain made between the consumer and the data miner. By surrendering your data, you get a price break on that coffee you like. Or you get to share vacation photos with your friends. But what are those companies doing with your data? How is it being handled?
But innovation versus privacy is a false choice. We can absolutely have both, each making the other better.
A lot of companies employ what El Emam calls “Mickey Mouse” tactics when it comes to security, a term that vividly divides real companies from what ought to be considered the technological equivalent of summer camp.
Making innovative digital platforms and solutions that satisfy consumer demand for social media engagement and access to bargain goods should not come at the price of surrendering your personal information to digital marketing firms.
Recently, with the onrush of apps and start-up culture, it feels like the meaning of the word “innovation” has been adjusted to merely mean “making cool stuff”, while ideas like privacy and the treatment of personal data are hand-waved away as secondary considerations.
But that’s not what innovation means.
Innovation happens when the impulse to create is met by an obstacle. It’s how credible people react in the face of pesky details like, say, regulation. If the word “regulation” rubs you the wrong way, think of it as a “design challenge”.
As data breaches continue to make headlines, it’s becoming clear that even large companies, from Sony to major financial institutions, are not taking issues of data security seriously enough.
How companies handle personal data, and in particular their ability to anonymize it, is likely to become a litmus test for VC investors of whether a company can be taken seriously over the long term. Will a company’s treatment of data withstand regulatory requirements? If its approach to data security isn’t likely to hold up in court down the road, an investor might be wise to sniff elsewhere for a company that won’t eventually be held liable for poor decisions made during its initial growth period.
Innovators who do take data security seriously are beginning to recognize that addressing privacy concerns needs to be baked into the development of technological solutions, for credibility’s sake.
At a recent Toronto event called the “De-Identification Symposium: Preserving Privacy AND Advancing Data Analytics,” Dr. Ann Cavoukian, Executive Director of Ryerson’s Privacy and Big Data Institute, said, “Much has been written about the demise of privacy and data protection. But this perspective often comes loaded with unrealistic expectations and the pursuit of non-existent zero-risk solutions. It is far better to pursue real world solutions through ‘Privacy by Design’ to deliver the doubly-enabling, win/win, positive-sum framework of privacy and data analytics.”
Cantech Letter talked to Khaled El Emam, founder and CEO of Privacy Analytics, by phone recently.
Can you lay out for us what some of the unique challenges are for data anonymization in the health sector?
I think that, given the existing legal framework, and this is true globally, the challenge arises if you want to use data for secondary purposes, meaning not for care, because when you use data for care you know where the patients are and that’s not an issue. When you use it for secondary purposes, such as research, public health, any kind of sales and marketing, or drug development, then you really have two options. One is to obtain consent from every single patient to use it for that specific analysis. The other is to anonymize the data. It’s not practical to obtain consent, especially if you have large databases and you don’t know what kind of analysis you’re going to do when you collect the data. So the only practical way to be able to use data and share it for secondary purposes is to anonymize it. That’s kind of where we are. That’s our existing legal framework. And as more data sharing is happening, it’s important to do this well. You want to maintain the trust of the public. And you want to ensure that these data flows continue and that the regulators don’t put rules in place that will shut down or limit the data sharing that is very beneficial for society, and economically as well.
On that note, what are the potential upsides for researchers in the health sector for having access to this data for research purposes? In benchmark terms, where are we and where could we be if researchers had unproblematic access to reliably anonymized data?
The benefits would be, I think, quite tremendous. Many discoveries made over the years were due to being able to access data. From a purely research perspective, from a public health surveillance perspective, in order to detect disease outbreaks, public health has always been a very data intensive exercise. But looking into the future, we’re looking at these learning health systems, where a doctor at the point of care can query a database and find all other patients similar to the patient that they’re treating, so that they can use the most up-to-date evidence at that time, not the last set of guidelines that took three years to put together based on evidence that was eight years old. But they can use evidence that’s very recent to inform their decisions. So that’s a vision for a learning health system, where old data are available and old experiences are being used to provide up-to-date information to treat patients. But from a research perspective, absolutely. You talk to any researcher and ask them, “Could you do more great things if you had more data?” and the answer will be a clear yes. And not having data limits what can be done, that’s also very clear. I think these are the starting points for the conversation about privacy in the world of Big Data. If you have data, great things will happen. Not just societal benefits, but commercial benefits as well. You can target clinical trials, you can look for safety problems in drugs, you can target your marketing, your sales efforts, and so on.
Is it a recent development to have truly anonymized data? Was it the norm before to have pseudonymized data?
No, it’s always been anonymized data. It’s just that there has been pushback from some industries that don’t want to anonymize their data, and they’re lobbying very hard to make pseudonymized data more acceptable. But the evidence is against them. I cannot see how the public could accept that, once they realize that essentially personal health information is being shared broadly under the guise of pseudonymized data. What’s been happening in the UK has been very interesting to watch in that regard, so we’ll see how that progresses. But the facts are what they are. You can ignore the facts and legislate or not, so we’ll see how that plays out. Anonymization is an area that has been around for decades. The national statistical agencies, Statistics Canada, the Census Bureau, the Office for National Statistics in the UK, they have been releasing census data for 20-30 years. There’s a body of work that goes back that long around anonymization. There’s a large body of work. Census data has been shared for decades. There are many methods that have been developed, pretty strong methods, and there’s a whole disclosure control community that’s been working in this space for a while. This is not new. This has been around for some time. I think what’s new is the volume of data that’s being shared and the amount of attention this is getting because of the volume of data that’s being shared. As more health data is being digitized, then there’s more data to share and to use for secondary purposes. So this is a topic that’s moved away from being a very specialized group of people working at national statistics agencies to a much broader topic that intersects with Big Data and analytics efforts.
A leading health legal expert told Bloomberg recently, “Electronic health information is like nuclear energy. If it’s harnessed and kept under tight control, it has potential for good. But if it gets out of control, the damage is incalculable.” What do you think he means by that?
He’s basically saying that if you’re going to share health information, you have to do it responsibly. Is everybody doing it responsibly? No. But that’s what we’re striving for. We’re striving for developing the standards, the methods, the tools, the training, to increase the number of organizations that are sharing data responsibly. Because locking up your medical records and doing nothing with it would be very unfortunate. There’s so much value and benefit that can come from analyzing that data and linking it with other sources, because there’s so much that we can learn.
People have a paradoxical attitude when it comes to, say, signing off on end user license agreements, where they just click the thing without reading it. But the optics around health data are just toxic. There’s a high degree of sensitivity around anonymous health data and institutional handling of health data. Do you regard that as more like an obstacle or a design challenge to be overcome?
To protect the privacy of consumers, you need to do two things with the data. One, you have to anonymize the data. Out of the box, no discussion, anonymize the data. That’s step number one. Because if you don’t do that, you can’t even start to have a conversation about trust and good stewardship and responsible data handling. The second thing you need to have in place is, let’s call it a “data access council” or an ethics council within the organization that informs the business about acceptable uses and decisions made from the data. Some organizations, such as Facebook for example, have implemented this and I know other organizations have implemented this as well, where they have a group of wise men and women with different backgrounds that essentially do a review or provide a second opinion about how the data is being analyzed and the decisions that are being made, to avoid discrimination, stigmatization, to avoid creepiness, to avoid surprising consumers with respect to how their data is being used. And that’s really an ethics and business and a public relations decision. It’s subjective. It changes over time. What we accept today is going to be different than what we accepted five years ago. I sit on an ethics review board, and what we did not approve five years ago is normal today. So you have oversight and governance over the use of the data, and you anonymize the data. And with that, you address the most serious privacy concerns. But you have to be transparent about this, you have to be serious about it, you have to do it properly. You can’t put in place Mickey Mouse anonymization or Mickey Mouse governance. And I think with that, you can maintain consumer trust, you can meet regulator expectations. You will have constraints on what you can do with the data, but these are good constraints. They’ll keep you out of trouble.
You mentioned Facebook a minute ago, and I’m sure they do have an ethics commission within the company, but when you have the head of the company going around saying, “Privacy doesn’t exist. Get over it,” that’s not good optics as far as the public is concerned, no matter what’s going on behind the scenes.
Right, but companies grow up as well. That company two years ago, when those statements were made, is not the same company it is today.
Can you explain what your new product, PARAT 6.0, does?
It anonymizes health data, both structured and unstructured data, so it can anonymize free-form text as well. One of its key characteristics is that it ensures data integrity. Most of our clients do analysis. Their business is analytics. So we ensure that the anonymized data gives you the same result, before and after anonymization. So it has minimal impact on the quality of the analytics. Again, this is the difference between serious and sophisticated anonymization, versus the naive approach. With the naive approach, you don’t really get the same data quality. The other key thing is, because we measure risk in the data, we can provide quite strong assurances that the data is anonymized, that it meets regulatory requirements. It’ll stand up to scrutiny. We’ve been working on PARAT for 10 years, so there’s a significant level of sophistication for the tool now to handle very complex data sets.
Sorry, can you back up a little and clarify the point about how it measures risk?
Right, that’s a key feature of our product. We measure the risk of re-identification. So how easy it is for a person to attack the data, we’ve quantified that. We’ve developed some quite sophisticated methods to do this. It’s part of the special sauce of our product. So now that we can measure risk, we can really optimize the anonymization to ensure that the risk is below an acceptable threshold, every single time. We can measure it, we can show it, we can document it. It’s objective, it’s a number, it’s a probability. And it provides pretty good defensibility, not only for regulators and courts, if it ever goes to litigation, but also to the public. The best available known methods are being applied here to protect your health information.
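For technically minded readers, the idea of expressing re-identification risk as a probability and checking it against a threshold can be illustrated with a minimal sketch. This is not PARAT’s method, which the interview does not describe. It assumes a common textbook measure: records that share the same values on a set of indirectly identifying fields form a group, and each record’s risk is taken to be one divided by the size of its group. The field names, sample records, and the threshold of 0.5 are all hypothetical.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Return (max_risk, avg_risk) for a list of dict records.

    Assumed measure (illustrative only): a record's risk is
    1 / (number of records sharing its quasi-identifier values).
    """
    # Build the combination of quasi-identifier values for each record.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    # Count how many records share each combination.
    class_sizes = Counter(keys)
    risks = [1.0 / class_sizes[k] for k in keys]
    return max(risks), sum(risks) / len(risks)

# Hypothetical, already-generalized records (year of birth, postal prefix).
records = [
    {"birth_year": 1980, "postal_prefix": "K1A", "diagnosis": "asthma"},
    {"birth_year": 1980, "postal_prefix": "K1A", "diagnosis": "diabetes"},
    {"birth_year": 1975, "postal_prefix": "K2B", "diagnosis": "asthma"},
    {"birth_year": 1975, "postal_prefix": "K2B", "diagnosis": "flu"},
    {"birth_year": 1962, "postal_prefix": "K1N", "diagnosis": "flu"},
]

THRESHOLD = 0.5  # hypothetical acceptable maximum risk

max_risk, avg_risk = reidentification_risk(records, ["birth_year", "postal_prefix"])
print(f"max risk: {max_risk:.2f}, average risk: {avg_risk:.2f}")
print("below threshold" if max_risk <= THRESHOLD else "needs more generalization")
```

In this sample the last record is unique on its year of birth and postal prefix, so its risk is 1.0 and the data set fails the check. A tool working along these lines would generalize or suppress values until every group is large enough.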
Is it your intention to concentrate for now on anonymizing data in the healthcare sector, and then apply this across to consumer data?
There’s a lot of health data in the consumer space as well, especially with Apple Health and some of the health data collection devices that are going out to the consumer. More generally, the two big spaces or verticals where privacy is a big issue today are health and financial services. We’re focusing on health for now, but financial services is not too far behind. We’re already acquiring clients in financial services as well, but our main focus is still health care.
With Uber in the news recently over its famously cavalier attitude towards its users’ data, and given the current atmosphere around privacy and people’s sensitivity about their personal information, the need for anonymization seems obvious. Are there other applications for this type of technology across the board in the digital world?
Absolutely. Look, anonymization is the only proven, known method to protect personal information. It’s been around for a long time. We know it works, when done properly. That’s the caveat, it has to be done properly. And it’s the only known method that can protect personal information when you share it. So there’s no excuse for not using it, and putting this in place. Because stories like what’s happening at Uber and other companies are going to just keep on happening, unless organizations take this seriously. If you’re building a company or a business that’s essentially a data business, you have to do this. I think awareness is increasing, investors are becoming more knowledgeable about this, VCs are becoming more knowledgeable about this and so they will put pressure on the companies or data businesses that they invest in to pay more attention to this. But the reality is, you can jump up and down and hand-wave, this is the only known method today that works. End of story.
You mentioned earlier this notion of Mickey Mouse anonymization versus actual anonymization. I’d be curious to know what Mickey Mouse anonymization looks like. Is it a company just keeping data in an Excel spreadsheet, basically? How bad is this problem?
You can look at, for example, date of birth. Date of birth is something that can be used to easily identify individuals, because certainly in Canada your date of birth and postal code are unique for pretty much the whole population. I can get your postal code somewhat easily. Date of birth, there are ways to get access to that. So Mickey Mouse anonymization would be to add two days to your date of birth, which you do see people doing. They say, “Well, we added two days to it, so it’s no longer the original date of birth,” but that hasn’t really changed things enough to reduce the risk. There was a recent example in New York City, where the city released all the taxi data, hundreds of millions of records, through a freedom of information request. And the ID for the taxi drivers, the medallion numbers, were pseudonymized. But the pseudonymization scheme they used was very poor. I mean, it was basic pseudonymization 101. What they did is something you never, ever do. Within a couple of hours of the data being posted online, someone reverse-engineered all the medallion numbers and figured out who the taxi drivers were, because the driver numbers and names are public information. So now we know the incomes of all the taxi drivers in the city. So that’s an example of poor anonymization. And we see this happening, and it’s unfortunate. One of the challenges is insufficient standards for how to do this well. I think this is being recognized, and there are efforts now to develop standards. So that’s a good development.
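The taxi example can be made concrete with a short sketch of why a weak pseudonymization scheme is reversible. El Emam does not name the scheme used in the interview, so the following rests on assumptions: that identifiers were replaced by an unsalted hash (MD5 is used here purely for illustration) and that identifiers follow a short, predictable format. The four-character format below (digit, letter, digit, digit) and the sample value "5X41" are hypothetical. Because only 26,000 such identifiers exist, an attacker can hash every one of them and look up each published pseudonym.

```python
import hashlib
import string
from itertools import product

def pseudonymize(identifier: str) -> str:
    """Naive scheme (assumed for illustration): unsalted MD5 of the ID."""
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()

def build_lookup() -> dict:
    """Hash every possible ID in the hypothetical format and index by hash."""
    table = {}
    for d1, letter, d2, d3 in product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[pseudonymize(candidate)] = candidate
    return table

# The attacker precomputes the table once, which takes well under a second.
lookup = build_lookup()

# A value as it might appear in the released data set.
released = pseudonymize("5X41")

# Reversing it is a single dictionary lookup.
recovered = lookup[released]
print(recovered)
```

The weakness is the small, enumerable input space combined with a deterministic function, not the particular hash. A keyed scheme with a secret that never leaves the data custodian, or random identifiers held in a protected mapping table, would not be reversible this way.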
I feel like the handling of personal information is one of the great challenges for small-to-medium enterprise and start-up culture and the tech world in general. Shouldn’t the proper handling of customers’ personal data be table stakes for getting into business?
You can’t make this stuff up. These are legal requirements here. You have to use something that will pass an audit, that will pass an investigation by a regulator, that will pass litigation, that will be defensible in front of a judge. You can’t make this stuff up. Doing it properly is important for business, for government and hospital work, wherever you are.