Thoughts from the CTO: Privacy done right

Privacy by Design

The Facebook privacy scandal of March 2018 raised public awareness of web privacy issues that have existed for years. Millions of Facebook users felt betrayed once they realized how their personal data had been used to predict and even influence political choices. If Facebook had followed “Privacy by Design” principles, they would have anticipated and prevented privacy-invasive events before they happened (principle no. 1). They would also have built privacy into the system by default, so that no action was required from users to protect their privacy (principle no. 2). Hadn’t Facebook heard of “Privacy by Design” and embraced it? They actually had.

In a talk called “Privacy by Design: Building Trust with Your App” at F8, Facebook’s 2016 annual developer conference, Facebook’s privacy team presented its privacy principles, best practices and developer tools. It also emphasized the fact that every new Facebook employee goes through privacy training, every product manager goes through “Privacy by Design” training, and certain employees are responsible for proactively advocating privacy throughout the entire product development lifecycle.

Despite the elaborate privacy program, any developer could have created an app that used the popular Facebook Login option to let users log in to the app with their Facebook account while granting the developer access to their personal profile (name, location, email), their friends list, and data about their friends (as allowed by Facebook’s Terms of Service in 2015). This is essentially how Cambridge Analytica got access to data from millions of Facebook users. That data was collected following Facebook’s rules and guidelines.

A study on “Privacy by Design” published back in 2013 analyzed ten earlier privacy incidents involving Facebook and Google. It concluded that all ten incidents might have been avoided with appropriate privacy engineering. It suggested, however, that the main challenge is not the lack of “Privacy by Design” guidelines, but business requirements that often compete with privacy concerns and win. In other words, “Privacy by Design” principles are not enough. They cannot be effectively enforced without an underlying software architecture that protects users’ privacy by default and makes it virtually impossible for developers to violate users’ privacy, even if that’s what a business requirement asks for. At Onist we call that our “Privacy-First Architecture”.

Privacy-First Architecture

Ultimate privacy protection only comes with software that allows the user, and only the user, to control their personal and sensitive data. Technically speaking, this can be done in 3 different ways:

  1. Storing your data on your device alone,
  2. Encrypting the data and making you the only one with the key, or
  3. Dividing the data into smaller pieces so they’re meaningless on their own and you’re the only one who can put them back together.

1. Storing data only on your device is not realistic. Devices can be lost, stolen or broken, and then the information would be gone. Users today also expect their data to be safely stored in the cloud, accessible from multiple devices and shareable with others.

2. Encrypting data and making sure the user is the only one with the key is not practical for many people. People frequently forget their password (myself included!) and want their service providers to be able to restore access to their accounts.
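
To make option no. 2 concrete, here is a minimal sketch in Python (using the cryptography package; the names, parameters and sample data are illustrative, not any particular product’s implementation). The encryption key is derived from the user’s password, so the service provider never holds it, and a forgotten password means the data is unrecoverable:

```python
import base64
import os

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC


def key_from_password(password: str, salt: bytes) -> bytes:
    """Derive the encryption key from the user's password alone."""
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=480_000)
    return base64.urlsafe_b64encode(kdf.derive(password.encode()))


salt = os.urandom(16)  # stored next to the ciphertext; it is not a secret
key = key_from_password("correct horse battery staple", salt)

ciphertext = Fernet(key).encrypt(b"account balance: 12,345.67")

# Only someone who knows the password can re-derive the key and decrypt.
plaintext = Fernet(key_from_password("correct horse battery staple", salt)).decrypt(ciphertext)

# Forget the password and the key cannot be recreated -- the data is gone for good.
```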

Modern web and mobile apps let users access their data from multiple devices, collaborate and share data with other users, search and analyze information, receive content-based notifications, get insights, see recommendations and so forth. Providing these kinds of services, especially when big data is involved, requires access to information and computing power that aren’t available on the user’s device. Big data analysis, for example, requires computations that can’t be performed efficiently on a phone or laptop. If data is stored only on an individual’s device, or is encrypted so that only the individual holds the key, then the service provider can’t use that data to deliver the very services the user signed up for in the first place.

3. On the other hand, dividing the data into smaller pieces (so that no individual part can violate the user’s privacy and the user is the only one who can put them back together) could be used as the core concept for an effective Privacy-First Architecture. This is based on a de-identification technique called ‘pseudonymization’. This technique is strongly encouraged by the European General Data Protection Regulation (GDPR) as a way to significantly reduce data privacy risks.

Warning: I’m about to get a little technical so feel free to stop here if that isn’t for you.

Pseudonymization is the process of replacing personal identifiers such as name, email address or phone number with artificial identifiers like random combinations of letters and numbers that cannot directly identify the user. This is similar to ‘anonymization’. What’s the difference, you ask? Anonymization is, by definition, not reversible, while pseudonymization makes it possible to track the data back to its origin.
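
In code, pseudonymization can be as simple as swapping a personal identifier for a random token and keeping the token-to-identity mapping somewhere else entirely. A minimal, hypothetical Python sketch:

```python
import secrets

# Kept in a separate, tightly controlled store that only the user can reach.
reidentification_table: dict[str, str] = {}


def pseudonymize(personal_identifier: str) -> str:
    """Replace a personal identifier with an artificial one."""
    token = secrets.token_hex(8)  # e.g. 'f3a91c0b2d7e4a61'
    reidentification_table[token] = personal_identifier
    return token


token = pseudonymize("jane.doe@example.com")

# Pseudonymization is reversible: with the separate table you can get back
# to the original identifier.
original = reidentification_table[token]

# Anonymization, by contrast, would throw the mapping away for good.
```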

For example, a data record could represent someone’s bank account information. It could contain the following fields: account holder name, account number, account type, account balance and date. The account holder name and the account number are personal identifiers because they could be used to identify the person who owns the account. A pseudonymized record would replace them with artificial identifiers so those pieces of data can’t be used to find out who owns the account. A separate record would also hold the personal identifiers for the purpose of re-identification. These would all be stored in different locations with different access rights.
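
As a sketch of what that could look like (the field names are made up for illustration), the original record is split into a pseudonymized record and a separate identity record that share nothing but an artificial account token:

```python
import secrets

original_record = {
    "account_holder_name": "Jane Doe",
    "account_number": "12-3456-789",
    "account_type": "checking",
    "account_balance": 12345.67,
    "date": "2018-03-31",
}

# Artificial identifier linking the two halves back together.
account_token = secrets.token_hex(8)

# Goes into the main application database -- contains no personal identifiers.
pseudonymized_record = {
    "account_token": account_token,
    "account_type": original_record["account_type"],
    "account_balance": original_record["account_balance"],
    "date": original_record["date"],
}

# Stored elsewhere, under different access rights, for re-identification only.
identity_record = {
    "account_token": account_token,
    "account_holder_name": original_record["account_holder_name"],
    "account_number": original_record["account_number"],
}
```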

[Figure: Pseudonymization in data privacy]

Whoever has access to pseudonymized bank account information can see account types, balances and dates. They can perform calculations such as total balance or balance over time, but aren’t able to link the data back to any real person. The user is the only one with access to both the pseudonymized bank account information and their personal identifiers.
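
For example, a backend service holding only the pseudonymized records could still compute aggregates such as a total balance or balance over time without ever touching a name or account number. A hypothetical sketch with made-up data:

```python
from collections import defaultdict

# Pseudonymized records only: no names, no account numbers.
records = [
    {"account_token": "f3a91c0b2d7e4a61", "account_balance": 12345.67, "date": "2018-03-31"},
    {"account_token": "9b47d2e8a1c05f3d", "account_balance": 880.10, "date": "2018-03-31"},
    {"account_token": "f3a91c0b2d7e4a61", "account_balance": 12900.00, "date": "2018-04-30"},
]

# Total balance on a given date.
total_on_date = sum(r["account_balance"] for r in records if r["date"] == "2018-03-31")

# Balance over time, grouped by the artificial account token.
balance_over_time = defaultdict(list)
for r in records:
    balance_over_time[r["account_token"]].append((r["date"], r["account_balance"]))
```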

In this case, a business requirement that violates user privacy would be hard to implement. Breaching privacy would require explicit approval and significant development work to expose not only the pseudonymized information but also the personal identifiers to a third party. It just wouldn’t happen.

A Privacy-First Architecture brings its own challenges in development, information retrieval (search) and technical support. But as I see it, the benefits far outweigh the work that has to be put in up front to show your customers you care about protecting their privacy.