Last week, we discussed federated learning, a tool that can help protect privacy when brain data is used to train machine learning models. This week, we will continue investigating privacy protection measures for brain data by exploring the concept of differential privacy (DP). Differential privacy is a mathematical framework that sets a privacy criterion for data when it's shared. Given the extreme sensitivity of brain data, differential privacy could be particularly useful because it provides a guaranteed level of privacy protection for each person whose brain data appears in a dataset. Let's explore how this privacy mechanism works and how it can be applied to brain data protection.
When information from a dataset is shared, differential privacy can prevent someone from extracting sensitive information from what has been released. Rather than releasing the full data outright, a DP mechanism releases summary statistics that convey information about the data without exposing it directly, and those summary statistics carry the DP guarantee. More specifically, differential privacy ensures that when an analysis is performed on the data, the output does not reveal information about any specific person included in it. Under the definition of differential privacy, the inclusion or exclusion of any single individual's data should not significantly change the outcome of the analysis. For example, imagine you're looking at the brain activity of a random sample of teenagers, and one of the individuals has significantly higher-frequency EEG spikes. That reading appears as an outlier within the data, so including or excluding this single data point would greatly change the outcome of a computation. Think about taking the average or the range of the dataset: a single unusually high-frequency EEG reading would dramatically alter either one. To protect this individual from being flagged in the dataset, DP tools blur out the outlier so that the individual can't be identified. To accomplish this, "noise" is added to the analysis. Noise often takes the form of randomized values added to the result to obscure identifiable or sensitive information.
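To make this concrete, here is a minimal sketch of the classic Laplace mechanism applied to a mean. The EEG spike-frequency values, the clipping bounds, and the epsilon value are all hypothetical choices for illustration; the key ideas from the paragraph above are visible in the code: values are clipped so one outlier can only move the mean a bounded amount, and random noise is added to the released statistic.

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample from Laplace(0, scale) via the inverse-CDF transform.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_mean(values, lower, upper, epsilon, rng=None):
    """Differentially private mean using the Laplace mechanism.

    Clipping each value to [lower, upper] bounds how much any one
    person can change the mean (the query's "sensitivity"); the
    Laplace noise scale is sensitivity / epsilon.
    """
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical EEG spike rates (Hz); the last value is the outlier.
spikes = [8.1, 9.4, 7.8, 8.6, 9.0, 42.0]
print(dp_mean(spikes, lower=0.0, upper=15.0, epsilon=1.0))
```

Note that the outlier is handled twice: clipping caps its influence on the mean, and the added noise hides whether the clipped value was included at all.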
Differential privacy criteria can be set at varying degrees of stringency, which determines how thoroughly outliers and other privacy-compromising data features must be hidden to satisfy the privacy guarantee. The level of protection in a DP tool is set by the "privacy loss parameter," usually denoted epsilon (𝜀): a variable within the DP formula that determines how much privacy any single individual can lose when an analysis is performed on a dataset. A privacy loss parameter of zero means that the privacy of each individual in the dataset is fully guaranteed. However, perfect privacy always comes with a trade-off: the stronger the privacy guarantee in the DP equation, the more noise is used and, consequently, the less accurate the outcome of any analysis on that data. A privacy parameter of zero is seldom used because it requires so much noise that it renders the data nearly useless.
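The privacy/accuracy trade-off can be seen directly in the Laplace mechanism: the noise scale is sensitivity divided by epsilon, so a smaller epsilon (stronger privacy) means proportionally larger noise. The sketch below, using an assumed sensitivity-1 counting query purely for illustration, estimates the average size of the noise at a few epsilon values.

```python
import math
import random
import statistics

def laplace_noise(scale, rng):
    # Sample from Laplace(0, scale) via the inverse-CDF transform.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

rng = random.Random(42)
sensitivity = 1.0  # e.g. a count: one person changes it by at most 1
for epsilon in (0.01, 0.1, 1.0, 10.0):
    scale = sensitivity / epsilon
    errors = [abs(laplace_noise(scale, rng)) for _ in range(10_000)]
    print(f"epsilon={epsilon:>5}: mean |noise| ~ {statistics.mean(errors):.2f}")
```

As epsilon shrinks toward zero, the expected noise grows without bound, which is why an epsilon of zero makes the released statistics useless in practice.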
The trade-off between privacy and accuracy has made applying differential privacy in practical scenarios challenging. Nonetheless, it is gaining more and more traction. This year, the U.S. government will be using differential privacy tools in the 2020 census to prevent some aspects of the data, such as congressional district population information, from being reported to the public in fine detail. The arsenal of differential privacy tools is broadening every day, with each new method tailorable to different contexts. As databases of brain data grow, differential privacy could be an important way to share the insights and information housed within citizens' brain data without compromising personal privacy.