Seven common misconceptions about data anonymization
Personal health data is valuable. It can be used for research, innovation, and development, but secondary use of health data is only possible if privacy protection can be secured and the data is of high quality. Data anonymization is a crucial process that can ensure the privacy and security of sensitive information while allowing it to be utilized for secondary downstream analysis, but many misconceptions surround data anonymization. This article will explore some of these misconceptions and provide a clearer understanding of the reality of data anonymization.
Misconception #1: “The quality of anonymized data is not good”
One common misconception is that the quality of anonymized data is not good enough for practical downstream use. This idea often stems from experience with outdated, legacy anonymization techniques.
Next-generation anonymization technologies like VEIL.AI’s Anonymization Engine can maintain excellent data quality, and can even achieve the same results in downstream analysis as original data.
“In our study, we could draw the same conclusions from anonymized data as from traditional pseudonymized, individual-level research data.”
—Jussi Leinonen, Strategic Project Lead, Bayer
Misconception #2: “There is only one way to anonymize a dataset”
Anonymization is a process that ensures that no individual in the data can be re-identified. This process is done by subtly changing the contents of the data such that it meets a specific set of privacy-protecting criteria.
But there are actually an astronomical number of ways to change a dataset so that the privacy criteria are fulfilled and no individual can be identified in the data. This means that, depending on the approach you use, anonymization can produce anything from poor-quality, unusable data to very high-quality data.
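As a minimal sketch of this point (using hypothetical toy data, not VEIL.AI's method): the same privacy criterion, here k-anonymity on the quasi-identifiers age and postcode, can be met by many different generalizations of a dataset, and they preserve very different amounts of detail.

```python
from collections import Counter

# Hypothetical toy records: (age, postcode). For illustration only.
records = [(34, "00100"), (36, "00120"), (35, "00150"),
           (61, "00200"), (63, "00230"), (62, "00210")]

def is_k_anonymous(rows, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return min(Counter(rows).values()) >= k

# Strategy A: coarse generalization -- all ages become "adult", postcodes lose detail.
coarse = [("adult", "00***") for age, pc in records]

# Strategy B: finer generalization -- 10-year age bands, 3-digit postcode prefixes.
fine = [(f"{(age // 10) * 10}s", pc[:3] + "**") for age, pc in records]

# Both satisfy k = 3, but strategy B keeps age bands and regions usable for analysis.
assert is_k_anonymous(coarse, k=3)
assert is_k_anonymous(fine, k=3)
```

Both outputs are "anonymous" under the same criterion; picking the one that retains the most analytic value is exactly the optimization problem described above.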
Unlike legacy anonymization methods, VEIL.AI’s Next-Generation Anonymization uses artificial intelligence to pick the optimal solution for anonymizing the data. This means that the data is changed in such a way that it both fulfills privacy criteria and looks just like the original data in almost all properties. As noted above, it can even allow an analyst to come to the same conclusions as if they were using the original data.
Misconception #3: “Anonymized data is always aggregated data”
Another misconception is that anonymized data is always aggregated, lacking individual-level details. However, advancements in anonymization techniques have made it possible to achieve high-quality record-level anonymization.
The real misunderstanding comes from the definition of an “aggregate”. Traditionally, an aggregate is calculated by averaging three to five data points together. But since anonymized data contains row-level records that are indistinguishable from one another (in other words, there are no unique records), those records can be considered an aggregate without the numbers having to be averaged together.
Misconception #4: “Anonymization can be achieved by simply removing direct identifiers”
An often oversimplified view of anonymization is that it can be achieved by merely removing direct identifiers from a dataset. However, this approach falls short in ensuring privacy and is not necessarily compliant with the European General Data Protection Regulation (GDPR).
Anonymization requires a more comprehensive approach that goes beyond direct identifier removal, because individuals might still be identifiable indirectly. It involves considering the context and nature of the data, understanding potential indirect identifiers, and applying appropriate anonymization techniques to prevent re-identification. A thorough understanding of privacy risks, combined with robust anonymization methods, is essential to achieve effective privacy protection.
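A minimal sketch of why this matters (hypothetical data, illustrative only): even after the direct identifier is removed, the remaining quasi-identifiers can single out every individual.

```python
from collections import Counter

# Hypothetical records after the direct identifier (name) has been removed.
deidentified = [
    {"zip": "02138", "birth_year": 1965, "sex": "F"},
    {"zip": "02138", "birth_year": 1972, "sex": "M"},
    {"zip": "02139", "birth_year": 1965, "sex": "F"},
]

# The remaining fields act as quasi-identifiers.
quasi = [(r["zip"], r["birth_year"], r["sex"]) for r in deidentified]
counts = Counter(quasi)

# Every record is unique on its quasi-identifiers, so anyone who knows a
# person's zip code, birth year, and sex can re-identify their row.
unique_rows = [q for q, n in counts.items() if n == 1]
print(f"{len(unique_rows)} of {len(quasi)} records are unique")  # → 3 of 3 records are unique
```

This is why a proper anonymization process must also treat indirect identifiers, not just strip names and ID numbers.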
Misconception #5: “HIPAA de-identified data is the same as anonymized data”
Many people assume that HIPAA (Health Insurance Portability and Accountability Act) de-identified data is synonymous with anonymized data. However, this is not always the case.
HIPAA requires certain direct identifiers to be removed to de-identify data (read more about HIPAA de-identified vs GDPR anonymized data here). But as mentioned before, removing direct identifiers from the data does not guarantee true anonymity. Anonymization involves a more rigorous process of ensuring that data cannot be re-identified, providing a higher level of privacy protection.
Misconception #6: “I need to have a massive amount of data to anonymize it”
The belief that a massive amount of data is necessary for effective anonymization is another misconception. The data size requirement depends on the nature of the dataset and the analyses being performed.
Certain analyses are more sensitive to small variations, while others benefit from larger datasets. The level of uncertainty increases with the number of variables, whereas specificity improves with a larger number of data points. The number of records needed for anonymization varies accordingly.
The best way to ensure that the amount of data is sufficient for anonymization is to ask. In addition to data anonymization, VEIL.AI offers post-anonymization privacy risk analysis. This pipeline generates re-identification risk metrics and helps you evaluate whether your dataset still contains privacy risks.
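As an illustration of what a re-identification risk metric can look like (a standard "prosecutor-model" sketch on hypothetical data, not VEIL.AI's proprietary pipeline): each record's risk can be estimated as one over the size of its equivalence class, the group of records sharing its quasi-identifier values.

```python
from collections import Counter

def reidentification_risk(quasi_rows):
    """Per-record risk as 1 / equivalence-class size (prosecutor model, illustrative).
    Returns the maximum and average risk across all records."""
    sizes = Counter(quasi_rows)
    risks = [1 / sizes[row] for row in quasi_rows]
    return max(risks), sum(risks) / len(risks)

# Hypothetical generalized records: (age band, postcode prefix).
rows = [("30s", "001**"), ("30s", "001**"), ("30s", "001**"), ("60s", "002**")]
max_risk, avg_risk = reidentification_risk(rows)
print(max_risk)  # → 1.0  (the lone ("60s", "002**") record is unique, hence fully at risk)
```

A maximum risk of 1.0 flags that at least one record is still unique, which is exactly the kind of residual risk a post-anonymization analysis is designed to surface.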
Misconception #7: “Anonymization is a one-size-fits-all solution”
Some individuals believe that there is a universal anonymization technique that can be applied to any dataset, regardless of its nature or purpose. However, the reality is that effective anonymization is highly context-dependent. Different datasets have distinct privacy risks, and the appropriate anonymization methods should be tailored accordingly.
Factors such as the sensitivity of the data, the intended use, and legal and regulatory requirements should be considered when designing an anonymization strategy. One-size-fits-all approaches often lead to either inadequate privacy protection or unnecessary loss of data utility.
Because VEIL.AI’s Anonymization Engine utilizes AI to search for the best way to anonymize your dataset, you can customize the parameters with the particular variables that are most important for your data or analysis question.
Text: Timo Miettinen