De-identification vs Pseudonymisation: Which Standard Does Your Project Require?

Introduction

One of the most common misunderstandings in health data governance is the conflation of de-identification and pseudonymisation. They are not synonymous, they carry different legal implications under UK GDPR, and the distinction determines whether the data you are working with remains in scope for data protection law at all.

This matters enormously for research organisations, clinical trial sponsors, and AI developers working with NHS records. Getting the classification wrong can expose you to regulatory risk and can compromise the legal basis on which your project rests.

Defining the Terms

Pseudonymisation

Pseudonymisation is defined in UK GDPR Article 4(5) as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.”

The critical point is that pseudonymised data remains personal data under UK GDPR. The additional information needed to reverse the pseudonymisation (the key) still exists, even if the controller or processor keeps it separately, so the data can be re-identified. All UK GDPR obligations continue to apply in full.

Pseudonymisation is a security measure, not a route out of data protection law. It reduces the risk of re-identification in transit or in the event of a breach, and it is explicitly cited in Article 89 as an appropriate safeguard for research processing, but it does not remove the data from scope.
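One common way to implement pseudonymisation is a keyed hash over the direct identifier. The sketch below is illustrative only and assumes HMAC-SHA256 from the Python standard library. The field names (patient_id, diagnosis) and the helper names (generate_key, pseudonymise) are hypothetical and are not taken from any particular system. The key plays the role of the “additional information” in Article 4(5), so it would need to be stored and access-controlled separately from the pseudonymised records.

```python
import hashlib
import hmac
import secrets


def generate_key() -> bytes:
    """Create a random 256-bit secret. Keep it apart from the dataset."""
    return secrets.token_bytes(32)


def pseudonymise(identifier: str, key: bytes) -> str:
    """Return a deterministic keyed pseudonym (HMAC-SHA256, hex encoded)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical records; the identifiers are made up for illustration.
records = [
    {"patient_id": "ID-0001", "diagnosis": "asthma"},
    {"patient_id": "ID-0002", "diagnosis": "type 2 diabetes"},
    {"patient_id": "ID-0001", "diagnosis": "asthma review"},
]

key = generate_key()

# Replace the direct identifier with its pseudonym; drop the original field.
pseudonymised = [
    {"pid": pseudonymise(r["patient_id"], key), "diagnosis": r["diagnosis"]}
    for r in records
]

# Rows for the same patient share a pseudonym, so longitudinal linkage works.
same_patient = pseudonymised[0]["pid"] == pseudonymised[2]["pid"]
```

Because anyone holding the key can recompute the pseudonym for a known identifier, the output of this sketch is still personal data. It only illustrates the separation of key and dataset that the definition requires.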

De-identification / Anonymisation

The ICO’s anonymisation code of practice describes anonymisation as the process of rendering personal data into a form in which the data subject is no longer identifiable, taking into account all means reasonably likely to be used to re-identify them. Anonymised data falls outside the scope of UK GDPR entirely.

The key test is whether re-identification is “reasonably likely” given the resources, technology, and motivation of likely adversaries. This is not an absolute standard; it is a risk-based assessment that must be documented and reviewed as new re-identification techniques emerge.

Key point: The ICO has repeatedly warned that a claim of anonymisation is insufficient unless it is backed by a documented, robust re-identification risk assessment. The risk assessment must be proportionate to the sensitivity of the data and the context of its use.

The Risk Model

Whether data is truly anonymous is determined by a structured risk assessment, typically addressing:

Singling out: Can an individual be distinguished from others in the dataset?
Linkability: Can records relating to the same individual be linked across datasets?
Inference: Can sensitive attributes be inferred about individuals from the dataset?
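The singling-out question can be made concrete by grouping records on their quasi-identifiers (attributes that are not direct identifiers but could identify someone in combination) and looking at the size of the smallest group. The following is a minimal sketch under stated assumptions: the column names, the choice of quasi-identifiers, and the function names are all hypothetical, and a real assessment would also have to consider linkability and inference.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group of rows that look alike (0 if no rows)."""
    sizes = group_sizes(rows, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def singled_out(rows, quasi_identifiers):
    """Rows whose quasi-identifier combination is unique in the dataset."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] == 1
    ]


# Hypothetical, already-generalised rows (age bands, postcode areas).
rows = [
    {"age_band": "30-39", "postcode_area": "LS", "sex": "F", "condition": "asthma"},
    {"age_band": "30-39", "postcode_area": "LS", "sex": "F", "condition": "diabetes"},
    {"age_band": "40-49", "postcode_area": "M", "sex": "M", "condition": "asthma"},
]
QUASI_IDENTIFIERS = ["age_band", "postcode_area", "sex"]

k = smallest_group(rows, QUASI_IDENTIFIERS)
unique_rows = singled_out(rows, QUASI_IDENTIFIERS)
```

In this toy dataset the third row is the only one in its group, so that individual can be singled out. The smallest group size is the quantity on which a k-anonymity threshold is set.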

Formal privacy models such as k-anonymity, l-diversity, and t-closeness provide quantitative frameworks for measuring these risks. Differential privacy provides probabilistic guarantees against inference attacks. The appropriate model depends on the data’s intended use, the likely adversary, and whether the output of the analysis (not just the dataset) could facilitate re-identification.
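To illustrate what a probabilistic guarantee looks like in practice, the sketch below applies the Laplace mechanism, a standard differential privacy technique, to a counting query. It is a teaching example under stated assumptions rather than a production implementation: the function names and the epsilon value are hypothetical, the query is assumed to have sensitivity 1 (adding or removing one person changes the count by at most 1), and naive floating-point noise generation has known weaknesses that vetted libraries are designed to address.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count using the Laplace mechanism (sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # Smaller epsilon means a larger noise scale and stronger privacy.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)  # seeded only so the example is repeatable
true_count = 100         # e.g. number of patients with a given condition
noisy = dp_count(true_count, epsilon=0.5, rng=rng)

# Averaging many independent releases converges on the true count, which is
# why repeated queries consume privacy budget and must be accounted for.
samples = [dp_count(true_count, 0.5, rng) for _ in range(20000)]
mean_release = sum(samples) / len(samples)
```

The design point is that the guarantee attaches to the released output rather than to the dataset, which is why the choice of model depends on how the analysis results will be used.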

Which Standard Does Your Project Require?

Use pseudonymisation when:

You need to re-link records to the original patient at a later stage (e.g., longitudinal follow-up in a clinical trial)
The data custodian requires the ability to remove a patient’s data on request (under data subject rights)
You are transmitting data between trusted parties within the same governance framework

Use de-identification when:

The dataset will be shared outside the governance boundary of the original controller
The data will be used to train AI models on open or shared infrastructure
You need to demonstrate that the dataset falls outside the scope of data protection law for publication or open research purposes

Practical Implications for Dataset Procurement

When sourcing health data through a broker or directly from an NHS data controller, you should receive explicit documentation of the de-identification methodology applied. This should include: the privacy model used (k-anonymity threshold, suppression rules, noise addition), a re-identification risk assessment, and documentation of whether the output is classified as anonymous or pseudonymous under the ICO framework.

MD DataVault provides a full de-identification methodology report with every dataset, including the specific k-anonymity parameters applied and a documented risk assessment. This is a contractual commitment, not a best-effort claim.

Conclusion

The pseudonymisation/de-identification distinction is not a technicality. It determines your legal obligations, your organisation’s exposure to data protection enforcement, and the legitimacy of your research governance framework. Invest in getting it right at the design stage.