BootcampSD: Anonymisation and modification of research data

Effective anonymisation minimises the risk of re-identification and, in many cases, makes data suitable for sharing. The exact information that needs to be removed or modified during anonymisation will vary depending on the contents of the dataset and the reason the unmodified data was deemed unsuitable for publication. As a first step, all ‘direct identifiers’ must be removed from datasets intended for open publication. These include personal information such as name, date of birth, address, telephone number and email address, as well as unanonymised photographs, video or audio recordings.
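
As an illustration of removing direct identifiers from tabular data, the following is a minimal Python sketch. The field names (name, date_of_birth, email and so on) and the sample records are hypothetical and would need to be matched to the actual dataset; media files such as photographs or recordings are outside its scope.

```python
# Hypothetical field names treated as direct identifiers; adjust to the dataset.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "address", "telephone", "email"}


def drop_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without any direct-identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in identifiers}
        for record in records
    ]


# Made-up example records for illustration only.
records = [
    {"name": "A. Example", "email": "a@example.org", "age_band": "30-39", "score": 12},
    {"name": "B. Example", "email": "b@example.org", "age_band": "40-49", "score": 17},
]

cleaned = drop_direct_identifiers(records)
print(cleaned)
```

The function returns new records rather than editing the originals, so the unmodified data can be kept securely while only the cleaned copy is prepared for publication.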

Removing all direct identifiers may not be enough to prevent the identification of individuals. Datasets that contain two or more ‘indirect identifiers’ may identify participants when these identifiers are considered together. This is called ‘triangulation’. Examples of indirect identifiers include gender, a rare disease, experience or characteristic, place of birth, ethnicity, socioeconomic data and body measures.
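
One way to gauge the risk from triangulation is to count how many records share each combination of indirect identifiers; a combination held by only one record points to a single participant. The sketch below does this in Python. The field names, the sample records and the threshold of 2 are assumptions for illustration, and the appropriate threshold depends on the dataset and its sensitivity.

```python
from collections import Counter


def group_sizes(records, indirect_identifiers):
    """Count how many records share each combination of identifier values."""
    return Counter(
        tuple(record[field] for field in indirect_identifiers) for record in records
    )


def at_risk(records, indirect_identifiers, threshold=2):
    """Return records whose identifier combination occurs fewer than `threshold` times."""
    sizes = group_sizes(records, indirect_identifiers)
    return [
        record
        for record in records
        if sizes[tuple(record[field] for field in indirect_identifiers)] < threshold
    ]


# Made-up example records for illustration only.
participants = [
    {"gender": "F", "place_of_birth": "Leeds"},
    {"gender": "F", "place_of_birth": "York"},
    {"gender": "M", "place_of_birth": "Leeds"},
    {"gender": "M", "place_of_birth": "York"},
    {"gender": "F", "place_of_birth": "Leeds"},
    {"gender": "M", "place_of_birth": "York"},
]

print(at_risk(participants, ["gender"]))                    # no unique records
print(at_risk(participants, ["place_of_birth"]))            # no unique records
print(at_risk(participants, ["gender", "place_of_birth"]))  # two unique records
```

In this example neither identifier singles anyone out on its own, but the two together isolate two participants. Records flagged this way are candidates for further modification, such as grouping values into broader categories.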

Potential data-linking also poses a risk. On its own, a dataset may not contain enough information to identify individuals or place subjects at risk of identification, but identification may become possible when two or more datasets are combined. Researchers planning to publish datasets must consider this possibility.
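
The data-linking risk can be demonstrated with a small Python sketch that joins a published dataset to a second, hypothetical source on the fields they have in common. All field names and records here are invented for illustration; a match is only reported when exactly one record in the second source shares the same values.

```python
def link(released, external, keys):
    """Join released records to external ones where the key values match uniquely."""
    # Index the external records by their values for the shared fields.
    index = {}
    for row in external:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for row in released:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a single candidate means the person is singled out
            matches.append({**row, **candidates[0]})
    return matches


# Published dataset: no direct identifiers (made-up records).
released = [
    {"birth_year": 1980, "postcode_area": "LS6", "diagnosis": "X"},
    {"birth_year": 1975, "postcode_area": "YO1", "diagnosis": "Y"},
]

# Hypothetical second source that does contain names (made-up records).
external = [
    {"name": "A. Example", "birth_year": 1980, "postcode_area": "LS6"},
    {"name": "B. Example", "birth_year": 1975, "postcode_area": "YO1"},
    {"name": "C. Example", "birth_year": 1975, "postcode_area": "YO1"},
]

linked = link(released, external, ["birth_year", "postcode_area"])
print(linked)
```

Here the first published record links to exactly one named person, while the second remains ambiguous because two people share the same values. Running a check of this kind against sources that are likely to be available to others can help show which shared fields need to be removed or generalised before publication.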