BootcampSD: Modifying information

Modifying information
Section 9 of 10

Completely removing information from a dataset ensures that it cannot be used to identify participants or subjects. Sometimes, however, information can be modified enough that it no longer poses a risk of identification and can therefore remain in the dataset. This takes more effort but is a good option if removing the information completely would significantly devalue the dataset.

Methods of modifying data to limit identification include:

  • Combining responses into categories, or into fewer categories than in the original dataset. This is a good option if only a small number of people or subjects possess the characteristic. For example, year of birth can be collapsed into 5- or 10-year bands if only a few people in a dataset share a specific birth year.
  • Top and bottom coding: collapsing categories into upper and/or lower thresholds. This is a good option if only a small number of people have high or low measurements on a characteristic. For example, if few people report that they have more than five children, these participants can be combined with those who report five children and recoded as ‘5+ children’.
  • Rounding dates, times, or measurements. This reduces the risk of identification when only a small number of people or subjects have a specific value.
  • Data suppression: replacing data with ‘missing data’ where including it poses a risk of identification. Single values may be deleted, or all data for an at-risk research participant or subject.

The methods used to modify a dataset must be documented, and that documentation must be published alongside the modified dataset if the dataset is to retain its value to other researchers.
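
The four methods above can be sketched in a few lines of Python. This is an illustrative sketch only, not a prescribed procedure: the function names, the band width, the '5+' threshold, and the minimum count of 3 are all assumptions chosen for the example, and suitable values depend on the dataset.

```python
from collections import Counter
from datetime import date


def birth_year_band(year, width=10):
    """Combine categories: collapse a birth year into a band, e.g. '1980-1989'."""
    start = (year // width) * width
    return f"{start}-{start + width - 1}"


def top_code(value, threshold=5):
    """Top coding: report any value at or above the threshold as e.g. '5+'."""
    return f"{threshold}+" if value >= threshold else str(value)


def round_to_month(d):
    """Rounding: replace an exact date with the first day of its month."""
    return d.replace(day=1)


def suppress_rare(values, min_count=3):
    """Suppression: set values shared by fewer than min_count records to None.

    min_count=3 is an arbitrary illustrative threshold, not a recommendation.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else None for v in values]


if __name__ == "__main__":
    print(birth_year_band(1987))                # 1980-1989
    print(top_code(7))                          # 5+
    print(round_to_month(date(2021, 3, 17)))    # 2021-03-01
    print(suppress_rare(["a", "a", "a", "b"]))  # ['a', 'a', 'a', None]
```

Whatever parameters are chosen (band width, thresholds, minimum counts), they are part of the documentation that should be published with the modified dataset.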

    Question: Which of the following is a direct identifier?

    1. Year of birth

    2. Names of relatives

    3. Occupation