August 14, 2018

GDPR Technical Series #1: Anonymization and Pseudonymization

Subra Ramesh

A fundamental tenet of the European Union’s General Data Protection Regulation (GDPR), which went into effect on May 25, 2018, is the recommendation to pseudonymize personal data wherever possible. Articles 4, 6, 25, 32, 40, and 89, as well as Recitals 28, 29, 75, 78, 85, and 156, explicitly call out pseudonymization. Interestingly enough, the GDPR does not refer to anonymization anywhere in the text. While the GDPR deems pseudonymization sufficient, it does not preclude other means of data protection (Recital 28).

In this blog post, we examine differences between these terms. In the next blog post of our GDPR Series, we cover common element-level protection techniques and map anonymization and pseudonymization to those techniques.

Anonymization vs. Pseudonymization

First, let’s take the ideal case: anonymization. Anonymization is the transformation of data so that it can no longer be associated with a particular person. For anonymization to be effective, identifying the person behind the data must not be possible, even when other knowledge is combined with the anonymized data. The problem for data controllers and data processors is that, in most cases, perfectly anonymized data is also useless for analytics. Losing the ability to do valuable analytics could be one explanation for the GDPR’s omission of anonymization. Even then, however, anonymized data can still be useful for development and testing.

Consider the original table below (Fig 1).

| Last Name | First Name | Employee ID | Email Address | Title | Start Date | Department | Salary | Vacation Available (days) |
|---|---|---|---|---|---|---|---|---|
| Jones | Edward | 3565486 | ejones234@kmail.com | Technical Manager | 09/01/2013 | IT | $130,000 | 20 |
| Xu | Jason | 56544884 | j.xu.563@zahoo.com | Architect | 07/01/2010 | Engineering | $125,000 | 15 |
| Stanton | Joseph | 2484686 | Joseph.stanton4599@kmail.com | CEO | 01/03/2008 | HQ | $400,000 | 12 |
| Powers | Rebecca | 4856459 | Beckyp43543@kmail.com | Director | 02/02/2011 | Sales | $140,000 | 18 |

Fig 1 — Table before Anonymization

After anonymization, the table would look like the one below (Fig 2).

| Last Name | First Name | Employee ID | Email Address | Title | Start Date | Department | Salary | Vacation Available (days) |
|---|---|---|---|---|---|---|---|---|
| Jenkins | David | 34543593 | djenkins546@kmail.com | General Manager | 07/02/2010 | IT | $170,000 | 15 |
| Cortes | Ramona | 63458245 | ramonacortes234@zahoo.com | Director | 05/02/2008 | Engineering | $145,000 | 15 |
| Helleboid | Jean | 56455344 | jean.helleboid764@kmail.com | Manager | 04/02/2011 | HQ | $155,000 | 10 |
| Watson | Brian | 34534887 | BrianWatson7679@kmail.com | Manager | 06/07/2004 | Sales | $123,000 | 19 |

Fig 2 — Table after Anonymization

Note that every field in Fig 2 is transformed except “Department.” Assuming each department consists of more than one person, getting back to the original data will not be possible, even with additional external information. If IT were a single-person department, however, the IT row could be tied to that person. In our example, that link would still reveal nothing, because every value other than “Department” has been replaced.

Now consider pseudonymization. Let us assume that in addition to the “Department” column the “Salary” column is also not transformed. The following table (Fig 3) results from that action rather than the table in Fig 2.

| Last Name | First Name | Employee ID | Email Address | Title | Start Date | Department | Salary | Vacation Available (days) |
|---|---|---|---|---|---|---|---|---|
| Jenkins | David | 34543593 | djenkins546@kmail.com | General Manager | 07/02/2010 | IT | $130,000 | 15 |
| Cortes | Ramona | 63458245 | ramonacortes234@zahoo.com | Director | 05/02/2008 | Engineering | $125,000 | 15 |
| Helleboid | Jean | 56455344 | jean.helleboid764@kmail.com | Manager | 04/02/2011 | HQ | $400,000 | 10 |
| Watson | Brian | 34534887 | BrianWatson7679@kmail.com | Manager | 06/07/2004 | Sales | $140,000 | 19 |

Fig 3 — Table after Pseudonymization

In this case, if CEO Joseph Stanton (row 3) had not been in the data set, it would have been equivalent to an anonymized set. However, since the “Salary” column has not been transformed, the outlier in that column ($400,000) gives away the CEO’s identity and information. In other words, the knowledge that the CEO is likely to be paid well above everyone else essentially re-identifies the record after pseudonymization.
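To make the re-identification concrete, here is a minimal Python sketch. The rows are the Fig 3 values; the dictionary keys and the function name `likely_ceo_record` are our own illustrative choices, and the only "attack" is the assumption that the CEO is the top earner.

```python
# Pseudonymized rows from Fig 3: names are replaced, Department and Salary are kept.
fig3 = [
    {"last_name": "Jenkins", "department": "IT", "salary": 130_000},
    {"last_name": "Cortes", "department": "Engineering", "salary": 125_000},
    {"last_name": "Helleboid", "department": "HQ", "salary": 400_000},
    {"last_name": "Watson", "department": "Sales", "salary": 140_000},
]


def likely_ceo_record(rows):
    """Pick the highest-paid row, using the outside knowledge
    that the CEO is probably the top earner (an assumption)."""
    return max(rows, key=lambda row: row["salary"])


suspect = likely_ceo_record(fig3)
print(suspect["last_name"], suspect["department"], suspect["salary"])
```

The row returned is the one carrying the pseudonym “Helleboid,” so anyone who knows who the CEO is now also knows which record, and which salary, belongs to him.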

In the examples above, since the number of fields transformed is substantial in comparison with the total number of fields, the data, while usable for testing, is rendered useless for meaningful analysis. To be able to draw meaningful conclusions, the fields of interest in analysis need to be available without transformation—or at least be in the same range—so that aggregate results are the same.
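A quick way to see the effect on aggregates is to compare the average salary across the three tables. The sketch below is illustrative only: the lists are the Salary columns from Fig 1, Fig 2, and Fig 3, we assume the first Fig 1 salary is 130,000 (the value that appears untransformed in Fig 3), and `mean_salary` is a helper name of our own.

```python
# Salary columns copied from the three tables, in row order.
fig1_salaries = [130_000, 125_000, 400_000, 140_000]  # original (first value assumed 130,000)
fig2_salaries = [170_000, 145_000, 155_000, 123_000]  # anonymized: Salary replaced
fig3_salaries = [130_000, 125_000, 400_000, 140_000]  # pseudonymized: Salary kept


def mean_salary(salaries):
    """Arithmetic mean of a list of salaries."""
    return sum(salaries) / len(salaries)


for label, salaries in [
    ("Fig 1 (original)", fig1_salaries),
    ("Fig 2 (anonymized)", fig2_salaries),
    ("Fig 3 (pseudonymized)", fig3_salaries),
]:
    print(f"{label}: mean salary = {mean_salary(salaries):,.0f}")
```

Fig 1 and Fig 3 both average 198,750, while Fig 2 averages 148,250. Keeping the Salary column untransformed preserves the aggregate, and that same choice is what exposes the outlier.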

Degrees of Anonymization

The degree of anonymization, and indeed whether a data set is anonymized or pseudonymized, depends on the nature of the untransformed data and how much it might reveal. Once there is enough additional information to give clues that identify the original values, the transformation is pseudonymization rather than anonymization. In the example above, the sufficient additional information is the knowledge that the CEO is likely the highest-paid employee in the company. Additional information might be public information or data available in other tables or data stores in the organization.

There are measures of the degree of anonymization, such as the family of data similarity measures, including k-anonymity, l-diversity, t-closeness, and other criteria. There are also newer techniques such as differential privacy that de-identify the data. We will go into these measures and techniques in another blog post devoted to that subject.
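As a small preview, two of these measures can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation: the rows, the `salary_band` column, and the function names are hypothetical, and the l-diversity shown is the simple "distinct values" variant. Here k is the size of the smallest group of rows that share the same quasi-identifier values, which is the same idea as the single-person department discussed earlier.

```python
from collections import defaultdict

# Hypothetical data set: Department is the quasi-identifier,
# salary_band is the sensitive attribute.
rows = [
    {"department": "IT", "salary_band": "100-150k"},
    {"department": "IT", "salary_band": "150-200k"},
    {"department": "Engineering", "salary_band": "100-150k"},
    {"department": "Engineering", "salary_band": "100-150k"},
    {"department": "Engineering", "salary_band": "150-200k"},
    {"department": "HQ", "salary_band": "350-400k"},
]


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same values for all quasi-identifier columns."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in quasi_identifiers)
        classes[key].append(row)
    return classes


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest equivalence class."""
    groups = equivalence_classes(rows, quasi_identifiers).values()
    return min(len(group) for group in groups)


def distinct_l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any equivalence class."""
    groups = equivalence_classes(rows, quasi_identifiers).values()
    return min(len({row[sensitive] for row in group}) for group in groups)


without_hq = [row for row in rows if row["department"] != "HQ"]

print(k_anonymity(rows, ["department"]))        # HQ has one row, so k is 1
print(k_anonymity(without_hq, ["department"]))  # smallest remaining group has 2 rows
print(distinct_l_diversity(without_hq, ["department"], "salary_band"))
```

With the single-person HQ group included, k is 1 and that record stands alone. Dropping or generalizing that group raises k to 2 in this toy data.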

As we saw in the discussion above, anonymization and pseudonymization are distinct approaches that protect the data as a whole, in the aggregate. Their effects are achieved by applying transformations at the element level. We will delve into these element-level techniques in the next blog post and map them to anonymization and pseudonymization.

