How pseudonyms additionally protect personal data

Smals has developed three techniques that use pseudonyms to give citizens' sensitive data an extra layer of protection.

Within the public sector, there are many systems that process sensitive information of individuals, including medical and social data. This data is very valuable, including to internal and external malicious parties, and thus must be adequately protected.

Cybersecurity consists of a broad set of measures and techniques, including policies, firewalls, access control and encryption. Identifier pseudonymization operates at the same low level as encryption: in the data layer, where it protects the data itself. With this technique, data is no longer processed under identifiers such as national registry numbers, but under pseudonyms. These are unique codes that can only be converted back to the original national registry number with a key. Moreover, each pseudonym can only be used by a specific application or within a specific context.

Smals Research developed three different systems for identifier pseudonymization: blind pseudonymization, structure-preserving pseudonymization and “Oblivious Join.” These systems are also useful building blocks for GDPR compliance. Kristof Verslype, the driving force behind these systems, discussed the concrete need for them within the social security and health sectors during a webinar.

On Oct. 10, 2024, Smals will give a presentation at Devoxx from 9:30 a.m. to 10:20 a.m. More information on that can be found here.

Structure-preserving pseudonymization

Software development involves several phases in separate environments: the testing phase is usually followed by the acceptance phase, and finally there is the production phase. This procedure, which is repeated with each update, is meant to ensure that everything runs flawlessly once the application goes live.

In practice, personal data of citizens are regularly used in the testing and acceptance phases. This is undesirable and inevitably involves security risks. Thanks to structure-preserving pseudonymization, citizens’ privacy is better protected in the test and acceptance phases of already existing applications.

It is common practice to regularly take a snapshot of the data in the production environment and import it into the acceptance or test environment. The table above shows a fictitious snapshot with personal data from the production environment. To improve privacy, two operations are performed before the import into the acceptance or test environment.

The first operation, pseudonymization, replaces national registry numbers with pseudonyms. These pseudonyms have the same structure as the original national registry numbers. This structure-preserving property is necessary because the application and the underlying database can only handle values with the structure of a national registry number. The second operation applies to unstructured identifiers such as first and last names: a shuffle (column-wise permutation) literally shuffles the first names and the last names. The shuffle is done locally by the organization, while the pseudonymization relies on a Smals service.
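The column-wise shuffle described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used by the organization: the function name `shuffle_columns`, the column names and the sample rows are all hypothetical.

```python
import random

def shuffle_columns(rows, columns, seed=None):
    """Return a copy of rows in which each listed column is permuted independently.

    The values of a column stay the same as a set; only their assignment
    to rows changes. All names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = [dict(row) for row in rows]  # leave the input untouched
    for column in columns:
        values = [row[column] for row in rows]
        rng.shuffle(values)
        for row, value in zip(shuffled, values):
            row[column] = value
    return shuffled

# Fictitious snapshot rows (hypothetical columns and values).
snapshot = [
    {"id": "P-001", "first_name": "An", "last_name": "Peeters", "status": "active"},
    {"id": "P-002", "first_name": "Bram", "last_name": "Janssens", "status": "retired"},
    {"id": "P-003", "first_name": "Clara", "last_name": "Maes", "status": "active"},
    {"id": "P-004", "first_name": "Dirk", "last_name": "Jacobs", "status": "student"},
]

test_rows = shuffle_columns(snapshot, ["first_name", "last_name"], seed=42)
```

Because first names and last names are permuted separately, realistic-looking names remain available in the test environment, while a given row no longer necessarily carries the name of the citizen it describes.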

Communication between the application in acceptance and the outside world must remain possible. This is done through a proxy which, with the help of Smals’ pseudonymization service, replaces the pseudonyms in outgoing messages with the original national registry numbers, and replaces the national registry numbers in incoming messages with pseudonyms.

Smals’ structure-preserving pseudonymization service has been developed as a generic service, so many organizations can use it for a wide range of applications. Which operations are needed locally depends on the specific application. A shuffle sufficed for the customer Smals works with, but this may not always be the case. The more generic aspect, the conversion of identifiers into pseudonyms and vice versa, is handled by the Smals service.

If that service used traditional technology, it would have to maintain a table of pairs of national registry numbers and pseudonyms for each test or acceptance environment that uses it. The result would be a large number of tables, each potentially containing hundreds of thousands of rows or more. Each of those tables would have to be stored and maintained separately, which requires a lot of storage. Moreover, the tables change constantly as citizens are regularly added or removed. This adds complexity and poses storage and synchronization challenges.

Smals’ service does not work with tables, but with compact 32-byte cryptographic keys, which remain constant over a long period of time. Pseudonyms are calculated on the fly from incoming national registry numbers (or vice versa). The keys can be protected better than large tables by storing and managing them in Hardware Security Modules (HSMs), which are built specifically to secure such cryptographic keys. Moreover, the entity that stores the protected data is different from the organization that manages the key (separation of duties). This results in an additional level of security. An additional benefit of a centralized service is that organizations using it do not have to worry about key management.
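To make the idea of a keyed, table-free conversion concrete, here is a toy sketch in Python. It is explicitly not the algorithm used by Smals, nor a standardized format-preserving cipher: it is a small Feistel construction over decimal digits, built on HMAC-SHA256, that maps a digit string to another digit string of the same length and back, using only a 32-byte key. It assumes the identifier is a fixed-length string of decimal digits and ignores any further internal structure an identifier may have. The function names `pseudonymize` and `identify` are hypothetical.

```python
import hashlib
import hmac
import secrets

ROUNDS = 10  # even, so both halves end with their original widths

def _round_value(key: bytes, round_no: int, value: int, modulus: int) -> int:
    """Keyed round function: HMAC-SHA256 reduced to the width of the target half."""
    message = f"{round_no}:{value}".encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus

def _widths(digits: str):
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of at least two decimal digits")
    left = len(digits) // 2
    return left, len(digits) - left

def pseudonymize(key: bytes, identifier: str) -> str:
    """Map a digit string to a digit string of the same length (toy Feistel network)."""
    u, v = _widths(identifier)
    a, b = int(identifier[:u]), int(identifier[u:])
    for i in range(ROUNDS):
        modulus = 10 ** (u if i % 2 == 0 else v)
        a, b = b, (a + _round_value(key, i, b, modulus)) % modulus
    return f"{a:0{u}d}{b:0{v}d}"

def identify(key: bytes, pseudonym: str) -> str:
    """Inverse of pseudonymize: run the rounds backwards with the same key."""
    u, v = _widths(pseudonym)
    a, b = int(pseudonym[:u]), int(pseudonym[u:])
    for i in reversed(range(ROUNDS)):
        modulus = 10 ** (u if i % 2 == 0 else v)
        a, b = (b - _round_value(key, i, a, modulus)) % modulus, a
    return f"{a:0{u}d}{b:0{v}d}"

key = secrets.token_bytes(32)              # one compact key instead of a lookup table
pseudonym = pseudonymize(key, "12345678901")  # made-up digit string, not a real number
original = identify(key, pseudonym)        # converts back with the same key
```

The point of the sketch is the storage model: nothing is stored per citizen, and a different key per environment would yield unrelated pseudonyms. A real deployment would rely on a vetted scheme and keep the key inside an HSM rather than in application memory.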

Structure-preserving pseudonymization is very simple in terms of infrastructure, storage and synchronization.
Kristof Verslype, Smals Research

This structure-preserving pseudonymization can significantly increase citizen privacy in the acceptance and test environment of already existing applications, but is still in an experimental phase. Smals is currently looking at how to bring this service live.

eHealth blind pseudonymization

For new applications, Smals goes one step further with privacy by design: the privacy of citizens is taken into account from the moment the application is designed. As a result, personal data are protected by pseudonyms not only in the test and acceptance environments, but also in production. This was the approach taken for eHealth’s blind pseudonymization service, which physicians already use today to protect medical personal data, among other things in the context of referral prescriptions. With a referral prescription, a healthcare provider prescribes care other than pharmaceutical care to the patient (for example, care by a physical therapist or a home care nurse).

As with structure-preserving pseudonymization, there is a separation of duties: the pseudonymization service knows the pseudonymization keys but does not have access to personal data. The back-end service sees the pseudonymized personal data but does not have access to the pseudonymization keys.

One approach used in the past was full encryption of personal data, with the exception of the national registry number. On the one hand, this system is very secure; on the other hand, such full encryption limits functionality. For example, it was not possible to validate input, extract statistics or do analysis on it.

To still maintain these functionalities without sacrificing privacy, a new approach was outlined. The figure above illustrates the scenario where a physician issues a prescription and the prescription data is kept under a pseudonym by a central service.

To prevent the pseudonymization service from profiling citizens based on metadata, incoming and outgoing national registry numbers and pseudonyms are hidden from that service. This is accomplished with the purple operations (blind and unblind). Blinding is a temporary encryption with a one-time key known only to the physician. The physician blinds the national registry number and sends it to the pseudonymization service. That service converts the blinded identifier into a blinded pseudonym (pseudonymise) and sends the result back to the physician, who is the only one who can unblind it.
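The blind, pseudonymise and unblind steps can be illustrated with a toy example in Python. The article does not specify the cryptography behind eHealth's service, so this sketch assumes one well-known way to obtain the property: blinding by modular exponentiation, where the one-time key is a random exponent. The modulus, the hashing of the identifier and all function names are assumptions made for illustration. In particular, this toy hashes the identifier, so the resulting pseudonym cannot be converted back, and the small multiplicative group used here is not a secure choice.

```python
import hashlib
import math
import secrets

P = 2**127 - 1   # a Mersenne prime, toy-sized modulus (assumption, not secure in practice)
ORDER = P - 1    # order of the multiplicative group modulo P

def to_group_element(identifier: str) -> int:
    """Hash an identifier to a number in the range 1..P-1."""
    digest = hashlib.sha256(identifier.encode()).digest()
    return 1 + int.from_bytes(digest, "big") % (P - 1)

def blind(identifier: str):
    """Physician side: hide the identifier under a fresh one-time exponent r."""
    while True:
        r = secrets.randbelow(ORDER - 2) + 2
        if math.gcd(r, ORDER) == 1:  # r must be invertible modulo the group order
            break
    return pow(to_group_element(identifier), r, P), r

def pseudonymise_blinded(blinded: int, service_key: int) -> int:
    """Service side: apply the long-term key without seeing identifier or pseudonym."""
    return pow(blinded, service_key, P)

def unblind(blinded_pseudonym: int, r: int) -> int:
    """Physician side: strip the one-time exponent, leaving the pseudonym."""
    return pow(blinded_pseudonym, pow(r, -1, ORDER), P)

service_key = secrets.randbelow(ORDER - 2) + 2    # held only by the pseudonymization service
blinded, one_time_key = blind("12345678901")      # made-up identifier
blinded_pseudonym = pseudonymise_blinded(blinded, service_key)
pseudonym = unblind(blinded_pseudonym, one_time_key)
```

Because the one-time exponent is chosen freshly for every request, the service receives a different value each time, even for the same citizen, while the unblinded result is always the same pseudonym. The one-time key can be discarded right after unblinding.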

While this is a strong model in itself, the pseudonyms should not be visible to the physician either. That would be an additional identification risk: ideally, each party sees only what it strictly needs to see. Therefore, two more steps are added (orange boxes). The pseudonymization service encrypts the blinded pseudonym and sends it back to the physician. The physician can only remove the blinding, which leaves an encrypted pseudonym. That is then sent to the central prescription service, which is the only one that can decrypt it. In summary, the physician sees only the national registry number, the service that stores personal data sees only the pseudonym, and the pseudonymization service sees neither.

To add another layer of security here, the pseudonymization service adds context to the blinded pseudonym before it is encrypted. This is verified by the underlying service (blue boxes). At a minimum, the pseudonymization service will add a timestamp. This allows the encrypted pseudonyms received by the healthcare provider to be used only for a limited time, which prevents misuse.

Blind pseudonymization thus ensures that each party can access only the strictly necessary information and no information leaks to the pseudonymization service. Moreover, on the healthcare provider’s side, this system does not use keys that need to be kept for long periods of time. This is a big advantage, since adequate key management is difficult and no one really likes to do it. The system offers a high level of security and can play an important role in reducing data breaches.

Oblivious Join

For research purposes, for example at universities, personal data originating from different sources must regularly be cross-referenced and pseudonymized. Pseudonymization is a necessary measure that helps prevent a researcher from linking personal data to natural persons. Such cross-referencing projects can become quite complex, especially when the data sources themselves cannot autonomously determine which citizens to provide data on. For example, in an earlier cross-referencing project, the Belgian Cancer Registry had to provide data on citizens with multiple sclerosis (MS), without itself knowing who has that disease.

To handle such cross-referencing projects in an elegant, cost-effective and secure way, Smals developed “Oblivious Join.” No data leaks in the process: the data sources do not learn any new personal data by participating in such projects, and the data receiver (the collector) learns only the pseudonymized data that is necessary for the research. The collector is not the researcher, but should be thought of as an intermediary party.

Oblivious Join is also distributed: neither a central pseudonymization service nor a coordinating central party is involved in carrying out cross-referencing projects. Everything is done through collaboration between the data sources and the collector.

Oblivious Join works in three steps. Step one is an automated protocol that creates cryptographic agreements between data sources. After that first step, the data sources can start pseudonymizing and encrypting data on their own that could potentially be relevant to the research project from their point of view, and send the result to the collector. Thanks to the agreements made in the first step, we can guarantee that in the third and final step, the collector can only decrypt and link pseudonymized data that are needed in the context of the research project.

In the earlier example, this means that the Belgian Cancer Registry first makes agreements with a data source that does know who has MS. It then pseudonymizes and encrypts data on all citizens who received a cancer diagnosis and sends the result to the collector. The collector can decrypt only those records coming from the Belgian Cancer Registry that relate to people with MS.
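The three steps can be mimicked with a deliberately simplified toy in Python. The article does not describe the cryptographic protocol behind Oblivious Join, so nothing below should be read as Smals' design. The sketch assumes that step one leaves the two data sources with a shared secret the collector never sees, derives pseudonyms and per-citizen keys from that secret with HMAC, and uses a toy XOR stream cipher as a stand-in for real encryption. All names, identifiers and records are made up.

```python
import hashlib
import hmac
import secrets

def derive(secret: bytes, label: str, identifier: str) -> bytes:
    """Derive a pseudonym or a per-citizen key from the sources' shared secret."""
    return hmac.new(secret, f"{label}:{identifier}".encode(), hashlib.sha256).digest()

def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy cipher (XOR with a SHA-256 counter keystream); the same call decrypts."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(x ^ y for x, y in zip(data, stream))

# Step 1 (assumed outcome): the data sources share a secret unknown to the collector.
shared_secret = secrets.token_bytes(32)

# Step 2: each source works on its own data (hypothetical identifiers and records).
ms_source = {"ID-1", "ID-3", "ID-5"}  # knows who has MS
cancer_registry = {"ID-1": "record A", "ID-2": "record B",
                   "ID-3": "record C", "ID-4": "record D"}

# The source that knows the selection releases a key only for the selected citizens.
released_keys = {derive(shared_secret, "pseudonym", i): derive(shared_secret, "key", i)
                 for i in ms_source}
# The registry cannot select, so it encrypts a record for every citizen it holds.
ciphertexts = {derive(shared_secret, "pseudonym", i):
               xor_stream(derive(shared_secret, "key", i), record.encode())
               for i, record in cancer_registry.items()}

# Step 3: the collector decrypts what it holds a key for and drops the rest.
def collect(keys, encrypted):
    return {p.hex(): xor_stream(keys[p], c).decode()
            for p, c in encrypted.items() if p in keys}

linked = collect(released_keys, ciphertexts)
```

In this toy, the collector ends up with the registry records of the two citizens who appear in both sources, indexed by pseudonym, and cannot read the other ciphertexts. The registry never learns who has MS, and the collector never sees an identifier.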

In an Oblivious Join protocol, the data sources see only the identifiers (national registry numbers) and no pseudonyms, while the collector sees only the pseudonyms and not the identifiers. No pseudonymization service is involved. Note that the collector plays an important role here, and is therefore expected to remove irrelevant ciphertexts immediately. Furthermore, the collector performs additional checks on the data to verify that the identification risk of individual records is not too great when the research is conducted. It can then give the researcher access to the data in a controlled manner.

How does this work in practice? Both the data sources and the collector first download a piece of software: the Oblivious Join client. In addition, they receive a project-specific JSON file from the coordinating party, which contains all relevant information to execute the protocol automatically. Each data source additionally creates a CSV file containing all identified personal data potentially relevant to the study. The data sources then provide both the JSON file and the CSV file to the client, while the collector provides only the JSON file as input. The protocol is then executed, which means that the different parties communicate with each other. As a result of the protocol, the collector obtains the minimum necessary pseudonymized personal data required in the context of the study (the black CSV file in the figure). This bundled CSV file contains the strictly necessary pseudonymized data coming from the three data sources relevant to the study.

In summary, Oblivious Join is privacy-friendly, secure, distributed and easy to use. It constitutes a secure way to unlock data, but is still in the experimental phase today.

From experimental to useful today

The projects above are at different stages, from experimental to already used in practice. There is no one-size-fits-all solution when it comes to pseudonymization techniques. Which solution is most suitable depends on the concrete requirements.

Smals previously published two articles about pseudonymization on ITdaily. Those eager to dive deeper into the eHealth project can go here. There is also format-preserving encryption (FPE), which was standardized by NIST. You can read all about that here.

This is an editorial in collaboration with Smals.