How Do You Anonymise or Pseudonymise Open Data?
For a few months now, all you have been hearing about is the GDPR, the European Union's General Data Protection Regulation. Everyone is asking how your company is going to comply... without really understanding what it is all about. Suppliers and consultants are becoming ever more inventive in inviting you to events on the subject... but all they do is skim the surface.
When it comes to IT security solutions, you have already implemented encryption and carefully controlled access to your information system. Now you're looking at anonymisation, which comes up again and again in your discussions, along with pseudonymisation: how do you go about it? What organisation should be put in place?
A belated awareness
It is astonishing that it took the arrival of a major regulatory constraint to bring renewed attention to a discipline that has existed for so long.
It is therefore legitimate to ask "Why?"
"Why have we waited so long?"
"Why hasn't this already been done?"
... when it is so obvious to every customer that you should not "play" with their personal information.
There are lots of explanations, but in the end they only interest those who live in the past.
So let's look at the present picture: companies are sharing production data (the data they need for their day-to-day business) to meet a variety of needs:
- Copy the entire production database to enable developers and administrators to test upgrades, patches and version changes,
- Increase agility and competitiveness by developing new functionalities and analytical models in an environment as close as possible to production,
- Analyse trends (consumption, behaviour, medical research, etc.) by sharing data with consultants and researchers so that they can apply statistical or Machine Learning models.
As a result, billions of pieces of customer data, however sensitive, are leaving production environments unprotected.
The GDPR, an accelerator for the accountability of all players
Recent studies by analysts into data confidentiality tend to show that companies have no way of knowing whether data leaving a production environment has been compromised.
I think the "Why?" becomes obvious: regulatory constraints aside, the person whose personal data is being used, without knowing whether it will be shared or compromised, is you, it's me, it's our children...
The protection of privacy is a fundamental right guaranteed by the Universal Declaration of Human Rights.
This is why we must all, as company directors and IT system managers, implement this mechanism to ensure that our data is used for justified and limited purposes.
Identifying the right means of protection
The GDPR is not the answer to the "Why?" question, but it may be the beginning of the answer to the "How?"
Firstly, the regulatory framework and, above all, the financial penalties and other fines associated with it, are a lever for financing the implementation of the anonymisation project.
Drawing up a register of data processing operations, as required by the GDPR, is a good way of pinpointing the exact location of personal data in the information system... so that you can quickly identify what needs to be anonymised.
Secondly, the Regulation urges us to think first and foremost about the need to process personal data and advocates the principle of data minimisation: "what is necessary with regard to the purposes for which it is processed".
For example:
Is it really necessary to have all the production data in the development, qualification or training environments? Ultimately, isn't it too costly and too risky?
Data sampling is a second response: reducing the risk surface by selecting (intelligently) a representative set of data, which can then be anonymised according to business needs.
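To make the sampling idea concrete, here is a minimal Python sketch, using hypothetical `customers` and `orders` tables: it selects a fraction of the customers, then keeps only the orders that reference a sampled customer, so that referential integrity survives the cut. A real service would of course work against actual databases and many more relationships.

```python
import random

# Hypothetical production tables: customers and the orders that reference them.
customers = [{"id": i, "name": f"Customer {i}"} for i in range(1, 101)]
orders = [{"order_id": o, "customer_id": random.randint(1, 100)}
          for o in range(1, 501)]

def sample_with_integrity(customers, orders, fraction=0.1, seed=42):
    """Select a fraction of customers, then keep only the orders that
    reference a sampled customer, preserving referential integrity."""
    rng = random.Random(seed)
    k = max(1, int(len(customers) * fraction))
    sampled = rng.sample(customers, k)
    sampled_ids = {c["id"] for c in sampled}
    related_orders = [o for o in orders if o["customer_id"] in sampled_ids]
    return sampled, related_orders

subset_customers, subset_orders = sample_with_integrity(customers, orders)
# Every order kept in the subset points to a customer also in the subset.
assert all(o["customer_id"] in {c["id"] for c in subset_customers}
           for o in subset_orders)
```

The "intelligent" part of real sampling, choosing a set that is statistically representative of the business, is exactly what the selection step above would need to replace.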
The regulation also proposes simplified mechanisms, such as pseudonymisation, which consists of replacing personal data with a pseudonym, masking the link to the original individual (provided the mapping between pseudonym and individual is neither trivial nor stored alongside the data).
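One common way to implement such a pseudonym, sketched below under the assumption that a keyed hash is acceptable for your use case, is an HMAC over the identifier. The secret key is a placeholder here; it must live in a separate vault, since anyone holding it can re-identify individuals, which is why the GDPR still treats pseudonymised data as personal data in the controller's hands.

```python
import hashlib
import hmac

# Placeholder key: in practice, fetch it from a secrets vault or HSM and
# never store it next to the pseudonymised data.
SECRET_KEY = b"replace-with-a-key-from-your-vault"

def pseudonymise(value: str) -> str:
    """Replace an identifier with a deterministic keyed pseudonym.
    Determinism means the same input always yields the same pseudonym,
    so joins between applications keep working."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

print(pseudonymise("alice@example.com"))  # same input, same pseudonym
print(pseudonymise("alice@example.com"))
print(pseudonymise("bob@example.com"))    # different input, different pseudonym
```

Unlike a plain hash, the keyed construction resists dictionary attacks on guessable identifiers such as email addresses, as long as the key stays secret.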
How can data anonymisation be implemented?
Having said that, none of these avenues gives companies any guidance as to how they should organise themselves. This may well be the Gordian knot of anonymisation:
- "Should I anonymise application by application?"
- "What should be done with applications that share the same individual's personal data?"
- "What organisational structure will meet business requirements?"
- "Will I lose agility in the evolution of the information system?"
Clearly, organisation is the keystone of the anonymisation project and determines its success.
You need to implement an "industrialised anonymisation service" capable of meeting the needs of all the IT teams, who will be the most affected:
- have the capacity to address all technologies (while respecting their licensing and support rules, of course);
- offer high-performance, intelligent sampling: you don't just look at the first 1,000 lines... you look for a representative set of data in a data source and in subsidiary repositories (to guarantee referential integrity between applications);
- guarantee high-performance service levels: offer on-demand or automated anonymisation;
- provide a library of complete anonymisation formats (random replacement, data deletion, rewriting, etc.).
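The "library of formats" idea can be sketched as a table of masking functions, one per technique named above. The field names, reference list and rules here are invented for illustration; a real service would ship many more formats and discover the sensitive fields itself.

```python
import random
import re

rng = random.Random(0)  # seeded so the example is reproducible

# A tiny library of masking "formats": each maps an input value to a masked one.
FIRST_NAMES = ["Alex", "Camille", "Dominique", "Morgan", "Sacha"]

def random_replacement(value):
    """Substitute a plausible value drawn from a reference list."""
    return rng.choice(FIRST_NAMES)

def deletion(value):
    """Remove the value entirely."""
    return None

def rewriting(value):
    """Keep the shape of the value but scramble its digits."""
    return re.sub(r"\d", lambda _: str(rng.randint(0, 9)), value)

# Hypothetical mapping of sensitive fields to their masking format.
RULES = {"first_name": random_replacement, "ssn": deletion, "phone": rewriting}

record = {"first_name": "Marie", "ssn": "1 85 12 75 123 456",
          "phone": "+33 6 12 34 56 78"}
masked = {field: RULES.get(field, lambda v: v)(value)
          for field, value in record.items()}
print(masked)
```

Keeping each format as a small, named function is what lets the service apply "the right anonymisation to the right set of data" consistently across applications.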
This anonymisation service will then bring about positive changes in the working methods of IT teams, with minimal impact on their day-to-day work.
Choosing the right tools
As you will have realised, this subject is not ultimately driven by technology. But what about the tools?
The literature will help you understand the various anonymisation models, such as 'k-anonymity', 'l-diversity', 't-closeness' or 'differential privacy', each offering a measurable effectiveness and level of protection...
These give specialists the means to apply the right anonymisation to the right set of data.
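As a taste of what these models measure, here is a short sketch of the k-anonymity check on a toy, invented patient table whose quasi-identifiers have already been generalised (truncated postcode, age band): k is the size of the smallest group of records sharing the same quasi-identifier values.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k of the dataset: the size of the smallest group of
    records sharing the same quasi-identifier values. Each individual is
    then indistinguishable from at least k-1 others on those fields."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Invented, already-generalised dataset for illustration.
patients = [
    {"zip": "750**", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "750**", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "920**", "age_band": "40-49", "diagnosis": "flu"},
    {"zip": "920**", "age_band": "40-49", "diagnosis": "diabetes"},
]

print(k_anonymity(patients, ["zip", "age_band"]))  # -> 2
```

l-diversity and t-closeness refine this by also constraining the sensitive values (here, the diagnosis) within each group, which a fuller tool would check in the same pass.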
Instead, I'd like to focus on an industrial anonymisation solution that guarantees:
- multi-source and multi-target connectivity, so as to be the central, unifying tool for the company, guaranteeing anonymisation that respects inter-application referential integrity;
- a wizard, enabling the construction of anonymisation workflows tailored to the dataset (discovery of sensitive data in the source, proposal of suitable algorithms, preview of results, etc.);
- the ability to automate anonymisation chains to guarantee optimised service levels (night-time processing, on-demand dataset refreshing, etc.);
- ease of use, so that the team in charge of the anonymisation service can quickly and easily increase its skills and capacity.
Of course, the solution must itself comply with GDPR best practice: encryption, access control for privileged accounts, monitoring... because the anonymisation infrastructure will sit at the crossroads of personal data flows.
Oracle's "Data Masking Factory" initiative meets these requirements and has become an agnostic, high-performance solution for providing anonymisation services.
Personal data is no laughing matter
2018 is the year of the paradigm shift: the realisation that it is our own personal data that companies have been handling too lightly.
Everyone, at their own level, must understand that playing games with data is over.
The GDPR is a reminder of good practice, in which anonymisation plays a key role.
More than just a technical project, this calls for an effective organisation and the right tools to provide business users with a high-performance anonymisation service.