
Andrey Kozichev

Fear and loathing in processing PII data

Why does it matter?

Personally Identifiable Information (PII) is information that, when used alone or in combination with other relevant data, can identify an individual. PII may include particular identifiers, such as a passport number, or a combination of attributes that can collectively identify an individual, such as race, date of birth, postal code, etc.

Since the General Data Protection Regulation (GDPR) came into effect on May 25, 2018, companies operating within the EU are no longer permitted to store their customers' Personally Identifiable Information unless they have a lawful basis for doing so, such as the data being necessary for providing their services. Similar legislation has been implemented in many non-EU countries, making the storage and processing of PII a global concern. Organisations must be able to demonstrate compliance with GDPR principles and document their data processing activities.

How do different companies handle PII?

Operational Data Stores often contain the majority of Personally Identifiable Information. This is the space where customers can directly access and update their details.

Organisations employ various techniques to safeguard PII in operational data stores, such as restricting access, encryption, or requiring additional verifications for access.

As data travels further through the platform, the need for accurate information decreases, enabling us to explore different approaches when moving data around while maintaining integrity and complying with GDPR.

Let's go through the main use cases for extracting data from operational data stores.

Data Warehouse

Organisations build Data Warehouses (DWH) to drive their business analytics and make decisions based on cumulative statistics rather than individual records or events. Personally Identifiable Information is not needed there, so many organisations choose not to send it to the Data Warehouse in the first place.

PII can be removed, obfuscated, hashed or masked as part of the ETL process.

This work can take place in the 'staging' area, so that the clean data presented for analysis has all PII stripped out. Alternatively, the data can be masked at query time using 'views', with access to the raw data being restricted.
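As a minimal sketch of such a transformation step, the Python below drops, hashes or masks fields in a single record. The field names and the `DROP_FIELDS`, `HASH_FIELDS` and `MASK_FIELDS` sets are hypothetical; in a real pipeline they would come from your own schema, and the logic would more likely live in your ETL tool than in hand-written code.

```python
import hashlib

# Hypothetical classification of fields; not taken from any real schema.
DROP_FIELDS = {"passport_number"}
HASH_FIELDS = {"email"}
MASK_FIELDS = {"phone"}


def mask(value: str, visible: int = 2) -> str:
    """Replace all but the last `visible` characters with '*'."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def scrub(record: dict) -> dict:
    """Return a copy of `record` with PII removed, hashed or masked."""
    clean = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue  # removed entirely
        if key in HASH_FIELDS:
            clean[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        elif key in MASK_FIELDS:
            clean[key] = mask(str(value))
        else:
            clean[key] = value  # non-PII passes through unchanged
    return clean
```

Note that a plain, unsalted hash of a low-entropy value such as a phone number can be reversed by brute force, so hashing on its own may not be enough for every field.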

Testing

Creating a test dataset is another popular example of when PII gets in the way. Sometimes we have to use live data for testing, either because it's not possible to synthesise the data or because synthesised data does not cover all the edge cases that real data does. When this happens, the Data Team can anonymise the live data, i.e. remove all PII, before it is used for testing.

Some form of Data Pipeline can be used for removing the PII. However, creating a test dataset is much harder than removing PII for a DWH: we cannot simply delete or mask the sensitive fields, because the data must still be useful for testing.

Often we have to use special techniques like anonymisation, partial reduction of the data, or deterministic hashing, which maps the same input to the same output so that a value stays consistent (and joinable) across all systems.
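One way to get values that stay the same across systems is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; the function name `pseudonymise` and the normalisation step are assumptions made for illustration, and the secret key would need to be managed outside the test environment.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes) -> str:
    """Map a value to a stable token: same input and key give the same output."""
    # Normalise first so "Alice@Example.com " and "alice@example.com" agree.
    normalised = value.strip().lower()
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because every system that shares the key produces the same token for the same input, foreign-key relationships survive, while anyone without the key cannot simply hash a list of candidate values to reverse the mapping.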

Another challenge of using Live Data for Testing is the fact that you are moving your data from a higher security zone to a lower security zone. From experience, some additional and independent assurance of the Data exposure risk is required.

Troubleshooting

This use-case can be considered a specialised scenario within the 'testing' category. Anyone who has run a live service has likely encountered scenarios where bugs couldn't be reproduced in their development (DEV) environments.

Most of the time, this discrepancy is due to the data being different. In the event of a P1/P2 incident in your live service, there isn't much time to transfer data from the live to the test environment.

The answer to this is incremental offloading and anonymisation. Instead of preparing an entirely new testing dataset, which can be a lengthy and tedious process, you do this for only a subset of records.

This subset can be limited by time or even by a list of specific records, accounts, IDs, etc. Anything to reduce the volume but keep the integrity of the data.

The way to achieve this is by having a ready pipeline that can run, extract the necessary payload, anonymise it, test the anonymisation process, and ingest it into your test environments on demand.

Security assurance in this case should cover not the Data, but the Pipeline, to make sure it can maintain an acceptable level of obfuscation/anonymisation.
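A rough sketch of such an on-demand pipeline is shown below. Everything in it is hypothetical: the `PII_FIELDS` tuple, the record layout and the `load` callback stand in for whatever your platform provides. The point is the order of the stages, with the anonymisation check running before anything reaches the test environment.

```python
from typing import Callable, Iterable

PII_FIELDS = ("name", "email")  # hypothetical; would come from your schema


def extract(source: Iterable[dict], account_ids: set) -> list:
    """Offload only the records needed to reproduce the incident."""
    return [dict(row) for row in source if row["account_id"] in account_ids]


def anonymise(row: dict) -> dict:
    """Replace PII fields with stable placeholders derived from the account ID."""
    out = dict(row)
    out["name"] = f"user-{row['account_id']}"
    out["email"] = f"user-{row['account_id']}@example.invalid"
    return out


def verify(original: list, anonymised: list) -> None:
    """Fail if any original PII value survives anywhere in the output."""
    secrets = {str(row[f]) for row in original for f in PII_FIELDS if row.get(f)}
    for row in anonymised:
        for value in row.values():
            if any(secret in str(value) for secret in secrets):
                raise ValueError(f"PII leaked in account {row['account_id']}")


def run_pipeline(source, account_ids: set, load: Callable[[list], None]) -> int:
    """Extract, anonymise, verify, then load; returns the number of rows loaded."""
    subset = extract(source, account_ids)
    cleaned = [anonymise(row) for row in subset]
    verify(subset, cleaned)  # runs before anything reaches the test environment
    load(cleaned)
    return len(cleaned)
```

In this sketch, a record whose free-text field still contains the customer's email address would make `verify` raise, and nothing would be loaded.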

Implementations

With any solution, the main complexity comes not from the process of anonymisation but from the mechanism used to detect the PII. Two approaches to detecting personal information come to mind:

by using schema

by analysing content

Schema

Schema-based solutions are the easiest to implement and the most reliable. Define PII fields within data schemas in your organisation and use one of the transformation tools to remove them.

Data catalogues are naturally a great way to track such fields.
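As an illustration, the sketch below assumes a schema in which every field carries a `pii` flag, much as a data catalogue might export it. The `SCHEMA` layout and the function names are hypothetical and not tied to any particular catalogue or transformation tool.

```python
# Hypothetical schema with a per-field PII flag.
SCHEMA = {
    "customers": {
        "customer_id": {"type": "int", "pii": False},
        "full_name": {"type": "str", "pii": True},
        "date_of_birth": {"type": "date", "pii": True},
        "country": {"type": "str", "pii": False},
    }
}


def safe_fields(schema: dict, table: str) -> set:
    """Names of fields explicitly declared as non-PII for `table`."""
    return {name for name, meta in schema[table].items() if not meta["pii"]}


def strip_pii(schema: dict, table: str, record: dict) -> dict:
    """Keep only fields the schema declares as non-PII.

    Fields missing from the schema are dropped as well (fail closed),
    so an unannounced new column cannot leak through.
    """
    allowed = safe_fields(schema, table)
    return {name: value for name, value in record.items() if name in allowed}
```

Using an allow-list rather than a deny-list is a design choice in this sketch: it trades some convenience for safety when the schema lags behind the data.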

Unfortunately, this approach struggles with unstructured or semi-structured data or when schema changes occur too frequently.

If you cannot reliably track and trace your Personally Identifiable Information (PII) fields through the schema, this method falls short. In such scenarios, the only remaining option to detect PII is through content analysis.

Content analysis

Using regular expressions is one effective way to achieve this. This approach has been around for years and has proven to be very effective, especially when we know what we're looking for (such as credit card numbers).

The principle is simple:

Define the type of PII you expect in the data.
Write regular expressions for all use-cases you want to cover.
Create a negative expression to compensate for edge cases.
Apply the regex to your data and assign a score based on the number of matches.
Set up a rule to classify the data as PII or non-PII based on your score.
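The steps above can be sketched in a few lines of Python. The two patterns, the negative expressions and the threshold are simplified assumptions chosen for illustration; production-grade expressions are considerably more involved.

```python
import re

# Steps 1-2: the PII types we expect and a (simplified) regex for each.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),  # 13 to 16 digits
}

# Step 3: negative expressions that reject known false positives.
NEGATIVE = {
    "email": re.compile(r"@example\.(?:com|org|net)$", re.IGNORECASE),
    "card": re.compile(r"^(\d)(?:[ -]?\1)+$"),  # one digit repeated, e.g. 0000-0000-...
}


def score(text: str) -> dict:
    """Step 4: count the matches per PII type that survive the negative check."""
    counts = {}
    for kind, pattern in PATTERNS.items():
        hits = [m.group(0) for m in pattern.finditer(text)]
        counts[kind] = sum(1 for hit in hits if not NEGATIVE[kind].search(hit))
    return counts


def is_pii(text: str, threshold: int = 1) -> bool:
    """Step 5: classify the text as PII when the total score reaches the threshold."""
    return sum(score(text).values()) >= threshold
```

For card numbers, a Luhn checksum on each match would be a natural further filter, since a long run of digits is not necessarily a card number.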

Even though many readily available regular expressions exist for the most popular PII examples, maintaining, updating, and dealing with false positives can be very labour-intensive.

This is where AI comes in useful.

With a recent spike in the popularity of Large Language Models like ChatGPT and Machine Learning in general, this task looks like a very good application of AI power.

There are numerous AI-based SaaS solutions available that not only detect PII but also anonymise it for you. Even cloud providers like Azure and AWS have their own AI services which can be used to detect PII.

So why not use AI for everything?

There could be a few challenges with the AI approach:

Speed

Most AI-based Software as a Service (SaaS) solutions offer a REST interface. With modern datasets reaching sizes of gigabytes or even terabytes, adding even a single REST call per record or document during the pipeline run can significantly increase the overall processing time.

Price

Similar to the speed issue, cost can become a concern quickly if you apply AI indiscriminately across your entire dataset.

Can anything be done about it?

Absolutely! You can host the PII detection model yourself. It doesn't have to break the bank or be the size of ChatGPT; detecting PII is a very narrow task and can be approached in a more focused way. However, you still need to have a capable ML engineering team.

An even better solution would be to use a combination of all the above methods! This way, you can strike a balance between speed, price, and efficiency. Employ schema validation for structured data, regular expressions to search for specific items, and AI for text and document analysis, or even use AI to validate the final results before they are signed off. This is a straightforward job for a capable Data Engineering Team.
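One possible shape for such a combination is a tiered check that tries the cheapest method first and only falls back to the expensive one where needed. In the sketch below, `ml_detector` is a hypothetical callable standing in for a self-hosted model or a SaaS call, and the single email regex is a placeholder for a fuller pattern set.

```python
import re
from typing import Callable, Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_pii_fields(
    record: dict,
    schema_pii: set,
    free_text_fields: set,
    ml_detector: Optional[Callable[[str], bool]] = None,
) -> dict:
    """Map each flagged field name to the tier that flagged it."""
    flagged = {}
    for name, value in record.items():
        if name in schema_pii:
            flagged[name] = "schema"  # tier 1: free, no content inspection
        elif isinstance(value, str) and EMAIL_RE.search(value):
            flagged[name] = "regex"  # tier 2: cheap pattern match
        elif (
            ml_detector is not None
            and name in free_text_fields
            and isinstance(value, str)
            and ml_detector(value)  # tier 3: expensive, free text only
        ):
            flagged[name] = "model"
    return flagged
```

Because the model is only consulted for free-text fields that the first two tiers did not flag, the number of expensive calls stays small relative to the size of the dataset.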

Technical implementations and useful links

Data Masking in Snowflake https://docs.snowflake.com/en/user-guide/security-column-ddm-intro and dbt example https://github.com/entechlog/dbt-snow-mask

GitHub projects around pii-detection: https://github.com/topics/pii-detection

Popular regular expressions for PII detection: https://github.com/Poogles/piiregex

dbt package for data obfuscation: https://hub.getdbt.com/pvcy/dbt_privacy/latest/

Airbyte example of low-code connector removing PII: https://airbyte.com/blog/etlt-gdpr-compliance

SaaS tools for data obfuscation: https://www.tonic.ai/ , https://www.privacydynamics.io/

PII Detection on Azure: https://learn.microsoft.com/en-us/azure/ai-services/language-service/personally-identifiable-information/overview

PII Detection on AWS: https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html

Using a combination of schema validation for structured data, regular expressions to search for specific items, and AI for text and document analysis can yield fast, cost-efficient, and reliable results when removing Personally Identifiable Information (PII) from data.