Hash Personal Identifiable Information (PII) in your ELT pipelines
Falk
Posted on August 15, 2022
What is Personal Identifiable Information?
Personal Identifiable Information (PII) is defined as: Any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. If you collect, use, or store PII of people in the European Union, you have to comply with the GDPR and should therefore protect your customers' personal data.
So what, I just delete it?
While you could of course delete all PII from your data before adding it to your file storage or database, the data users in your organization might have something to say about that. Analysts and Data Scientists often have to work with multiple data sources and find connections between them in order to generate insights.
One example is finding a customer across various sales platforms such as Amazon, Shopify, or eBay. These platforms don't share a common unique identifier for each user, so an alternative such as an email address, name, or phone number must be used. That is no longer possible, however, once we've deleted this personal information. So let's add a "Hashing" step to our ELT pipeline instead.
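To make the idea concrete, here is a minimal sketch of matching a customer across two platforms using only hashed emails. It does not use ByeByePii; it assumes plain SHA-256 from Python's standard library, and the order records, the `hash_email` and `pseudonymize` helpers, and the lowercase/strip normalization are all hypothetical choices made for illustration.

```python
import hashlib


def hash_email(email: str) -> str:
    """Return the SHA-256 hex digest of an email address.

    Stripping whitespace and lowercasing is an assumption made here so that
    'Jane@Example.com ' and 'jane@example.com' produce the same digest.
    """
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pseudonymize(orders):
    """Replace the raw email with its digest before the record is stored."""
    return [
        {"order_id": o["order_id"], "email_hash": hash_email(o["email"])}
        for o in orders
    ]


# Hypothetical orders from two platforms; the email is the only shared field.
amazon_orders = [
    {"order_id": "A-1", "email": "jane@example.com"},
    {"order_id": "A-2", "email": "bob@example.com"},
]
shopify_orders = [
    {"order_id": "S-9", "email": "Jane@Example.com "},
]

amazon_hashed = pseudonymize(amazon_orders)
shopify_hashed = pseudonymize(shopify_orders)

# Join on the digest: no raw email is needed to find the same customer.
amazon_by_hash = {o["email_hash"]: o["order_id"] for o in amazon_hashed}
matches = [
    (amazon_by_hash[o["email_hash"]], o["order_id"])
    for o in shopify_hashed
    if o["email_hash"] in amazon_by_hash
]
print(matches)  # [('A-1', 'S-9')]
```

The stored records no longer contain the email itself, yet the two orders placed by the same person can still be linked.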
Introducing ByeByePii
ByeByePii is a Python package for hashing personal identifiable information. It was built with a focus on making Data Lakes that store JSON files GDPR-compliant. It's a simple package with two features:
- Analyzing Python Dictionaries in order to identify PII
- Hashing PII in a given Python dictionary
Binary installers for the latest released version are available at the Python Package Index (PyPI):
pip install ByeByePii
Analyzing a JSON and creating a list of keys to hash
To avoid having to look manually for all the keys in a Python dictionary, we can use the analyzeDict function.
import byebyepii
import json

if __name__ == '__main__':
    # Loading local JSON file
    with open('data.json') as json_file:
        data = json.load(json_file)

    # Analyzing the dictionary and creating our hash list
    key_list, subkey_list = byebyepii.analyzeDict(data)
$ python3 analyzeDict.py
Add BuyerInfo - BuyerEmail to hash list? (y/n) y
Add SalesChannel to hash list? (y/n) n
Add OrderStatus to hash list? (y/n) n
Add PurchaseDate to hash list? (y/n) n
Add ShippingAddress - StateOrRegion to hash list? (y/n) y
Add ShippingAddress - PostalCode to hash list? (y/n) y
Add ShippingAddress - City to hash list? (y/n) n
Add ShippingAddress - CountryCode to hash list? (y/n) n
Add LastUpdateDate to hash list? (y/n) n
Keys to hash: ['BuyerInfo', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress']
Subkeys to hash: ['BuyerEmail', 'StateOrRegion', 'PostalCode']
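For readers wondering how such a key walk can work in principle, the following is a rough sketch and not ByeByePii's actual implementation. The helper `list_candidate_keys` is hypothetical; it assumes the dictionary is nested at most one level deep, as in the order example, and it only prints the prompts instead of reading answers.

```python
def list_candidate_keys(record: dict):
    """Yield (key, subkey) pairs for a dictionary nested at most one level.

    Top-level values that are not dictionaries are yielded as (key, None).
    """
    for key, value in record.items():
        if isinstance(value, dict):
            for subkey in value:
                yield key, subkey
        else:
            yield key, None


# Shortened, hypothetical version of the order record used in this post.
order = {
    "BuyerInfo": {"BuyerEmail": "test@test.com"},
    "SalesChannel": "Website",
    "ShippingAddress": {"PostalCode": "DY9 0TH", "City": "STOURBRIDGE"},
}

candidates = list(list_candidate_keys(order))
for key, subkey in candidates:
    label = key if subkey is None else f"{key} - {subkey}"
    print(f"Add {label} to hash list? (y/n)")
```

An interactive version would read the answer to each prompt and collect the accepted pairs into the lists passed to the hashing step.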
Hashing PII in a given JSON
Using the key lists we just created we can proceed to hash the PII in the dictionary.
import byebyepii
import json

if __name__ == '__main__':
    # Loading local JSON file
    with open('data.json') as json_file:
        data = json.load(json_file)

    # Hashing the PII
    keys_to_hash = ['BuyerInfo', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress']
    subkeys_to_hash = ['BuyerEmail', 'StateOrRegion', 'PostalCode']
    hashed_pii = byebyepii.hashPii(data, keys_to_hash, subkeys_to_hash)

    # Writing the updated JSON file
    with open('hashed_data.json', 'w') as outfile:
        json.dump(hashed_pii, outfile)
Before:
{
    "BuyerInfo": {
        "BuyerEmail": "test@test.com"
    },
    "EarliestShipDate": "2022-01-01T23:59:59Z",
    "SalesChannel": "Website",
    "OrderStatus": "Shipped",
    "PurchaseDate": "2022-01-01T23:59:59Z",
    "ShippingAddress": {
        "StateOrRegion": "West Midlands",
        "PostalCode": "DY9 0TH",
        "City": "STOURBRIDGE",
        "CountryCode": "GB"
    },
    "LastUpdateDate": "2022-01-01T23:59:59Z"
}
After:
{
    "BuyerInfo": {
        "BuyerEmail": "037a51cb9162f51772eaf6b0fb02e1b5d0bf8219deacf723eeedc162209bfd33"
    },
    "EarliestShipDate": "2022-01-01T23:59:59Z",
    "SalesChannel": "Website",
    "OrderStatus": "Shipped",
    "PurchaseDate": "2022-01-01T23:59:59Z",
    "ShippingAddress": {
        "StateOrRegion": "08fa57d00de1936ebea7aeaf8e36d04510a5d885cfaa4f169c2b010d36ccaca4",
        "PostalCode": "714f02c01e20988ee273776dc218f44326c2f5839618b0c117413b0cc7d91701",
        "City": "STOURBRIDGE",
        "CountryCode": "GB"
    },
    "LastUpdateDate": "2022-01-01T23:59:59Z"
}
Since the string test@test.com will always be hashed to 037a51cb9162f51772eaf6b0fb02e1b5d0bf8219deacf723eeedc162209bfd33, it is still perfectly usable as a cross-platform identifier.
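If you want a feel for what a hashing step like this involves, here is a minimal sketch using only the standard library. It is not ByeByePii's code: the `hash_fields` and `_sha256` helpers are hypothetical, and plain unsalted SHA-256 is an assumption, so the digests it produces will not necessarily match the ones shown above.

```python
import copy
import hashlib


def _sha256(value) -> str:
    """Return the SHA-256 hex digest of a value's string form."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def hash_fields(record: dict, fields) -> dict:
    """Return a copy of `record` with the given (key, subkey) fields hashed.

    `fields` is an iterable of (key, subkey) tuples; a subkey of None means
    the top-level value itself. Fields missing from the record are skipped.
    """
    result = copy.deepcopy(record)  # leave the caller's data untouched
    for key, subkey in fields:
        if key not in result:
            continue
        if subkey is None:
            result[key] = _sha256(result[key])
        elif isinstance(result[key], dict) and subkey in result[key]:
            result[key][subkey] = _sha256(result[key][subkey])
    return result


order = {
    "BuyerInfo": {"BuyerEmail": "test@test.com"},
    "ShippingAddress": {"PostalCode": "DY9 0TH", "City": "STOURBRIDGE"},
}
fields = [("BuyerInfo", "BuyerEmail"), ("ShippingAddress", "PostalCode")]

hashed = hash_fields(order, fields)
hashed_again = hash_fields(order, fields)

# Hashing is deterministic, so repeated runs yield identical records.
print(hashed == hashed_again)  # True
```

Because the function works on a deep copy, the original record stays intact, which is handy when the raw data is still needed elsewhere in the same pipeline run before being discarded.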