A while back a colleague asked for my help anonymizing some data. The data set he provided piqued my interest and led to some of what we are going to discuss today.

When we talk about anonymizing data, we mean removing personal information, or any information that could be used to identify or track an individual. This is often referred to as personally identifiable information (PII). It includes, but is not limited to: names, emails, addresses, phone numbers, geolocation, social security numbers (SSNs), vehicle identification numbers (VINs), account numbers, identification numbers and credit card information.

This document will walk you through how to build Python queries that extract these items from your data set. In terms of scope, it adopts a regex approach to identify and replace emails, sensitive numbers, VINs and addresses in a data set or string.

Emails

Emails are extremely common and are used in almost every application. In fact, when most people give out their email addresses, they do not realize they are giving out a wealth of information. While emails may be useful to data scientists, monetization and other purposes that require de-identified data may mean we need to get rid of the email data while leaving the rest of the data intact. A simple regex can help with this problem.

For instance:

Email Example 1, Created by Author

Note that in this case the regex matches regardless of letter case, because \S matches any non-whitespace character. However, a space inside the email, such as snubholic @yahoo.com, would change the results.
Code: re.findall(r"\S+@\S+", dn)
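
To make the one-liner above concrete, here is a minimal sketch of finding and then replacing emails with that same pattern. The sample strings, the helper names find_emails and redact_emails, and the "[EMAIL]" placeholder are all assumptions made for illustration; they are not taken from the original data set.

```python
import re

# Same pattern as above, written as a raw string so "\S" reaches the regex engine intact.
EMAIL_PATTERN = re.compile(r"\S+@\S+")


def find_emails(text):
    """Return every run of non-whitespace characters that contains an @ sign."""
    return EMAIL_PATTERN.findall(text)


def redact_emails(text, placeholder="[EMAIL]"):
    """Replace each match with a placeholder (the placeholder text is an assumption)."""
    return EMAIL_PATTERN.sub(placeholder, text)


# Hypothetical sample text, not the author's data set.
dn = "Contact Snubholic@Yahoo.com or jane.doe@example.org for details"
print(find_emails(dn))    # ['Snubholic@Yahoo.com', 'jane.doe@example.org']
print(redact_emails(dn))  # Contact [EMAIL] or [EMAIL] for details

# A space before the @ leaves nothing for the first \S+ to match.
broken = "Contact snubholic @yahoo.com for details"
print(find_emails(broken))  # []
```

Because \S+ is greedy, it also picks up punctuation attached to an address, so "jane.doe@example.org," would be matched with its trailing comma. A stricter pattern may be worth considering if your data contains emails inside running sentences.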
