Everybody loves a fad. You can pinpoint someone’s generation better than carbon dating by asking them what their favorite toys and gadgets were as a kid. Tamagotchi and pogs? You were born around 1988, weren’t you? Coleco Electronic Quarterback and Garanimals? Well well, an early X-er. A fad is cultural currency and social lubricant at the same time: even if you don’t have the thing itself, it’s a shared reference point that helps locate you as part of a particular time and place. Paradoxically, fads also help identify when a concept has gone stale, depending on who does it.

Image for post

Honestly, he was doing us a favor killing those things off.

Fads happen in business, too. From corporate retreats to themed attire days (back in the olden times when we went to retreats, offices or, you know, anywhere) or the more recent mandatory fun on Zoom, enterprises are no less susceptible to fads, _especially _when they involve technology. Part of it is a desire to seem cutting edge, but a large part of it, we think, is simple misunderstanding. Without a good grasp of new systems and tools or the concepts that underlie them, it’s hard to tell the difference between a fad and a future.

Guess Who?!

Case in point: anonymization. Although the concept of masking identity or erasing identifiable features has long been a component of data science, it was not a widespread topic of discussion in industry in the US until the late 2000s and, really, just before GDPR came into effect and fears of 4% penalties kicked in. Hundreds of vendors promise services that allow you to “anonymize” user data in an effort to find safe harbors or avoid liability, but most businesses have only a vague understanding of what the concept of anonymized data really is and how to do it.

To unpack anonymous data, it’s important to clear up a few terms so that we don’t run into confusion. First, what _is _anonymized? Anonymous data is data that does not relate to an identified or identifiable natural person, or data modified such that the data subject is not or no longer identifiable.

That is an extremely vague definition for a concept that is so important, and so let’s dive into that a little more, because this is a game of definitions (every lawyer’s favorite game). If data, on its own or with other data, can identify you, it’s personal data. We don’t talk about personally identifiable information, any more; that fad has passed. These days, you only talk about personal data.

Image for post

“PII? Are you kidding me?”

There are ways to make data less useful in identifying a person, but that does not mean that it is anonymous. Instead, there are varying degrees of data obfuscation — means hiding attributes to make reidentification more difficult — on the way to actual anonymization. Here are the two most important kinds.

Masked Data

Masked Data is information modified to hide (or “mask”) the underlying, true data. This is a common practice in business, and it is most effective against unauthorized internal review (and pilfering) of valuable business/customer data and against external actors learning important details about clients and vendors. A simplified explanation of masked data is a customer list that details first and last name, age, address, and amount spent with surnames changed to dummy names, ages shifted, and amounts spent reallocated randomly. Much of the derivative analytic data remains the same (amounts spent, total number of customers, locations of accounts, etc) but it is difficult to reidentify any individual user.

What it Isn’t

Having a list where the names and identifiers are shifted is a great business approach, but it usually falls short of anonymous in the real world. Why? Because usable data is _accurate _data, and being able to run the kind of analytics you want means being able to easily mix and match the true underlying information. As such, having the master list (the non-masked data) available means that you will always hold onto the original information, which means you’re still holding personal data, which means you’re not protected by the anonymity safe harbor. Thanks for playing.

Image for post

I…I didn’t realize we were playing a game?

Pseudonymized Data

Pseudonymous data is data that has the most important identifiers removed: names, email addresses, social security numbers, etc. Pseudonymous data still identifies a person, but it isn’t obvious on its face who that person is. Think back to school when they would post grades outside of a classroom but only use student numbers on the chart. In the Mad-Max rush to the sheet of paper to see your grades, it wasn’t possible to see anyone else’s name, and so you only were able to know what your outcome was. This is a good example of pseudonymization and a good example of why it’s used: to protect the rights of individuals from unnecessary exposure of their personal details, including a devastatingly embarrassing failed geometry test in ninth grade.

The more attributes you remove from a dataset, the thinking goes, the more pseudonymized the data becomes, and the closer it gets to full anonymization, at which point you’re in the clear.

#data-science #law #privacy #data

Anonymous Schanonymous
1.10 GEEK