Data is Always Dirty. Its ability to tell a story is only as good as your ability to clean it.

One of the best lessons a Statistics 101 class can teach you is that the quality of your story depends on the quality of your data. The more complex your problem is, the more time you will likely spend cleaning your data before you do anything interesting with it. Some projects may require 90% of your time to be dedicated to cleaning data, and if you don’t enjoy this, or at least accept it, the job can become agonizingly painful. Know thy data, then clean thy data, then and only then analyze thy data!

With the growth of big data, the sky is the limit for how dirty data can get, and those who embrace this fact are well equipped to deal with any type of data that crosses their path. On one public health project I worked on, one of my responsibilities was to update a program that generated quarterly reports on a specific public health metric. The data for the report came from multiple stakeholders. One stakeholder in particular was sitting on a data treasure trove, but their department was not well equipped to handle these types of projects, and the data was dirty. Really dirty.

Not surprisingly, compiling reports for this metric regularly ran into dilemmas that were directly related to the data. Individuals in data-driven fields such as data science have multifaceted jobs, but a simple interpretation breaks their role down into two primary objectives: the first is to organize and analyze data to tell a coherent story (the reporting aspect), and the second is to understand techniques for handling data (the methods and programming aspect). The problem is that these two can be at odds with one another, depending on the goal of the project or initiative being worked on.

This project lends itself perfectly to interview questions that start with “name a difficult project you worked on that caused you frustration and describe how you handled each aspect of this project”. My goal coming in was to make adjustments to something that was already written, so there were people more familiar with the project than I was. To account for this, I focused first on the technical aspects that were addressable and tried to layer in input from each person on the team according to the aspects of the project they excelled at or were particularly interested in. The goal was that no single person (myself included) could lose sight of the overall objective or of their specific tasks, because everyone had an important role. This worked well, until it didn’t. Individuals who already have a stake in a project because they have worked on it bring their inherent biases to the table from the start. This revealed itself when, after some seemingly quick wins that boosted morale, we hit the next (inevitable) roadblock.

