I’ll confess… more than once I have found myself producing, publishing, and publicizing incorrect data. I cannot recall exactly how I found that data — perhaps I ran a SHOW TABLES command in my data lake or data warehouse and got a result back that sounded legit. Or maybe I dug into a dashboard that referenced a column that seemed like what I needed. Perhaps I tried to track down the person who built a summary table a few months back, only to find they had left the company.

Once, when I ran a query and shared some data out broadly, my heart sank to the bottom of my stomach when an executive emailed me asking “hey, where did you pull this from? Your manager just said this metric was 24% higher.” The painful part of these all-too-common stories, and the thing I’ve felt most acutely, is that trust in data is broken.

“Not that you lied to me, but that I no longer believe you, has shaken me.” — Friedrich Nietzsche

This blog post gives an inside glimpse into data at Facebook and Airbnb, with practical advice on how to build a trustworthy data ecosystem.

Data at Facebook

When I worked at Facebook back in 2008, my official role was data analyst for the growth team. As a side project, I took it upon myself to teach 550 colleagues how to write their first SQL query. It was a great experience, and my colleagues loved the feeling of becoming data-informed. Facebook had just developed Hive, and to help it gain adoption, I took the initiative to create these intro classes.

Once my colleagues got through the basics of SELECT and FROM, the first question asked was “how do I find the data I need?” This was a surprisingly challenging question to answer. We had a huge number of data tables with similar names and varying levels of relevance. As their teacher, I didn’t want to point someone to the wrong table, but how was I to know which of the tables was the right one?

Was it dim_user, dim_users, or dim_users_extended? Even if I did manage to point them to the right table, I did not know the nuances of how to query it in order to generate accurate metrics. For example, a simple COUNT(*) on our dim_users table would return a number larger than what we reported for our count of active users. It turns out that unless I filtered out rows with user_type = -1 and kept only rows with active_30d = 1, my results would be dead wrong.
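
To make the caveat concrete, here is a minimal sketch that runs both queries against a tiny in-memory SQLite table. The table and column names dim_users, user_type, and active_30d come from the story above; the user_id column, the five sample rows, and the use of SQLite rather than Hive are assumptions made purely for illustration, not a reconstruction of the real Facebook schema.

```python
import sqlite3

# Hypothetical stand-in for dim_users; the real table lived in Hive.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_users ("
    "user_id INTEGER PRIMARY KEY, user_type INTEGER, active_30d INTEGER)"
)

# Made-up sample rows: (user_id, user_type, active_30d)
sample_rows = [
    (1, 0, 1),   # counted: regular type, active in the last 30 days
    (2, 0, 1),   # counted
    (3, 0, 0),   # excluded: not active in the last 30 days
    (4, -1, 1),  # excluded: user_type = -1
    (5, -1, 0),  # excluded on both conditions
]
conn.executemany("INSERT INTO dim_users VALUES (?, ?, ?)", sample_rows)

# The naive query a newcomer would write: counts every row.
naive_count = conn.execute("SELECT COUNT(*) FROM dim_users").fetchone()[0]

# The query with both filters described in the text applied.
active_count = conn.execute(
    "SELECT COUNT(*) FROM dim_users "
    "WHERE user_type != -1 AND active_30d = 1"
).fetchone()[0]

print(f"naive COUNT(*): {naive_count}")
print(f"filtered active users: {active_count}")
```

With these sample rows the unfiltered count is more than double the filtered one, and nothing in the table name or the query itself hints that the filters are required. That is exactly the kind of undocumented knowledge that makes a table hard to trust.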

This metric definition caveat was a huge issue, causing frustration and embarrassment for my colleagues when they produced incorrect results. I’ve found these metric problems so challenging that my new company, Transform, is building tools and frameworks to help companies correctly define and catalog their key performance indicators.

Facebook architecture diagram


*Note the potential problem with BI tools referencing two different data systems.

Data at Airbnb

When I left Facebook to join Airbnb in 2014 as the PM for data infrastructure and data tools, I vowed to get ahead of this problem before it became institutionalized and intractable. At first glance, the Airbnb data lake already looked pretty daunting, with thousands of tables and nearly a petabyte of data. To be perfectly honest, it was treacherous.

Due to some early infra challenges (that I wrote about here), the organization had low trust in data. We lacked a credible, single source of truth for important data tables and key metrics. Almost all analytical insights were generated by a select few data scientists who had context on their particular data domains, and those folks got bombarded with questions all day long. If one of those data scientists left the company, it was a nightmare to unwind their labyrinthine pipelines to find the actual SQL that defined their metrics.

Although there were some early challenges, we stayed committed to fixing this problem because we recognized real potential to create trustworthy, accurate datasets. The stakes were high with this work, though. If we didn’t build trust in data, the company would have wasted millions of dollars on big data infrastructure, data tools to increase productivity, and a technical data science staff. How could we rebuild faith in the accuracy and correctness of data when there were dangers all around?


An island of truth: practical data advice from Facebook and Airbnb