Data Observability: Building Data Quality Monitors Using SQL

In this article series, we walk through how to build your own data observability monitors from scratch, mapped to five key pillars of data health. Part 1 of this series was adapted from Barr Moses and Ryan Kearns’ O’Reilly training, Managing Data Downtime: Applying Observability to Your Data Pipelines, the industry’s first-ever course on data observability. The associated exercises are available here, and the adapted code shown in this article is available here.

From null values and duplicate rows to modeling errors and schema changes, data can break for many reasons. Data testing is often our first line of defense against bad data, but what happens when data breaks elsewhere in its life cycle?

We call this phenomenon data downtime: periods of time when data is missing, erroneous, or otherwise inaccurate. Data downtime prompts us to ask questions such as the following, each of which translates naturally into a SQL check (see the sketch after this list):

  • Is the data up to date?
  • Is the data complete?
  • Are fields within expected ranges?
  • Is the null rate higher or lower than it should be?
  • Has the schema changed?
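
To make these questions concrete, here is a minimal sketch of two such checks in SQL. It assumes a SQLite-style dialect and a hypothetical EXOPLANETS table with a DATE_ADDED timestamp and a nullable DISTANCE column; the table and column names are illustrative placeholders, not part of the original exercises.

```sql
-- Freshness: count the rows added each day. A day with zero (or
-- unusually few) new rows suggests the pipeline stopped updating.
SELECT
    DATE(date_added) AS day,
    COUNT(*)         AS rows_added
FROM exoplanets
GROUP BY DATE(date_added)
ORDER BY day;

-- Completeness: the fraction of a field that is NULL each day. A
-- sudden spike often means an upstream change broke the column.
SELECT
    DATE(date_added) AS day,
    CAST(SUM(CASE WHEN distance IS NULL THEN 1 ELSE 0 END) AS REAL)
        / COUNT(*)   AS distance_null_rate
FROM exoplanets
GROUP BY DATE(date_added)
ORDER BY day;
```

A monitor is then just a query like this run on a schedule, with an alert whenever the result drifts outside its expected range.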
