Superglue — Journey of Lineage, Data Observability & Data Pipelines

Data plays a critical role in business decisions, AI/ML, product evolution and much more. Timeliness, accuracy, and reliability are the key foundational data requirements for every organization. For a data-driven organization, it’s important to make data easily available for discovery, exploration, processing, governance, and consumption by users like data engineers, analysts, and data scientists. This requires significant investments in building platform tools that democratize data for data users.

Our journey to democratize data at Intuit started with two objectives:

1) reduce the amount of time users spend on building data pipelines (time-to-build)

2) reduce the amount of time users spend on detecting/resolving data issues (time-to-debug)

We built Superglue at Intuit to help users build, manage, and monitor data pipelines. There are four core aspects to Superglue that I’ll cover in this blog: Lineage, Observability, Pipelines, and Personalization. If you are a data leader, architect, or platform engineer, this blog will show you a pattern for building lineage at scale, and how to monitor data and pipelines with the help of lineage. Let me provide some background before we dive deeper.

Petabytes of diverse data, thousands of jobs, and layers of dependencies

Intuit has petabytes of diverse data collected from its products, applications, and third parties. Thousands of Hive, Spark, and Massively Parallel Processing (MPP) jobs use these data sets every day to produce hundreds of reports that provide operational and business insights. Similarly, ML workflows use these data sets for feature engineering and model training.

Insights and features are generated through multiple ingestion, processing, and analytical layers, using frameworks that are owned and managed by multiple teams. For example, one of the key reports depends on data from 18 different data sources that go through 20+ levels of processing.

With such scale and complexity, when there are metric inaccuracies, identifying the root causes becomes extremely challenging.

Users would spend hours, and in some cases days, getting to the root cause of an issue. And the causes of such data issues could be many.

Where do you start to look for failures when an issue is reported? You are looking at thousands of running jobs that are owned by hundreds of users. These jobs use thousands of tables that are processed through many frameworks owned by multiple teams. This is where Superglue’s self-serve debugging experience, built on the foundation of Lineage and Observability, comes in to help detect the root causes of data issues.

Lineage

To get to the root cause of failures that could be in upstream data pipelines, we had to first get visibility into end-to-end lineage. The enterprise scheduler gave us the dependencies that users specified when scheduling their jobs. Interestingly, 90% of analytical jobs were scheduled without any job dependencies. As a result, upstream delays or failures did not prevent downstream jobs from running; these jobs went ahead anyway and caused operational issues, metric inconsistencies, and data holes.

We therefore decided to build lineage tracking based on source code in Git for the data processing and data movement frameworks running Hive, Spark, and MPP workloads. We use open-source and custom SQL parsers to derive relationships between jobs, scripts, and input/output tables. Similar parsing is done for BI reports and homegrown data movement frameworks to find associated tables.
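To make the parsing step concrete, here is a minimal sketch using the open-source sqlglot parser to pull input and output tables out of a SQL statement. sqlglot is one illustrative choice; the parsers we actually use internally, and the Git integration around them, are not shown here.

```python
# Minimal sketch: deriving input/output tables from a SQL statement.
# sqlglot is one illustrative open-source parser; Superglue's internal
# parsers differ.
import sqlglot
from sqlglot import exp

def extract_tables(sql: str, dialect: str = "hive"):
    """Return (input_tables, output_tables) for a single SQL statement."""
    tree = sqlglot.parse_one(sql, read=dialect)

    outputs = set()
    # Targets of INSERT/CREATE are the tables this statement produces.
    for node in tree.find_all(exp.Insert, exp.Create):
        target = node.find(exp.Table)
        if target is not None:
            outputs.add(target.name)

    # Every other table reference is an input.
    inputs = {t.name for t in tree.find_all(exp.Table)} - outputs
    return inputs, outputs

sql = """
INSERT INTO revenue_daily
SELECT o.day, SUM(o.amount) FROM orders o JOIN fx f ON o.ccy = f.ccy
GROUP BY o.day
"""
print(extract_tables(sql))   # ({'orders', 'fx'}, {'revenue_daily'})
```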

Using this metadata, we “glue” together the end-to-end lineage, which includes three key entity types: jobs, tables, and reports (a.k.a. dashboards). Users can search for these entities and land on their lineage view.
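To illustrate the gluing step, here is a minimal sketch that folds parsed job and report metadata into one directed graph. networkx stands in for Superglue’s internal graph store, and all entity names are hypothetical.

```python
# Minimal sketch of the "gluing" step: fold parsed metadata into one
# directed lineage graph. networkx stands in for the internal graph
# store; entity names are hypothetical.
import networkx as nx

G = nx.DiGraph()

def add_job(graph, job, inputs, outputs):
    """A job node, with edges from its input tables and to its outputs."""
    graph.add_node(job, kind="job")
    for table in inputs:
        graph.add_node(table, kind="table")
        graph.add_edge(table, job)         # table feeds job
    for table in outputs:
        graph.add_node(table, kind="table")
        graph.add_edge(job, table)         # job produces table

def add_report(graph, report, tables):
    """A report/dashboard node fed by one or more tables."""
    graph.add_node(report, kind="report")
    for table in tables:
        graph.add_edge(table, report)

add_job(G, "job_revenue_daily", inputs={"orders", "fx"}, outputs={"revenue_daily"})
add_report(G, "rpt_revenue", tables={"revenue_daily"})
```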

Here is an example of table lineage for the job selected from the search page. Jobs are represented as ovals and tables as rectangles. The color of a job indicates whether it has failed (red), completed successfully (green), or is currently active (light green).

Here is the scheduler lineage for the same job, based on the user-specified job dependencies in the enterprise scheduler.

And here is an example of lineage for a report, represented as a circle.

Dependency Recommendation

With table lineage based on source code, and job lineage based on the dependencies specified in the enterprise scheduler, we have visibility into which tables feed which jobs, which tables are produced by which jobs, and which jobs depend on which other jobs. This enabled us to build dependency recommendation as a feature in Superglue: it pinpoints job dependencies that are missing. It’s like saying, “This job depends on these two tables, which are created by these two other jobs, but you haven’t specified those jobs as dependencies. Please add them as dependencies in the scheduler.”
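How such a recommendation might be computed can be sketched as a simple graph walk: find the jobs that produce this job’s input tables, and subtract the dependencies already declared in the scheduler. The graph shape matches the gluing sketch above, and all names are hypothetical.

```python
# Minimal sketch of dependency recommendation: the producers of a job's
# input tables, minus the dependencies already declared in the scheduler.
import networkx as nx

G = nx.DiGraph()
G.add_nodes_from(["job_load_orders", "job_load_fx", "job_revenue_daily"], kind="job")
G.add_nodes_from(["orders", "fx"], kind="table")
G.add_edges_from([
    ("job_load_orders", "orders"), ("job_load_fx", "fx"),          # jobs produce tables
    ("orders", "job_revenue_daily"), ("fx", "job_revenue_daily"),  # tables feed job
])

def recommend_dependencies(graph, job, declared_deps):
    """Producer jobs of this job's input tables that are not yet declared."""
    producers = {
        j
        for table in graph.predecessors(job)   # the job's input tables
        for j in graph.predecessors(table)     # jobs producing those tables
        if graph.nodes[j]["kind"] == "job"
    }
    return producers - set(declared_deps) - {job}

# No dependencies declared in the scheduler, so both producers are flagged.
print(recommend_dependencies(G, "job_revenue_daily", declared_deps=[]))
# {'job_load_orders', 'job_load_fx'}
```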

Lineage APIs

Along with backward lineage, we also made forward lineage available. Forward lineage helps with use cases that need to assess the impact of source and schema changes on downstream pipelines and/or reports. Lineage APIs enabled engineering automation to detect the impact of such changes.
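On a graph like the one sketched above, forward and backward lineage amount to downstream and upstream reachability. Here is a minimal sketch, again using networkx as a stand-in for the internal lineage service.

```python
# Minimal sketch: forward lineage is downstream reachability (impact
# analysis), backward lineage is upstream reachability (root-cause
# analysis). networkx again stands in for the internal lineage service.
import networkx as nx

def forward_lineage(graph: nx.DiGraph, entity: str) -> set:
    """Every job/table/report downstream of an entity."""
    return nx.descendants(graph, entity)

def backward_lineage(graph: nx.DiGraph, entity: str) -> set:
    """Every job/table/report upstream of an entity."""
    return nx.ancestors(graph, entity)

# Before a schema change on a raw table:
#   forward_lineage(G, "orders")       -> pipelines and reports to retest
# When a report shows a metric issue:
#   backward_lineage(G, "rpt_revenue") -> upstream jobs and raw tables to inspect
```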

Lineage APIs and data quality frameworks also played a key role when we moved thousands of analytical pipelines, tables and reports to the public cloud. Using forward lineage APIs, we were able to detect which pipelines and reports could be tested when the raw source data was ready in the cloud. Similarly, when we found metric issues in the cloud reports during migration, we could use backward lineage APIs to identify sources of data issues in raw tables.

Data Observability

Our next step was to build a debugging experience for users, with the objective of reducing the mean-time-to-detect and mean-time-to-restore for data issues from hours to minutes (the time-to-debug metric). With lineage as the backbone, we overlaid the following features to enable data observability (a sketch of the per-run record we capture follows the list):

  • Job execution stats and logs: Integration with the scheduler to capture start time, end time, run time, execution attempts, failures, logs, and job dependencies
  • Table stats: Integration with custom data ingestion frameworks and MPP platforms to capture row counts and table sizes. In some cases, we were able to tap into MPP system tables to get rich table/column profiling stats.
  • Report stats: Integration with Business Intelligence (BI) tools to capture report SQLs, execution stats, and refresh logs
  • Change tracking details: Integration with Git to capture changes to Hive, Spark, and MPP jobs
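
To make this concrete, here is a minimal sketch of the kind of per-run execution record such integrations might produce. The field names are illustrative, not Superglue’s actual schema.

```python
# Minimal sketch of a per-run execution record overlaid on lineage.
# Field names are illustrative, not Superglue's actual schema.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class JobRun:
    job_name: str
    start_time: datetime
    end_time: Optional[datetime] = None   # None while the job is running
    attempts: int = 1
    status: str = "RUNNING"               # RUNNING | SUCCEEDED | FAILED
    log_url: Optional[str] = None         # deep link into scheduler logs
    upstream_jobs: List[str] = field(default_factory=list)

    @property
    def runtime_seconds(self) -> Optional[float]:
        """Run time, once the job has finished."""
        if self.end_time is None:
            return None
        return (self.end_time - self.start_time).total_seconds()
```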

Together, these features made Superglue a single platform for lineage and debugging. Here is an example of the job details page that appears when clicking a job on the lineage canvas.
