Data Catalog 3.0: Modern Metadata for the Modern Data Stack
2020 brought a lot of new words into our everyday vocabulary — think coronavirus, defund, and malarkey. But in the data world, another phrase has been making the rounds… the modern data stack.
The data world has recently converged around the best set of tools for dealing with massive amounts of data, aka the “modern data stack”. This includes setting up data infrastructure on best-of-breed tools like Snowflake for data warehousing, Databricks for data lakes, and Fivetran for data ingestion.
The good? The modern data stack is super fast, easy to scale up in seconds, and requires little overhead. The bad? It’s still a noob in terms of bringing governance, trust and context to data.
That’s where metadata comes in.
So what should modern metadata look like in today’s modern data stack? How can basic data catalogs evolve into a powerful vehicle for data democratization and governance? Why does metadata management need a paradigm shift to keep up with today’s needs?
In the past year, I’ve spoken to over 350 data leaders to understand their fundamental challenges with existing metadata management solutions and construct a vision for modern metadata management. I like to call this approach “Data Catalog 3.0”.
A few years ago, data was primarily consumed by the IT team in an organization. However, today data teams are more diverse than ever: data engineers, analysts, analytics engineers, data scientists, product managers, business analysts, citizen data scientists, and more. Each of these people has their own favorite and equally diverse data tools, everything from SQL, Looker, and Jupyter to Python, Tableau, dbt, and R.
This diversity is both a strength and a struggle. All of these people have different tools, skill sets, tech stacks, ways of working, and ways of approaching a problem… essentially, they each have a unique “data DNA”.
The result is often chaos within collaboration. Frustrated questions like “What does this column name actually mean?” and “Why are the sales numbers on the dashboard wrong again?” bring speedy teams to a crawl when they need to use data.
These questions aren’t anything new. After all, Gartner has published its Magic Quadrant for Metadata Management Solutions for over 5 years now.
But there’s still no good solution. Most data catalogs are little more than band-aid solutions from the Hadoop era, and they have not kept pace with the innovation and advances behind today’s modern data stack.
Just like data, how we think about and work with metadata has steadily evolved over the past three decades. It can be broadly broken down into three stages of evolution: Data Catalog 1.0, Data Catalog 2.0, and Data Catalog 3.0.