A beginner’s guide to implementing the latest industry trend: a data mesh. Ask anyone in the data industry what’s hot these days and chances are “data mesh” will rise to the top of the list.
Ask anyone in the data industry what’s hot these days and chances are “data mesh” will rise to the top of the list. But what is a data mesh and why should you build one? Inquiring minds want to know.
In the age of self-service business intelligence, nearly every company considers themselves a data-first company, but not every company is treating their data architecture with the level of democratization and scalability it deserves.
Your company, for one, views data as a driver of innovation. Your boss was one of the first in the industry to see the potential in Snowflake and Looker. Or maybe your CDO spearheaded a cross-functional initiative to educate teams on data management best practices and your CTO invested in a data engineering group. Most of all, however, your entire data team wishes there were an easier way to manage the growing needs of your organization, from fielding the never-ending stream of ad hoc queries to wrangling disparate data sources through a central ETL pipeline.
Underpinning this desire for democratization and scalability is the realization that your current data architecture (in many cases, a siloed data warehouse or a data lake with some limited real-time streaming capabilities) may not be meeting your needs.
Fortunately, teams seeking a new lease on data need look no further than a data mesh, an architecture paradigm that’s taking the industry by storm.
Much in the same way that software engineering teams transitioned from monolithic applications to microservice architectures, the data mesh is, in many ways, the data platform version of microservices.
As first defined by Zhamak Dehghani, a ThoughtWorks consultant and the original architect of the term, a data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design. Borrowing Eric Evans’ theory of domain-driven design, a flexible, scalable software development paradigm that matches the structure and language of your code with its corresponding business domain.
Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed, domain-specific data consumers and views “data-as-a-product,” with each domain handling their own data pipelines. The tissue connecting these domains and their associated data assets is a universal interoperability layer that applies the same syntax and data standards.
Instead of reinventing Zhamak’s very thoughtfully built wheel, we’ll boil down the definition of a data mesh to a few key concepts and highlight how it differs from traditional data architectures.
(If you haven’t already, however, I highly recommend reading her groundbreaking article, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, or watching Max Schulte’s tech talk on why Zalando transitioned to a data mesh. You will not regret it).
At a high level, a data mesh is composed of three separate components: data sources, data infrastructure, and domain-oriented data pipelines managed by functional owners. Underlying the data mesh architecture is a layer of universal interoperability, reflecting domain-agnostic standards, as well as observability and governance. (Image courtesy of Monte Carlo Data.)
Data meshes federate data ownership among domain data owners who are held accountable for providing their data as products, while also facilitating communication between distributed data across different locations.
While the data infrastructure is responsible for providing each domain with the solutions with which to process it, domains are tasked with managing ingestion, cleaning, and aggregation to the data to generate assets that can be used by business intelligence applications. Each domain is responsible for owning their ETL pipelines, but a set of capabilities applied to all domains that stores, catalogs, and maintains access controls for the raw data. Once data has been served to and transformed by a given domain, the domain owners can then leverage the data for their analytics or operational needs.
Data meshes leverage principles of domain-oriented design to deliver a self-serve data platform that allows users to abstract the technical complexity and focus on their individual data use cases.
As outlined by Zhamak, one of the main concerns of domain-oriented design is the duplication of efforts and skills needed to maintain data pipelines and infrastructure in each domain. To address this, the data mesh gleans and extracts domain-agnostic data infrastructure capabilities into a central platform that handles the data pipeline engines, storage, and streaming infrastructure. Meanwhile, each domain is responsible for leveraging these components to run custom ETL pipelines, giving them the support necessary to easily serve their data as well as the autonomy required to truly own the process.
Underlying each domain is a universal set of data standards that helps facilitate collaboration between domains when necessary — and it often is. It’s inevitable that some data (both raw sources and cleaned, transformed, and served data sets) will be valuable to more than one domain. To enable cross-domain collaboration, the data mesh must standardize on formatting, governance, discoverability, and metadata fields, among other data features. Moreover, much like an individual microservice, each data domain must define and agree on SLAs and quality measures that they will “guarantee” to its consumers.
Your Data Architecture: Simple Best Practices for Your Data Strategy. Don't miss this helpful article.
A data lake is totally different from a data warehouse in terms of structure and function. Here is a truly quick explanation of "Data Lake vs Data Warehouse".
For Big Data Analytics, the challenges faced by businesses are unique and so will be the solution required to help access the full potential of Big Data.
This post explains what a data connector is and provides a framework for building connectors that replicate data from different sources into your data warehouse
In this post, we'll learn Getting Started With Data Lakes.<br><br> This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that's designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You'll also explore key benefits and common use cases.