Ask anyone in the data industry what’s hot these days and chances are “data mesh” will rise to the top of the list. But what is a data mesh and why should you build one? Inquiring minds want to know.

In the age of self-service business intelligence, nearly every company considers itself a data-first company, but not every company treats its data architecture with the level of democratization and scalability it deserves.

Your company, for one, views data as a driver of innovation. Maybe your boss was one of the first in the industry to see the potential in Snowflake and Looker. Or maybe your CDO spearheaded a cross-functional initiative to educate teams on data management best practices and your CTO invested in a data engineering group. Most of all, however, your entire data team wishes there were an easier way to manage the growing needs of your organization, from fielding the never-ending stream of ad hoc queries to wrangling disparate data sources through a central ETL pipeline.

Underpinning this desire for democratization and scalability is the realization that your current data architecture (in many cases, a siloed data warehouse or a data lake with some limited real-time streaming capabilities) may not be meeting your needs.

Fortunately, teams seeking a new lease on data need look no further than a data mesh, an architecture paradigm that’s taking the industry by storm.

What is a data mesh?

In much the same way that software engineering teams transitioned from monolithic applications to microservice architectures, the data mesh is, in many ways, the data platform version of microservices.

As first defined by Zhamak Dehghani, a ThoughtWorks consultant and the original architect of the term, a data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design. The concept borrows Eric Evans’ theory of domain-driven design, a flexible, scalable software development paradigm that matches the structure and language of your code with its corresponding business domain.

Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed, domain-specific data consumers and views “data-as-a-product,” with each domain handling its own data pipelines. The tissue connecting these domains and their associated data assets is a universal interoperability layer that applies the same syntax and data standards across the mesh.

Instead of reinventing Zhamak’s very thoughtfully built wheel, we’ll boil down the definition of a data mesh to a few key concepts and highlight how it differs from traditional data architectures.

(If you haven’t already, however, I highly recommend reading her groundbreaking article, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, or watching Max Schulte’s tech talk on why Zalando transitioned to a data mesh. You won’t regret it.)

At a high level, a data mesh is composed of three separate components: data sources, data infrastructure, and domain-oriented data pipelines managed by functional owners. Underlying the data mesh architecture is a layer of universal interoperability, reflecting domain-agnostic standards, as well as observability and governance. (Image courtesy of Monte Carlo Data.)

Domain-oriented data owners and pipelines

Data meshes federate data ownership among domain data owners, who are held accountable for providing their data as products, while also facilitating communication between data distributed across different locations.

While the data infrastructure is responsible for providing each domain with the solutions needed to process its data, domains are tasked with managing the ingestion, cleaning, and aggregation of that data to generate assets that can be used by business intelligence applications. Each domain owns its ETL pipelines, but all domains share a set of capabilities that stores, catalogs, and maintains access controls for the raw data. Once data has been served to and transformed by a given domain, the domain owners can then leverage it for their analytics or operational needs.
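
To make that division of labor concrete, here is a minimal sketch in Python. Everything in it, the DataProduct shape, the marketing pipeline, the register helper, is a hypothetical illustration, not part of Dehghani’s specification or any real platform’s API:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Metadata a domain publishes alongside the data asset it serves."""
    name: str
    domain: str
    owner: str
    location: str                       # e.g. the warehouse table the domain maintains
    schema: dict = field(default_factory=dict)

def run_marketing_pipeline() -> DataProduct:
    """Owned end to end by the marketing domain, not by a central data team."""
    raw = [{"campaign_id": "c1", "order_id": 42}]            # ingest from domain sources
    cleaned = [r for r in raw if r["order_id"] is not None]  # domain-specific cleaning
    counts = {}                                              # aggregate into a consumable asset
    for row in cleaned:
        counts[row["campaign_id"]] = counts.get(row["campaign_id"], 0) + 1
    return DataProduct(
        name="orders_by_campaign",
        domain="marketing",
        owner="marketing-data@example.com",
        location="warehouse.marketing.orders_by_campaign",
        schema={"campaign_id": "string", "order_count": "int"},
    )

# The shared, domain-agnostic layer then stores, catalogs, and gates access:
def register(catalog: dict, product: DataProduct) -> None:
    catalog[f"{product.domain}.{product.name}"] = product  # discoverable by other domains
```

The important part is the ownership boundary: the transformation logic lives with the marketing team, while cataloging and access control live in the shared layer.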

Self-serve functionality

Data meshes leverage principles of domain-oriented design to deliver a self-serve data platform that abstracts away technical complexity, allowing users to focus on their individual data use cases.

As outlined by Zhamak, one of the main concerns of domain-oriented design is the duplication of the effort and skill needed to maintain data pipelines and infrastructure in each domain. To address this, the data mesh extracts domain-agnostic data infrastructure capabilities into a central platform that handles the data pipeline engines, storage, and streaming infrastructure. Meanwhile, each domain is responsible for leveraging these components to run custom ETL pipelines, giving them the support necessary to easily serve their data as well as the autonomy required to truly own the process.
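
A rough sketch of what that split can look like in practice follows. All names here, including DataPlatform and its methods, are invented for illustration rather than drawn from any vendor’s API:

```python
class DataPlatform:
    """The central team's offering: domain-agnostic infrastructure as a service."""

    def provision_storage(self, domain: str) -> str:
        # e.g. create a schema or bucket with governance policies pre-applied
        return f"warehouse.{domain}"

    def schedule_pipeline(self, domain: str, name: str, task) -> None:
        # e.g. hand the domain's transformation code to a shared orchestrator
        print(f"[{domain}] scheduled pipeline '{name}'")
        task()

# A domain team brings only its business logic and runs it on the platform:
def finance_revenue_rollup() -> None:
    ...  # domain-specific ETL, written and owned by the finance team

platform = DataPlatform()
storage_location = platform.provision_storage("finance")  # no infra expertise needed
platform.schedule_pipeline("finance", "revenue_rollup", finance_revenue_rollup)
```

The design choice to notice is that the platform never sees the finance team’s business logic as anything but an opaque task; the central team scales the engines, while the domains keep full ownership of what runs on them.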

Interoperability and standardization of communications

Underlying each domain is a universal set of data standards that helps facilitate collaboration between domains when necessary — and it often is. It’s inevitable that some data (both raw sources and cleaned, transformed, and served data sets) will be valuable to more than one domain. To enable cross-domain collaboration, the data mesh must standardize formatting, governance, discoverability, and metadata fields, among other data features. Moreover, much like an individual microservice, each data domain must define and agree on the SLAs and quality measures it will “guarantee” to its consumers.
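
Here is a hedged sketch of what such a mesh-wide standard might look like in code; the field names and thresholds are assumptions for illustration, not a published spec:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductStandard:
    """Metadata every domain must publish, whatever its internal stack looks like."""
    name: str
    domain: str
    owner: str
    freshness_sla_hours: int   # how stale the product may get before the SLA is breached
    completeness_pct: float    # share of expected records the domain guarantees
    schema_version: str        # lets consumers detect breaking changes

def validate(product: ProductStandard) -> list:
    """Checks a product against the mesh-wide contract before it is cataloged."""
    errors = []
    if product.freshness_sla_hours <= 0:
        errors.append("freshness SLA must be a positive number of hours")
    if not 0.0 < product.completeness_pct <= 100.0:
        errors.append("completeness must be a percentage in (0, 100]")
    return errors

# A domain publishing a product that meets the standard:
errors = validate(ProductStandard(
    name="orders_by_campaign",
    domain="marketing",
    owner="marketing-data@example.com",
    freshness_sla_hours=24,
    completeness_pct=99.5,
    schema_version="1.2.0",
))
assert not errors  # product satisfies the shared contract and can be cataloged
```

Because every domain ships the same fields, consumers in other domains can discover a product, read its guarantees, and decide whether it meets their needs without ever talking to the producing team.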
