Data Lake and Data Mesh Use Cases

As data mesh advocates come to suggest that the data mesh should replace the monolithic, centralized data lake, I wanted to check in with Dipti Borkar, co-founder and Chief Product Officer at Ahana. Dipti has been a tremendous resource for me over the years as she has held leadership positions at Couchbase, Kinetica, and Alluxio.

Definitions

  • A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the resource format and in addition to the originating data stores.
  • A data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design. Mesh is an abstraction layer that sits atop data sources and provides access.

According to Dipti, while data lakes and data mesh both have use cases they work well for, data mesh can’t replace the data lake unless all data sources are created equal — and for many, that’s not the case.

Data Sources

All data sources are not equal. There are different dimensions of data:

  • Amount of data being stored
  • Importance of the data
  • Type of data
  • Type of analysis to be supported
  • Longevity of the data being stored
  • Cost of managing and processing the data

Each data source has its purpose. Some are built for fast access for small amounts of data, some are meant for real transactions, some are meant for data that applications need, and some are meant for getting insights on large amounts of data.

AWS S3

Things changed when AWS commoditized the storage layer with the AWS S3 object-store 15 years ago. Given the ubiquity and affordability of S3 and other cloud storage, companies are moving most of this data to cloud object stores and building data lakes, where it can be analyzed in many different ways.

Because of the low cost, enterprises can store all of their data — enterprise, third-party, IoT, and streaming — into an S3 data lake. However, the data cannot be processed there. You need engines on top like Hive, Presto, and Spark to process it. Hadoop tried to do this with limited success. Presto and Spark have solved the SQL in S3 query problem.

#big data #big data analytics #data lake #data lake and data mesh #data lake #data mesh

Data Lake and Data Mesh Use Cases