As the Director of Data Engineering at Delhivery (India’s leading fulfilment platform for digital commerce), I’m surrounded by a huge amount of data: more than 1.2 TB per day.

Delhivery fulfills a million packages a day, 365 days a year. Its 24 automated sort centres, 85+ fulfilment centres, 70 hubs, 3,000+ direct delivery centres, 7,500+ partner centres, 15,000+ vehicles and 40,000+ team members run smoothly thanks to a vast network of IoT devices. Nearly 60,000 data events and messages flow in and out of our pipelines every second.

With this much data, it’s probably no surprise that data discovery and organization are a big challenge.

We finally found our dream data cataloging solution, but it was no simple task.

At Delhivery, we started our journey with a data catalog in 2019. Over the next year and a half, we considered and invested in several different types of solutions. We evaluated traditional enterprise-focused data catalogs (e.g. Alation, Collibra, and Waterline), built our own catalog with Atlas and Amundsen, and later adopted the modern SaaS unified data workspace, Atlan.

In this blog post, I’m unpacking all of my learnings as a lead engineer on this project in the hopes that it helps other data practitioners and leaders who are facing similar challenges and want to successfully implement a modern data catalog.

Why Delhivery desperately needed a data cataloging solution

As Delhivery has grown over the past decade, the scale and complexity of its data have grown even faster.

_Earlier in its history, Delhivery generated 1 TB of data per quarter. Now we’re generating 1.2 TB per day._

That data is organized and processed by hundreds of microservices, which means that ownership over our data is distributed across different teams. Delhivery started with a monolithic system, but we took the call to start forking services about 4 years later as the business scaled exponentially.

Teams soon started building their own microservices, motivated by a desire to make data-driven decisions. Everyone wanted to find and access specific data, so they’d reach out to several developers and ask for help. Later on, we realized that developers were becoming a bottleneck, and we needed a Google Search–style way for anyone to find data through a common portal.

However, finding data wasn’t the only issue. Once teams got hold of the data they wanted, they struggled to understand it. There wasn’t a clear way to organize our data and add important information and business context. This quickly became clear throughout our onboarding process — the typical time to onboard a new team member was 1-2 months, but this process eventually grew to 3-4 months as Delhivery and its data kept growing.

By 2019, we realized we desperately needed a data cataloging solution, one where people could navigate through all our data, look for whatever they need, check what a data asset looks like, build a better understanding of other domains or teams within the company, and even add their own info or context about a data asset.

Step 1: Evaluating available commercial data catalog solutions in the market

We started off our search for a data catalog by evaluating commercial products like Alation, Waterline (now called Lumada), and Collibra. We dove deep into their specs and compared them on two main criteria:

  • Features: Did the product have all the features we needed?
  • TCO: What was the total cost of ownership (i.e. the purchase price plus other costs like set-up or operations)? Did it fit with our budget?
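To make the two criteria concrete, here is a minimal sketch of how such a comparison could be scripted. Everything in it is hypothetical: the vendor names, feature labels, cost figures, budget, and three-year horizon are placeholders for illustration, not the numbers or the process we actually used. It treats TCO as the purchase price plus set-up plus operating costs over the horizon, and keeps only the products that have every required feature and fit the budget.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A catalog product under evaluation (all values are illustrative)."""
    name: str
    features: set[str]
    purchase_price: float
    setup_cost: float
    annual_operating_cost: float

    def tco(self, years: int) -> float:
        # TCO = purchase price + one-time set-up + operating costs over the horizon
        return self.purchase_price + self.setup_cost + self.annual_operating_cost * years


def shortlist(candidates, required_features, budget, years=3):
    """Return names of candidates that have every required feature and fit the budget."""
    return [
        c.name
        for c in candidates
        if required_features <= c.features and c.tco(years) <= budget
    ]


# Hypothetical inputs, in arbitrary cost units
candidates = [
    Candidate("Vendor A", {"search", "data_preview", "query"}, 100, 50, 20),
    Candidate("Vendor B", {"search"}, 40, 10, 5),
    Candidate("Vendor C", {"search", "data_preview", "query"}, 30, 20, 10),
]
required = {"data_preview", "query"}

print(shortlist(candidates, required, budget=150))
```

In this made-up example, Vendor A has the features but exceeds the budget and Vendor B is affordable but lacks required features, which mirrors the two ways a product could fall off the list.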

Buying one of these products would have been the simplest fix, but unfortunately, we couldn’t find the right solution. Each one was either missing non-negotiable features (such as seeing a data preview or querying data) or the TCO was just too high for us (due to expensive set-up, licensing and professional service fees).

We didn’t want to settle for something that wasn’t quite right, since setting up a data catalog is a huge commitment. So we realized we needed to build and customise our own internal solution.


Build vs Buy: What We Learned by Implementing a Data Catalog