At GitHub, we serve some of the largest Git repositories on the planet. We also serve some of the fastest-growing repositories. Each day, the largest repositories we host become even larger.

About a year ago, we noticed that the job we use to repack Git repositories began hitting our self-imposed timeouts on larger repositories. Even when expanding these timeouts, failing maintenance on these repositories has generally been the cause of degraded performance that is hard to mitigate.

Today, these problems do not exist. GitHub can repack even the largest repositories we host in a fraction of the time it used to take. In this post, we’ll talk about what problems we were encountering, the solutions we built, how we deployed them safely, and describe some possible future directions.

All of our work here is being contributed to the open source Git project, and will be available in an upcoming release.

The problem

Why is GitHub’s maintenance job so expensive in the first place? It’s because we chose to have maintenance repack the entire contents of each repository into a single packfile. Doing so is expensive, but having just one packfile carries some benefits, too. With only one packfile, looking up objects doesn’t require opening and searching through multiple packs to find it. It also means that all objects can be compressed as a delta relative to all other objects (Git’s packfile format supports cross-pack deltas, but currently Git will never store them on disk). But, the most important reason is that  reachability bitmaps, a performance-critical data structure, are only compatible with a single pack.

A new feature in Git,  multi-pack indexes solves the former problem by making all object lookups go through a single index, but didn’t yet solve the latter. So, we set out to fill in the gaps by bringing bitmap support to multi-pack indexes in order to remove the single-pack limitation on reachability bitmaps.

But in order to build multi-pack bitmaps, we had to solve a number of other interesting problems along the way. First, we had to decide how to arrange the objects in a multi-pack index to achieve good bitmap compression. We also had to figure out how to quickly invert that ordering to translate between bit positions back to the objects they refer to. Some of these steps also yielded notable performance improvements on single-pack repositories, too. Finally, we had to figure out a new repacking strategy that scaled with the size of recent pushes, rather than with the size of the entire repository.

But before we get into all of that, let’s start from the very beginning.

#engineering #open source #git #monorepo

Scaling Monorepo Maintenance
4.05 GEEK