Scaling monorepo maintenance. Today, GitHub can repack even the largest repositories we host in a fraction of the time it used to take. Here's how we did it, and why.
At GitHub, we serve some of the largest Git repositories on the planet. We also serve some of the fastest-growing repositories. Each day, the largest repositories we host become even larger.
About a year ago, we noticed that the job we use to repack Git repositories began hitting our self-imposed timeouts on larger repositories. Even with longer timeouts, failed maintenance on these repositories has generally caused degraded performance that is hard to mitigate.
Today, these problems do not exist. GitHub can repack even the largest repositories we host in a fraction of the time it used to take. In this post, we’ll cover the problems we encountered, the solutions we built, how we deployed them safely, and some possible future directions.
All of our work here is being contributed to the open source Git project, and will be available in an upcoming release.
Why is GitHub’s maintenance job so expensive in the first place? It’s because we chose to have maintenance repack the entire contents of each repository into a single packfile. Doing so is expensive, but having just one packfile carries some benefits, too. With only one packfile, looking up objects doesn’t require opening and searching through multiple packs to find them. It also means that all objects can be compressed as a delta relative to all other objects (Git’s packfile format supports cross-pack deltas, but currently Git will never store them on disk). But the most important reason is that reachability bitmaps, a performance-critical data structure, are only compatible with a single pack.
A new feature in Git, multi-pack indexes, solves the first problem by making all object lookups go through a single index, but it did not yet solve the last one: reachability bitmaps still required a single pack. So, we set out to fill in the gaps by bringing bitmap support to multi-pack indexes in order to remove the single-pack limitation on reachability bitmaps.
But in order to build multi-pack bitmaps, we had to solve a number of other interesting problems along the way. First, we had to decide how to arrange the objects in a multi-pack index to achieve good bitmap compression. We also had to figure out how to quickly invert that ordering to translate bit positions back to the objects they refer to. Some of these steps yielded notable performance improvements on single-pack repositories, too. Finally, we had to figure out a new repacking strategy that scaled with the size of recent pushes, rather than with the size of the entire repository.
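To make the lookup difference concrete, here is a minimal Python sketch. It is a toy model and not Git's on-disk format: the pack names, the short object IDs, and the helper functions `find_in_packs`, `build_multi_pack_index`, and `find_in_midx` are all hypothetical names chosen for illustration. It only shows why searching one combined index avoids probing each pack in turn.

```python
from bisect import bisect_left

def find_in_packs(pack_indexes, oid):
    """Probe each pack's sorted index in turn.

    Returns (pack_name, probes); the probe count grows with the number of packs.
    """
    probes = 0
    for pack_name, sorted_oids in pack_indexes.items():
        probes += 1
        i = bisect_left(sorted_oids, oid)
        if i < len(sorted_oids) and sorted_oids[i] == oid:
            return pack_name, probes
    return None, probes

def build_multi_pack_index(pack_indexes):
    """Merge all per-pack indexes into one sorted list of (oid, pack_name)."""
    merged = {}
    for pack_name, sorted_oids in pack_indexes.items():
        for oid in sorted_oids:
            # If an object appears in several packs, keep the first one seen.
            merged.setdefault(oid, pack_name)
    return sorted(merged.items())

def find_in_midx(midx, oid):
    """A single binary search over the combined index."""
    i = bisect_left(midx, (oid,))
    if i < len(midx) and midx[i][0] == oid:
        return midx[i][1]
    return None

# Hypothetical packs with made-up, shortened object IDs.
packs = {
    "pack-a": ["1a", "7f"],
    "pack-b": ["3c", "9d"],
    "pack-c": ["5e"],
}
midx = build_multi_pack_index(packs)
```

In this toy setup, finding `"5e"` with `find_in_packs` has to probe all three packs, while `find_in_midx` does one search regardless of how many packs exist.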
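The idea of inverting an ordering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats the bitmap ordering as a plain permutation of object positions, and the function name `invert_ordering` and the sample data are hypothetical. It does not describe how Git actually stores or computes this mapping.

```python
def invert_ordering(bit_to_object):
    """Given bit_to_object[bit] = object position, return object_to_bit.

    Both directions then become constant-time array lookups.
    """
    object_to_bit = [0] * len(bit_to_object)
    for bit, obj in enumerate(bit_to_object):
        object_to_bit[obj] = bit
    return object_to_bit

# Assumed example: four objects, with bit 0 referring to object 2, and so on.
bit_to_object = [2, 0, 3, 1]
object_to_bit = invert_ordering(bit_to_object)
```

With both arrays available, translating a set bit back to its object is `bit_to_object[bit]`, and finding the bit for a known object is `object_to_bit[obj]`.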
But before we get into all of that, let’s start from the very beginning.