Delta Compression: Diff Algorithms And Delta File Formats [Practical Guide]

A diff algorithm outputs the set of differences between two inputs. These algorithms are the basis of a number of commonly used developer tools. Yet understanding the inner workings of diff algorithms is rarely necessary to use said tools.

Git is one example where a developer can read, commit, pull, and merge diffs without ever understanding the underlying diff algorithm. Having said that there is very limited knowledge on the subject across the developer community.

The purpose of this article is not to detail how Ably programmatically implemented a diff algorithm across its distributed pub/sub messaging platform, but rather to share our research and provide systematic knowledge on the subject of diff algorithms that could be useful to implementers of diff/delta/patch functionality.

A quick bit of context

For Ably customers like Tennis Australia or HubSpot, Message Delta Compression reduces the bandwidth required to transmit realtime messages by sending only the diff of a message.

This means subscribers receive only the changes since the last update instead of the entire stream. Sending fewer bits is more bandwidth-efficient and reduces overall costs and latencies for our customers. To develop this feature we needed to implement a diff algorithm that supported binary encoding and didn’t sacrifice latency when generating deltas.

Diff algorithms

Purpose and usage

The output of a diff algorithm is called patch or delta. The delta format might be human readable (text) or only machine readable (binary). Human readable format is usually employed for tracking and reconciling changes to human readable text like source code. Binary format is usually space optimized and used in order to save bandwidth. It transfers only the set of changes to an old version of the data already available to a recipient as opposed to transferring all the new data. The formal term for this is delta encoding.

Binary VS Text?

There seems to be a common misconception that diff algorithms are specialized based on the type of input. The truth is, diff algorithms are omnivorous and can handle any input, as long as the input can simply be treated as a string of bytes. That string might consist of the English alphabet or opaque binary data. Any diff algorithm will generate a correct delta given two input strings in the same alphabet.

The misconception that a different algorithm is required to handle binary data arises from commonly used diff/merge tools treating text and binary as if they were actually different. These tools generally aim to provide a human-readable delta, and as such focus on human-readable input to the exclusion of binary data.

The assumption is that binary data is not human-readable so the delta between two binary data inputs will also not be human readable, and thus rendering it human-readable is deemed to be too much effort.

Equality is the only relevant output in the case of binary diffs, and as such, a simple bit-by-bit comparison is considered to be the fastest and most appropriate solution. This categorization of algorithms by the efficiency of solution causes a partitioning of inputs into different types.

Another aspect that adds to the confusion is the line-based, word-based, and character-based classification of textual diff outputs produced by diff/merge tools. A diff algorithm that is described as “line-based” gives the impression that it produces “text-only” output, and that this means that it accepts only text input and never binary data inputs.

#data #compression #distributed-systems #aws #git #diff-algorithms #delta-compression #good-company

A quick bit of context

Diff algorithms

hackernoon.com

Delta Compression: Diff Algorithms And Delta File Formats [Practical Guide]