Let’s take a look at the compression algorithm behind Unix’s compress and most .gif compression.

The Lempel Ziv Welch [LZW] algorithm is a greedy lossless compression algorithm that works by replacing recurring patterns with shorter codes in order to save space.

Lossless is a form of compression where no data is lost.

We’ll go over the algorithm and take a look at an implementation in Python.

For simplicity’s sake, we’ll limit the discussion to encoding ASCII characters.

Encoding

Essentially, if as we’re processing a text file, we’re able to identify recurring phrases, we can substitute those phrases with a code that would be shorter than the original text. Doing this over the entire document gives us the compression we’re after.

LZW relies on a dictionary that maps a phrase to a unique code. The first 256 entries in the dictionary are the single ASCII characters mapped to their numeric code.

For example, at the start, the dictionary might look something like this:

{
  "A": "65",
  "B": "66",
  ....
  "a": "97",
  "b": "98"
}

#python #algorithms #compression

Unix's LZW Compression Algorithm: How Does It Work?
8.50 GEEK