In this post, we are going to explore LZ78, a lossless data-compression algorithm created by Lempel and Ziv in 1978. As an example, the GIF format is based on LZ78. LZ78 takes advantage of a dictionary-based data structure to compress our data. In this case, it makes use of a trie data structure, as it’s more efficient for this compression technique.

The motivation behind this approach was to get rid of the parameterization that was required to optimize LZ77’s performance. For instance, in LZ77 if our search buffer was too small, the resulting encoding would require more space, although the compression time would be lower. On the opposite side, if our search buffer was too big, the compression time would take longer, but the required space would be lower. Obviously, there is not a universal set of parameters that would perform optimally in every case: we need to optimize these paremeters depending on the pattern of our input data. All in all, one of the main motivations behind LZ78 was to create a universal compression algorithm that does not require any knowledge on the input.

Compression

Before getting into the details of the compression process, we should define the trie data structure that will help us store our dictionary of string patterns (also known as phrases):

It is a non-binary tree.The root node represents an empty string.Every node is marked with its dictionary index.Every edge contains the character that should be added to get the value of the child node.In order to get the value of a node, we just need to traverse from our target node to the root node, reading the resulting string from top to bottom.

With this data structure in mind, we may define the compression process:

We read the next character of the input string.We check whether the current node (starting at the root node) has any outbound edge that contains the character read in step 1.If so, we follow the outbound edge and set the current node as the found node. Then, we go back to step 1. Otherwise, we create an outbound edge with the character read in step 1, leading to a new node. The new node is then marked with an incremental index. We create an encoding tuple with the following structure (pn, c)i where pn represents the index of the parent node, c represents the new character read in step 1 and i represents the index of the new node. After appending the new tuple to the encoding stream, we set the current node as the root node and return to step 1.

#data-science

How LZ78 Compression Algorithm Works
5.50 GEEK