What is a Sequence Prediction problem?

The sequence prediction problem consists of predicting the next element of an ordered sequence by looking only at the items observed so far.

This problem arises in a variety of domains, with applications such as product recommendation, forecasting, and web page prefetching.

Many different approaches have been studied for this problem; popular ones include PPM (Prediction by Partial Matching), Markov chains, and, more recently, LSTM (Long Short-Term Memory) networks.

The Compact Prediction Tree (CPT) is an approach published in 2015 that aims to match the accuracy of these popular algorithms while outperforming them in training and prediction time, by storing a lossless compression of the entire training set.

We will now go into detail on how the model is trained, how it makes predictions, and the pros and cons of this method.

Compact Prediction Tree definition

Before entering into the details of how it is used to make a prediction, let's describe the different elements that compose a Compact Prediction Tree (CPT); a minimal code sketch of these structures follows the list:

  • A trie, for efficient storage of the training sequences.
  • An inverted index, for constant-time retrieval of the sequences containing a given item.
  • A lookup table, for retrieving a sequence from its sequence id.
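To make these three structures concrete, here is a minimal Python sketch. The class and method names are my own choices for illustration (they do not come from the original CPT paper or from any particular library); it shows how a training sequence can populate the trie, the inverted index, and the lookup table in a single pass.

```python
class TrieNode:
    def __init__(self, item=None, parent=None):
        self.item = item      # edge label leading into this node (None for the root)
        self.parent = parent  # pointer back up, used to rebuild a sequence from its last node
        self.children = {}    # item -> TrieNode


class CPT:
    def __init__(self):
        self.root = TrieNode()    # the trie: compact storage of all training sequences
        self.inverted_index = {}  # item -> set of ids of the sequences containing that item
        self.lookup_table = {}    # sequence id -> last trie node of that sequence

    def add_sequence(self, seq_id, sequence):
        """Insert one training sequence and index it under its id."""
        node = self.root
        for item in sequence:
            if item not in node.children:                    # extend the trie with a new edge
                node.children[item] = TrieNode(item, node)
            node = node.children[item]
            self.inverted_index.setdefault(item, set()).add(seq_id)
        self.lookup_table[seq_id] = node                     # remember where the sequence ends

    def get_sequence(self, seq_id):
        """Rebuild a training sequence from its id by walking back up to the root."""
        node, items = self.lookup_table[seq_id], []
        while node.parent is not None:
            items.append(node.item)
            node = node.parent
        return list(reversed(items))
```

Storing the last node of each sequence in the lookup table, together with the parent pointers, is what lets the trie act as a lossless compression of the training set: any sequence can be rebuilt exactly from its id.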

Trie

A trie, commonly called a prefix tree, is an ordered tree-based data structure used to store sequences (such as strings). The elements of the sequences are stored on the edges, hence all descendants of a given node share the prefix spelled out on the path from the root to that node.

In this well-known example, we would like to store ["tea", "ten", "inn"].

We first put “tea” in the empty tree, each branch of the tree corresponding to a letter. Then we add “ten”: as “tea” and “ten” share the prefix “te”, we only create a new branch after the “te” prefix. Finally, we add “inn”, which shares no prefix with the two previous sequences, as a new branch from the root.
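Reusing the sketch above, the same three insertions can be traced in code (a Python string iterates letter by letter, so each word is simply a sequence of characters):

```python
cpt = CPT()
for seq_id, word in enumerate(["tea", "ten", "inn"]):
    cpt.add_sequence(seq_id, word)

print(sorted(cpt.root.children))                              # ['i', 't']
print(sorted(cpt.root.children['t'].children['e'].children))  # ['a', 'n'] -> "tea"/"ten" split after "te"
print(cpt.inverted_index['n'])                                 # {1, 2} -> "ten" and "inn" contain 'n'
print(cpt.get_sequence(0))                                     # ['t', 'e', 'a']
```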
