Perplexity is a common metric for evaluating language models. For example, scikit-learn’s implementation of Latent Dirichlet Allocation (a topic-modeling algorithm) includes perplexity as a built-in metric.

In this post, I will define perplexity, then discuss entropy and the relation between the two, and show how that relation arises naturally in natural language processing applications.


A quite general setup in many Natural Language tasks is that you have a language L and want to build a model M for the language. The “language” could be a specific genre/corpus like “English Wikipedia”, “Nigerian Twitter”, or “Shakespeare” or (conceptually at least) something as generic as “French.”

Specifically, by a language L we mean a process for generating text. For clarity, we will consider the case where we are modeling sentences and the text consists of a sequence of words ending with an end-of-sentence “word.” But you can replace “word” with “token” and “sentence” with “document” to generalize to any context.

What is a “process”? For our purposes, we can think of a process as a collection of probability distributions. Given a history h consisting of the previous words in a sentence, the language L gives the probability that the next word is w:

L(w | h) = P(next word is w | history h)

A language is a collection of probability distributions, one for each history h

For example, I am willing to wager that if L is “English”:

  1. L(dog | The quick brown fox jumps over the lazy brown) ≈ 1
  2. L(ipsum | Lorem) ≈ 1
  3. L(wings | Buffalo buffalo buffalo Buffalo buffalo) ≈ 0
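To make the “collection of probability distributions” picture concrete, here is a minimal sketch of a toy language as a lookup table from histories to next-word distributions. The histories, words, and probabilities are all invented for illustration:

```python
# A toy "language": for each history h, a probability distribution
# over the next word. All values here are made up for illustration.
toy_L = {
    ("the", "lazy", "brown"): {"dog": 0.95, "bear": 0.05},
    ("Lorem",): {"ipsum": 0.99, "dolor": 0.01},
}

def next_word_prob(history, word):
    """Return L(word | history), or 0.0 for an unseen continuation."""
    dist = toy_L.get(tuple(history), {})
    return dist.get(word, 0.0)

print(next_word_prob(["the", "lazy", "brown"], "dog"))  # 0.95
```

A real language has a distribution for every possible history, so it can never be stored as a finite table like this; the table is only a mental model.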

Similarly, given an entire sentence s, we can evaluate L(s), the probability of the sentence occurring. If we include a special beginning-of-sentence “word” w₀ and let the n-th “word” be the end-of-sentence “word”, we get

L(s) = ∏ᵢ₌₁ⁿ L(wᵢ | w₀ w₁ … wᵢ₋₁)

The language L gives the probability of a sentence s

However, it is common to leave out the first term in the product (the one conditioned only on w₀), or to work with an even longer starting context.
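The product above is easy to compute once the per-word conditionals are known. The following sketch uses invented conditional probabilities M(wᵢ | w₀ … wᵢ₋₁) for a four-word sentence:

```python
import math

# Hypothetical conditional probabilities, one factor per word
# (including the end-of-sentence "word"); the numbers are invented.
conditionals = [0.2, 0.5, 0.9, 0.25]

def sentence_prob(conditionals):
    """P(s) = product of the per-word conditional probabilities."""
    return math.prod(conditionals)

print(sentence_prob(conditionals))  # 0.0225
```

Note that the product shrinks quickly with sentence length, which is one reason per-word metrics like perplexity (below) are preferred over raw sentence probability.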

It is surprisingly easy to get a perfect replica of L for (say) spoken American English: just flag down any native English speaker walking down the street. Of course, we are usually interested in teaching a computer the model (hence, Machine Learning). So we will let M be whatever language model we have managed to build on a computer.

This setup, with a language L and model M, is quite general and plays a role in a variety of Natural Language tasks: speech-to-text, autocorrect, autocomplete, machine translation – the list goes on. Autocomplete is the most obvious example: given the words someone has typed so far, try to guess what they might type next by picking the highest-probability completion.¹
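In code, that autocomplete rule is just an argmax over the model’s next-word distribution. The distribution below is a made-up example:

```python
# Autocomplete as argmax over the model's next-word distribution
# M(w | history). The probabilities here are invented.
dist = {"dog": 0.95, "bear": 0.04, "cat": 0.01}

# Pick the word with the highest conditional probability.
completion = max(dist, key=dist.get)
print(completion)  # dog
```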


Given a language model M, we can use a held-out dev (validation) set to compute the perplexity of a sentence. The perplexity of a sentence s is defined as:

PP(s) = M(s)^(−1/n)
      = ( ∏ᵢ₌₁ⁿ 1 / M(wᵢ | w₀ w₁ … wᵢ₋₁) )^(1/n)

Perplexity of a language model M

You will notice from the second line that this is the inverse of the geometric mean of the terms in the product’s denominator. Since each word has its probability (conditional on the history) computed exactly once, we can interpret this as a per-word metric. This means that, all else being equal, perplexity is not affected by sentence length.

In general, we want our probabilities to be high, which means the perplexity is low. If all the probabilities were 1, then the perplexity would be 1 and the model would perfectly predict the text. Conversely, for poorer language models, the perplexity will be higher.
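The definition above can be sketched directly in a few lines, taking the per-word conditional probabilities as input (the numbers passed in are invented for illustration):

```python
import math

def perplexity(conditionals):
    """Perplexity = inverse geometric mean of the per-word
    conditional probabilities M(w_i | w_0 ... w_{i-1})."""
    n = len(conditionals)
    return math.prod(1.0 / p for p in conditionals) ** (1.0 / n)

# Perfect prediction: every probability is 1, so perplexity is 1.
print(perplexity([1.0, 1.0, 1.0]))            # 1.0

# Uniform guessing among 4 choices gives perplexity 4.
print(round(perplexity([0.25, 0.25, 0.25]), 2))  # 4.0
```

The second example illustrates a useful intuition: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words at each step.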


The relationship between Perplexity and Entropy in NLP