
Perplexity

4 Mar 2025 ML NLP RNN

You might be here because you typed "perplexity" into Google and got a bunch of AI search engine ads. But if your language model thinks "It is raining banana tree" is a perfectly valid statement, you've got a perplexity problem! Let's use perplexity to make sure your language model doesn't think bananas grow on clouds.
We need to understand a few things before learning about perplexity. So, let's dive in!

How to evaluate a language model?

Let's say, you have built a language model. How do you know it is any good? The core idea is that a good language model should be able to predict the next word in a sequence with high accuracy.
Consider the following continuations of the phrase "It is raining", as proposed by different language models:

  • It is raining outside
  • It is raining banana tree
  • It is raining piouw;kcj pwepoiut
Example 1 is clearly the winner here: it is both grammatically and semantically correct. Example 2 is worse, but at least the grammar is correct and the words are spelled properly. Example 3 is the worst, being both grammatically and semantically nonsensical.

Likelihood and its problems

We can measure the quality of a model by computing the likelihood of the sequence. Unfortunately, this is problematic:

  • Longer sequences tend to have lower probabilities simply because they are longer: every extra word multiplies in another factor smaller than one, and long word sequences also appear fewer times in the training data. (A small numerical sketch follows this list.)
    A sentence in a short story will naturally have a higher probability than an entire chapter in a novel, even if the model is equally good at predicting words in both. Short sentences generally have higher probabilities because they tend to appear more often.
    Say you are reading a deep learning book. The phrase "The weights of the" is short and more likely to appear in the book than the longer phrase "The weights of the neural network", because the book may also discuss the weights of models other than neural networks.
  • If we train the model and want to test how well it does, we need to account for the fact that different documents have different lengths, and length directly affects the likelihood calculation.
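Here is a minimal numerical sketch of the length problem. The per-token probability of 0.7 is made up; the point is only that both sequences come from an equally good, equally confident model:

```python
import math

# Hypothetical per-token probabilities, all produced by the same model.
# The model is equally confident (0.7) about every single token.
short_seq = [0.7] * 5    # a 5-word sentence
long_seq = [0.7] * 50    # a 50-word passage

def likelihood(token_probs):
    """Joint probability of the sequence: the product of per-token probabilities."""
    return math.prod(token_probs)

print(likelihood(short_seq))   # ~0.17
print(likelihood(long_seq))    # ~1.8e-08 -- tiny, even though the model is just as good
```

The raw likelihood punishes the longer passage purely for its length, which is why the metrics below normalize by the number of tokens.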

Enter Information Theory: Our Hero

This is where Information Theory comes in. A good language model tells us how likely each word is, given the context. If the model is very confident that a particular word is coming next, we can use a short code to represent that word. If the model is uncertain, we need a longer code.
Check out this video by Artem Kirsanov to get an intuition about cross-entropy.
Basically, cross-entropy measures how well the model's predicted distribution matches the true distribution: it is the average number of bits (or nats, if we use the natural logarithm) needed to encode the actual words using the model's predictions. The formula for the cross-entropy loss over a sequence is: $$\frac{1}{n} \sum_{t=1}^{n} (-\log P(x_t | x_{t-1}, \ldots, x_1))$$
Where,
      $n$: the total number of tokens (words) in the sequence.
      $x_t$: the actual word observed at time step $t$ in the sequence.
      $P(x_t | x_{t-1}, \ldots, x_1)$: the probability the model assigns to $x_t$ given all the words that came before it.
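As a minimal sketch, the formula translates directly into Python. The token probabilities are just a list you would get from whichever model you are evaluating:

```python
import math

def cross_entropy(token_probs):
    """Average surprisal, (1/n) * sum_t -log P(x_t | x_{t-1}, ..., x_1), in nats.

    token_probs: the probability the model assigned to each word that actually
    occurred, one value per time step t = 1..n.
    """
    n = len(token_probs)
    return sum(-math.log(p) for p in token_probs) / n
```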

Let's take an example, "Perplexity is a good metric".

  • Word 1: "Perplexity"
    Let's say our language model predicts that the probability of "Perplexity" starting a sentence is P(Perplexity) = 0.1.
    The surprisal is -ln(0.1) ≈ 2.30. Since we are using the natural logarithm, the unit is the nat rather than the bit: we need about 2.3 nats to encode the word "Perplexity".
  • Word 2: "is"
    The model predicts P(is | Perplexity) = 0.8. The model thinks "is" is fairly likely to follow "Perplexity".
    The surprisal is -ln(0.8) ≈ 0.22. We need far fewer nats to encode "is" because the model was much more confident in its prediction than the previous one.
  • Word 3: "a"
    The model predicts P(a | Perplexity, is) = 0.9.
    The surprisal is -ln(0.9) ≈ 0.11. We need even fewer nats than for the previous word.
  • Word 4: "good"
    The model predicts P(good| Perplexity, is, a) = 0.7.
    The surprisal is -ln(0.7) ≈ 0.36.
  • Word 5: "metric"
    The model predicts P(metric | Perplexity, is, a, good) = 0.4. The model is less sure that "metric" follows "good".
    The surprisal is -ln(0.4) ≈ 0.92.
Calculating Cross-Entropy Loss
Sum of Surprisals: 2.30 + 0.22 + 0.11 + 0.36 + 0.92 ≈ 3.91 nats.
Number of Words: n = 5
Cross-Entropy Loss: (1/5) * 3.91 ≈ 0.78 nats per word.
This means that, on average, our language model requires about 0.78 nats to encode each word in the sentence "Perplexity is a good metric".
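The same numbers can be checked in a few lines of Python. The probabilities below are the made-up values from the example, not real model outputs:

```python
import math

# Per-token probabilities from the worked example above.
probs = [0.1, 0.8, 0.9, 0.7, 0.4]

surprisals = [-math.log(p) for p in probs]   # natural log -> nats
print([round(s, 2) for s in surprisals])     # [2.3, 0.22, 0.11, 0.36, 0.92]

loss = sum(surprisals) / len(surprisals)
print(round(loss, 2))                        # 0.78 nats per word
```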

Here comes Perplexity: Our new hero

Perplexity is simply the exponential of the cross-entropy loss.
$$\exp\left(\frac{1}{n} \sum_{t=1}^{n} (-\log P(x_t | x_{t-1}, \ldots, x_1))\right)$$ Perplexity measures how "confused" a language model is when predicting the next word in a sequence. A lower perplexity means a more confident model that is more accurate in its predictions.
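For the worked example, exponentiating the cross-entropy we computed above gives the model's perplexity on that sentence. The 0.78 here is the hand-computed value, so treat the result as approximate:

```python
import math

# Cross-entropy loss from the worked example, in nats per word.
loss = 0.78

perplexity = math.exp(loss)
print(round(perplexity, 2))   # ~2.18
```

A perplexity of about 2.2 means the model is, on average, about as uncertain as if it were choosing uniformly between roughly two equally likely words at each step.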

Some cases:
Best Case (Perplexity = 1): The model always predicts the next word perfectly, with probability 1. It is never surprised. This is the theoretical ideal, but it is practically impossible to achieve on real-world language data.
Worst Case (Perplexity = Infinity): The model assigns probability 0 to the word that actually comes next. It is completely wrong every time.
Baseline (Perplexity = Vocabulary Size): If the model predicts a uniform distribution over all words in the vocabulary (i.e., it assigns equal probability to every word), then the perplexity equals the number of unique words in the vocabulary. This represents a naive model that has no knowledge of the language. The goal is always to beat this baseline.
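A quick sanity check of the baseline case; the vocabulary size of 10,000 is arbitrary and only there to make the point:

```python
import math

vocab_size = 10_000              # arbitrary vocabulary size for this sketch
p_uniform = 1.0 / vocab_size     # uniform model: every word equally likely

# Per-token cross-entropy of the uniform model is -ln(1/V) = ln(V),
# so perplexity = exp(ln(V)) = V, i.e. the vocabulary size itself.
perplexity = math.exp(-math.log(p_uniform))
print(round(perplexity))         # 10000
```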

Conclusion

In simple words, perplexity is the effective number of words the model is choosing between when it tries to predict the next word. The lower the number, the better the language model.
