If you know anything about Donald Trump, you can probably guess what he would say after the
words, "Make America". Of course, he would say, "Great Again". We know this because we are
humans and we have brains.
But which language models can understand Trump best?
We will compare five language models: N-grams, a simple RNN, an LSTM, a GRU, and finally a Transformer.
GitHub for full code: https://github.com/yashasnadigsyn/trumpism/
Colab for comparison: https://colab.research.google.com/drive/1gd2oAOlzANl-fi6teAdQfea-Py2TASvn?usp=sharing
The fundamental goal of a language model is to predict the probability of a sequence of words (or characters).
Given some text the model has already seen, what is the likelihood that the next word will be a specific word?
More formally, a language model estimates the joint probability of the whole sequence: P(x1, x2, …, xT).
Let's say T is 5 and the sequence is "the cat sat on mat". The joint probability is the likelihood of that exact sequence appearing in that specific order. A good language model assigns a higher probability
to plausible sequences and a lower probability to nonsensical ones.
The Core Idea of language models is to predict the next word in a sequence, given the words that came before.
The obvious question is how to model a document, or a sequence of tokens?
We use the chain rule of probability applied to a sequence of words: P(x1, x2, …, xT) = P(x1) * P(x2 | x1) * … * P(xT | x1, …, xT-1)
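To make the chain rule concrete, here is a tiny sketch (not from the repo) of how a sequence would be scored, assuming a hypothetical `next_word_prob(history, word)` function that returns the conditional probability of a word given its history:

```python
import math

def sequence_log_prob(tokens, next_word_prob):
    """Chain rule: log P(x1..xT) = sum over t of log P(x_t | x_1..x_{t-1})."""
    log_p = 0.0
    for t, word in enumerate(tokens):
        history = tokens[:t]                          # everything seen before position t
        log_p += math.log(next_word_prob(history, word))
    return log_p

# e.g. sequence_log_prob("the cat sat on mat".split(), next_word_prob)
```

Every model in this post is just a different way of approximating that `next_word_prob` term.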
N-grams
The N-gram is a statistical model rather than a deep learning model; it is the only model in this
blog that does not use a neural network.
N-grams rely on the Markov property: a sequence has the Markov property (of order 1) if the probability of the next word depends only on the current word; higher orders mean it depends on more of the previous words.
N-grams are models that use this Markov assumption. Here, "n" is the number of words considered as context. For example, with n=2 (bigrams) the probability of a word depends
only on the previous word, and with n=3 (trigrams) it depends only on the previous two words.
Obviously, I am not going to explain the whole N-gram here. But, if you want to learn more, look here.
I created a simple Streamlit app to see how N-grams work on the Trump transcripts. The vocabulary size is nearly 8k; compare that to GPT-3, whose vocabulary has about 50k tokens. So this model is very minimal.
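Under the hood, the core of a count-based trigram model is just counting followers and picking the most frequent one. A minimal sketch in Python (not the exact app code; `trump_transcripts.txt` is a hypothetical file name):

```python
from collections import defaultdict, Counter

def build_trigram_model(tokens):
    """Count how often each word follows every two-word context."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def generate(model, w1, w2, max_len=10):
    """Greedily append the most frequent follower of the last two words."""
    out = [w1, w2]
    for _ in range(max_len):
        followers = model.get((out[-2], out[-1]))
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# tokens = open("trump_transcripts.txt").read().lower().split()  # hypothetical file
# print(generate(build_trigram_model(tokens), "make", "america"))
```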
Demo: Streamlit
Result:
Given: make america
Generated: great again .
RNN
Markov models (and n-grams) have a fundamental limitation: they can only "remember" a fixed number of previous words. If we want to capture long-range dependencies in text, we need to increase the order of the Markov model (the value of n). However, increasing n leads to an exponential increase in the number of parameters the model needs to store. If |V| is the size of our vocabulary, an n-gram model needs to store |V|^n numbers to represent all the probabilities. This becomes computationally infeasible very quickly: a moderate vocabulary of 20,000 words with n=5 would need 20,000^5 = 3.2 * 10^21 parameters! RNNs solve this by compressing the entire history into a fixed-size hidden state instead of enumerating every possible context.
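To see how the hidden state replaces the |V|^n table, here is a minimal next-word RNN in PyTorch (a sketch, not the exact Colab code; the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)   # scores over the whole vocabulary

    def forward(self, x, hidden=None):
        # x: (batch, seq_len) of token ids
        emb = self.embed(x)
        out, hidden = self.rnn(emb, hidden)           # the hidden state summarizes the history
        return self.fc(out), hidden

# logits, _ = RNNLanguageModel(vocab_size=8000)(torch.randint(0, 8000, (1, 5)))
```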
Demo: Colab
Result:
Given: make america
Generated: white community the best
LSTM
LSTMs are essentially RNNs with a special "memory cell" that replaces the ordinary recurrent node in standard RNNs.
This memory cell is the heart of the LSTM and provides a mechanism for selectively remembering or forgetting information over time.
Each LSTM contains both long-term memory (the weights) and short-term memory (the activations), as well as a special medium-term memory: the memory cell's internal state.
The memory cell has an internal state and a number of "gates" that control the flow of information into and out of the cell.
- Input Gate: How much of the new input should affect the internal state.
- Forget Gate: Whether the current value of the memory should be flushed (set to zero).
- Output Gate: Whether the internal state of the neuron is allowed to affect the cell's output.
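To make the three gates concrete, here is one LSTM time step written out by hand (a sketch of the standard equations rather than the Colab code, which would normally just use `nn.LSTM`):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (input_dim, 4*hidden), U: (hidden, 4*hidden), b: (4*hidden,)
    stack the parameters for the input (i), forget (f), output (o) and candidate (g) paths."""
    z = x_t @ W + h_prev @ U + b
    i, f, o, g = z.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_t = f * c_prev + i * g      # forget gate scales old memory, input gate admits new info
    h_t = o * torch.tanh(c_t)     # output gate decides how much of the state is exposed
    return h_t, c_t
```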
Demo: Colab
Result:
Given: make america
Generated: really as i was talking about
GRU
RNNs and especially LSTMs became very successful in the 2010s for handling sequential data (like text, time series, etc.). They were great at remembering information over time. However, LSTMs, while powerful, can be computationally expensive.
The GRU architecture was developed as a more computationally efficient alternative to LSTMs. The goal was to simplify the LSTM's gating mechanism while retaining its ability to capture long-range dependencies and achieve comparable performance in many sequence modeling tasks.
Here, the LSTM's three gates are replaced by two: the reset gate and the update gate.
The reset gate controls how much of the previous state we still want to remember.
The update gate controls how much of the old hidden state is kept versus replaced by the new candidate state computed from the input.
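Written out, one GRU step with its two gates looks roughly like this (again a sketch of the standard update equations, not the Colab code):

```python
import torch

def gru_step(x_t, h_prev, p):
    """One GRU time step. `p` is a dict of weights/biases for the
    reset gate (r), update gate (z) and candidate state (n)."""
    r = torch.sigmoid(x_t @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])       # reset gate
    z = torch.sigmoid(x_t @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])       # update gate
    n = torch.tanh(x_t @ p["W_n"] + r * (h_prev @ p["U_n"]) + p["b_n"])    # candidate state
    return z * h_prev + (1 - z) * n   # update gate blends old state and candidate
```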
Demo: Colab
Result:
Given: make america
Generated: community and other responsibilities with a through line
Transformers
Finally, we are here at Transformers.
While deep learning in the 2010s was largely driven by MLPs, CNNs, and RNNs, the architectures themselves hadn't changed drastically from older neural network concepts.
The innovations were more in training techniques (ReLU, BatchNorm, etc.) and in leveraging more compute and data.
The Transformer architecture was a fundamental shift. The core idea behind Transformers is the attention mechanism, which was initially proposed as an enhancement for encoder-decoder RNNs in machine translation.
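At its core, scaled dot-product self-attention is only a few lines: each position forms a query, compares it against the keys of every position, and takes a weighted average of the values. A minimal sketch (not the wingedsheep code):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention.
    x: (batch, seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # how strongly each position attends to every other
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                        # weighted average of the values
```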
I am not going to show the generated output for this part because it takes a long time to run.
But, I will give you the code if you wish to run it: Code for Transformers.
The Transformer code above is heavily inspired by wingedsheep, and I have copied several parts from it.
I edited the code to use our Trump transcript dataset and added some preprocessing steps.
Conclusion
Looking at the results, we see an interesting, almost counter-intuitive outcome: the simplest model, the N-gram,
produced the exact phrase "great again," arguably the most expected completion given the dataset. The more sophisticated RNN, LSTM, and GRU models, however, generated different,
less predictable sequences.
Why did this happen? The N-gram model's success here highlights its strength: memorizing and retrieving high-frequency sequences.
The phrase "make america great again" is undoubtedly one of the most common sequences in the Trump transcript corpus.
N-grams, relying purely on conditional probabilities of adjacent words (like P(great | america) or P(great | make america)), excel at capturing these extremely frequent patterns.
However, this is exactly what we don't want the model to do. We want the model to generalize, not just memorize and repeat the same sentence every time.
Rare word combinations appear seldom or never in the counts, and another big problem with N-grams is that they can easily get stuck in loops, repeatedly outputting common phrases without progressing meaningfully.
The RNNs (or LSTMs/GRUs) generalize better and give different outputs each time, which feels more natural. Another caveat is that our vocabulary is very small (8k), and these models often require significantly more training data and epochs than used here to fully converge.
Finally, the Transformer architecture represents the current state-of-the-art, largely because its self-attention mechanism directly addresses the long-range dependency problem far more effectively than even LSTMs or GRUs.
Also, the calculation for each position can be done in parallel once the initial queries, keys, and values are computed. There are no step-by-step dependencies like in RNNs. This is why Transformers are state-of-the-art models used for LLMs.