GPT and BERT: A Comparison of Transformer Architectures
Leonard Püttmann
Posted on February 9, 2023
Transformer models such as GPT and BERT have taken the world of machine learning by storm. While the general structures of both models are similar, there are some key differences. Let’s take a look.
The original Transformer architecture
The first transformer was presented in the famous paper "Attention Is All You Need" by Vaswani et al. It was intended for machine translation and used an encoder-decoder architecture that didn't rely on things like recurrence. Instead, the transformer focused on something called attention. In a nutshell, attention is like a communication layer placed on top of the tokens in a text, which allows the model to learn the contextual connections between words in a sentence.
From this original transformer paper, different models emerged, some of which you might already know. If you have spent some time exploring transformers already, you've probably come across the image below, outlining the architecture of the first transformer model.
The approach of using an encoder and a decoder is nothing new. It means that you train two neural networks, one for encoding and one for decoding. This is not limited to transformers; we can also use the encoder-decoder architecture with other types of neural networks, such as LSTMs (Long Short-Term Memory networks). It is especially useful when we want to convert an input into something else, such as a sentence in one language into another language, or an image into a text description.
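To make this tangible, here is a minimal sketch, assuming the Hugging Face transformers library is installed, that uses an encoder-decoder transformer (T5) to translate a sentence from English to German. The model choice and example sentence are just illustrative:

```python
# A minimal sketch of an encoder-decoder transformer in action:
# the encoder reads the English sentence, the decoder writes the German one.
# Assumes: pip install transformers torch sentencepiece
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")

result = translator("The weather is nice today.")
print(result[0]["translation_text"])
```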
The crux of the transformer is the use of (self-)attention. Things like recurrence are dropped completely, hence the title of the original paper: "Attention Is All You Need"!
GPT vs BERT: What’s The Difference?
The original transformer paper sprouted lots of really cool models, such as the almighty GPT or BERT.
GPT stands for Generative Pre-trained Transformer, and it was developed by OpenAI to generate human-like text from given inputs. It uses a language model that is pre-trained on large datasets of text to generate realistic outputs based on user prompts. One advantage GPT has over other deep learning models is its ability to generate long sequences of text without sacrificing accuracy or coherence. In addition, it can be used for a variety of tasks, including translation and summarization.
BERT, which stands for Bidirectional Encoder Representations from Transformers, was developed by the Google AI Language team and open-sourced in 2018. Unlike GPT, which only processes input from left to right, the way humans read words, BERT looks at the input in both directions at once in order to better understand the context of a given text. BERT has also been shown to outperform traditional NLP models such as LSTMs on various natural language understanding tasks.
There is, however, an extra difference in how BERT and GPT are trained:
BERT is a transformer encoder: inputs and outputs are aligned position by position, so for each position in the input, the model produces an output at that same position. During pre-training, some input tokens are replaced with a special [MASK] token, and BERT learns to predict the original token at each masked position. Because it only has an encoder stack, BERT generates all of its outputs at once rather than one token at a time.
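As a small illustration, here is a minimal sketch, again assuming the Hugging Face transformers library, of BERT's masked-token prediction. The example sentence is arbitrary:

```python
# BERT fills in the [MASK] token using context from both the left and the right.
# Assumes: pip install transformers torch
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Transformers are a type of [MASK] network."):
    print(prediction["token_str"], round(prediction["score"], 3))
```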
GPT is an autoregressive transformer decoder, which means that each token is predicted conditioned on the previous tokens. No encoder is needed, because the previously generated tokens are fed back into the decoder itself. This makes these models really good at tasks like language generation, but less suited for classification. They can be trained on large unlabeled text corpora from books or web articles.
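And here is the autoregressive counterpart, a minimal sketch using the openly available GPT-2 model, where each new token is predicted from everything generated so far. The prompt is just an example:

```python
# GPT-2 generates text one token at a time, each conditioned on the previous tokens.
# Assumes: pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator("Transformer models are", max_new_tokens=20, num_return_sequences=1)
print(result[0]["generated_text"])
```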
The special thing about transformer models is the attention mechanism, which allows these models to understand the context of words more deeply.
How does attention work?
The self-attention mechanism is a key component of transformer models, and it has revolutionized the way natural language processing (NLP) tasks are performed. Self-attention allows the model to attend to different parts of an input sequence in parallel, letting it capture complex relationships between words or sentences without relying on recurrence or convolutional layers. This makes transformer models more efficient than traditional recurrent neural networks while still achieving superior results on many NLP tasks. In essence, self-attention enables transformers to encode global context into representations that can be used by downstream tasks such as text classification and question answering.
Let's take a look at how this works. Imagine that we have a text x, which we convert from raw text into vectors using an embedding algorithm. To apply attention, we derive a query (q), a key (k), and a value (v) vector from each of these embeddings; the attention mechanism then maps each query against the set of key-value pairs to compute an output. The result z is called the attention head and is then passed through a simple feed-forward neural network.
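Here is a toy sketch of scaled dot-product self-attention in plain NumPy. The weight matrices are random for illustration; in a real transformer they are learned:

```python
import numpy as np

np.random.seed(0)
seq_len, d_model = 4, 8                   # 4 tokens, embedding size 8

x = np.random.randn(seq_len, d_model)     # embedded tokens of our text x
W_q = np.random.randn(d_model, d_model)   # learned projections in a real model
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

q, k, v = x @ W_q, x @ W_k, x @ W_v       # queries, keys, values

scores = q @ k.T / np.sqrt(d_model)       # how strongly each token attends to every other
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
z = weights @ v                           # the attention head: one context-aware vector per token

print(z.shape)                            # (4, 8)
```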
If this sounds confusing to you, here is a visualization that highlights the connections built by the attention mechanism:
You can explore this yourself in this super cool Tensor2Tensor Notebook here.
In conclusion, while both GPT and BERT are examples of transformer architectures that have been shaping the field of natural language processing in recent years, they have different strengths and weaknesses that make them suitable for different types of tasks. GPT excels at generating long sequences of coherent text, whereas BERT focuses on understanding the context within given texts in order to perform tasks such as question answering or sentiment analysis. Data scientists, developers, and machine learning engineers should decide which architecture best fits their needs before embarking on an NLP project using either model. Ultimately, both GPT and BERT are powerful tools that offer unique advantages depending on the task at hand.
Get refinery today
Download refinery, our data-centric IDE for NLP. In our tool, you can use state-of-the-art transformer models to process and label your data.
Get it for free here: https://github.com/code-kern-ai/refinery
Further articles:
NanoGPT by Andrej Karpathy https://github.com/karpathy/nanoGPT/blob/master/train.py
BERT model explained https://www.geeksforgeeks.org/explanation-of-bert-model-nlp/
Encoder-decoder in LSTM neural nets https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/
The illustrated transformer by Jay Alammar http://jalammar.github.io/illustrated-transformer/