Large Language Models: Comparing Gen 1 Models (GPT, BERT, T5 and More)

The creation of Large Language Models (LLMs) began in 2018. Three factors emerged and were combined in LLMs: powerful computers and graphics processing units, huge amounts of structured and unstructured data that could be processed fast, and first-rate open-source projects for creating and training neural networks.

LLMs are based on the transformer architecture of neural networks. This specific type of network enables text processing so that individual words are weighted in the context of all surrounding words. Given enough texts and a large context window, neural networks learn the meanings and relationships of words, sentences, and whole paragraphs. The essential capability of such a model is to predict the most likely word given a context of other words.
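
To make the idea of weighting words by their context concrete, here is a minimal sketch of scaled dot-product attention, the core operation of the transformer, for a single query vector. The vectors are made-up toy numbers and the function names are chosen for this example only. Real models learn these vectors and process many heads and positions in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    dims = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dims)]
    return weights, output

# Toy 2-dimensional vectors for a three-word context (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, output = attend([1.0, 1.0], keys, values)
print(weights, output)
```

The third key points in the same direction as the query, so it receives the largest weight, and the output is a blend of the value vectors in proportion to these weights.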

Since their inception, language models have become ever more sophisticated, rivalling human performance on natural language understanding benchmarks. To understand the humble beginnings of LLMs, this article explains the first LLMs that were developed and publicly released. The charm of these models lies in the large number of tutorials for continued training and fine-tuning, so that they cover a wide range of NLP tasks.

This article covers the first large-scale LLMs from the timespan 2018-02 to 2020-06. For each model, you will learn its architecture, the pretraining approach and material, and the fine-tuning training steps and resulting benchmark results. The following LLMs are covered in-depth:

GPT-1
BERT
RoBERTa
DistilBERT
BART
XLNet
T5

This article originally appeared at my blog admantium.com.

GPT-1

GPT-1
Date	2018-02
Producer	OpenAI
Source	Improving Language Understanding by Generative Pre-Training, wikipedia.org
---------	-----

The GPT model family introduced a novelty to training LLMs: a training process that consists of two stages. First, the pre-training stage is unsupervised learning on unlabeled data with a language modelling task. Second, the fine-tuning stage adjusts the model's parameters to excel at specific target tasks. This approach is reflected in GPT's name, Generative Pre-Trained Transformer, and it defined a new standard approach for large language modelling.

The model is a decoder-only transformer with 12 layers. Each layer has 12 masked self-attention heads over 768-dimensional states and is connected to a feed-forward network with 3072 dimensions. This results in 117M parameters.
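
The 117M figure can be roughly reproduced from these dimensions. The following sketch counts weights and biases under a few assumptions that are not stated above: 12 transformer blocks, a byte-pair vocabulary of about 40,478 tokens, 512 learned position embeddings, and an output layer that reuses the token embedding matrix. The function names are made up for this example.

```python
def block_params(d_model, d_ff):
    """Weights and biases of one transformer block."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                # scale and shift for two norms
    return attention + feed_forward + layer_norms

def decoder_only_params(n_layers, d_model, d_ff, vocab_size, n_positions):
    """Total parameters, assuming the output layer shares the token embeddings."""
    embeddings = vocab_size * d_model + n_positions * d_model
    return n_layers * block_params(d_model, d_ff) + embeddings

# Assumed values: 12 blocks, 40,478 vocabulary entries, 512 positions.
total = decoder_only_params(n_layers=12, d_model=768, d_ff=3072,
                            vocab_size=40478, n_positions=512)
print(f"{total / 1e6:.1f}M parameters")  # prints 116.5M parameters
```

Roughly three quarters of the parameters sit in the transformer blocks, and the rest are embeddings.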

In the pre-training stage, the model was trained to maximize the likelihood of each token given the preceding tokens in its context window. This training was done on the Toronto Book Corpus (800M words), which consists of long stretches of contiguous text. The paper contrasts it with the similarly sized 1B Word Benchmark (1B words), which is shuffled at sentence level and therefore loses long-range structure.

The fine-tuning stage consisted of three different tasks in which labeled data was used. These tasks are:

Textual entailment: Given a premise and a hypothesis, does the premise entail or contradict the hypothesis, or are the two texts not related at all? The developers concatenated the premise and the hypothesis with the delimiter token "$". Datasets: SNLI, MultiNLI, Question NLI, RTE, SciTail.
Sentence similarity: Determine the semantic similarity of two texts. Because the two texts have no inherent order, both were concatenated with a delimiter in each of the two possible orders. Datasets: MSR Paraphrase Corpus, Quora Question Pairs, and STS Benchmark.
Question answering and commonsense reasoning: These datasets consist of a document, a question, and a set of possible answers. The document and question were concatenated with each possible answer, with the delimiter between the question and the answer. Datasets: RACE and Story Cloze.
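
The input transformations for these three tasks can be sketched as follows. This is an illustration, not the original preprocessing code: only the "$" delimiter is named above, while the start and end markers, the whitespace tokenization, and all function names are assumptions.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"  # only "$" is named above; the others are assumed

def pair_input(first, second):
    """Join two texts into one token sequence around the delimiter."""
    return [START, *first.split(), DELIM, *second.split(), EXTRACT]

def entailment_input(premise, hypothesis):
    return pair_input(premise, hypothesis)

def similarity_inputs(text_a, text_b):
    """Both orderings, because sentence similarity has no inherent order."""
    return [pair_input(text_a, text_b), pair_input(text_b, text_a)]

def multiple_choice_inputs(document, question, answers):
    """One sequence per candidate answer, each scored separately by the model."""
    context = f"{document} {question}"
    return [pair_input(context, answer) for answer in answers]

print(entailment_input("A man plays guitar", "A person makes music"))
```

The point of this design is that the pre-trained model itself stays unchanged: every task is reduced to one or more plain token sequences.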

After fine-tuning, the following scores were achieved:

Benchmark	Value Type	GPT-1 Score
MNLI-m	ACC	82.1
MNLI-mm	ACC	81.4
SNLI	ACC	89.9
SciTail	ACC	88.3
QNLI	ACC	88.1
RTE	ACC	56.0
Story Cloze	ACC	86.5
RACE	ACC	59.0
-----------	-----	----

BERT

BERT
Date	2018-10
Producer	Google AI
Source	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, wikipedia.org
---------	-----

BERT is an encoder-only LLM originating from Google in 2018. Two versions exist. BERT base has 12 layers, 12 attention heads with 768 hidden dimensions, and a feed-forward network with 3072 dimensions, for a total of 110M parameters. BERT large has 24 layers, 16 attention heads with 1024 hidden dimensions, and a feed-forward network with 4096 dimensions, resulting in 340M parameters.

It was trained on the Toronto Book Corpus (800M words) and English Wikipedia (2500M words).

Only two pre-training tasks were applied: masked language modelling and next sentence prediction. In the masked language modelling task, 15% of the tokens were selected and, in most cases, replaced with the string [MASK]; the model needed to predict which word was masked. In the next sentence prediction task, two sentences were concatenated into the token sequence [CLS] sent1 [SEP] sent2. The model's task was to predict the probability that the second sentence directly follows the first in the training data.
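
As a rough illustration of how such training examples could be built, the sketch below masks 15% of a token list and assembles a sentence pair in the layout quoted above. It is a simplification with made-up function names: it always substitutes [MASK] for a selected token, and it works on whole words instead of the WordPiece tokens BERT actually uses.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens and remember the originals as labels."""
    rng = random.Random(seed)
    n_masked = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # what the model has to predict
        masked[pos] = MASK
    return masked, labels

def sentence_pair(sent1, sent2):
    """Token layout for next sentence prediction, as quoted in the text."""
    return ["[CLS]", *sent1, "[SEP]", *sent2]

words = ("the quick brown fox jumps over the lazy dog while "
         "a small bird sings in an old oak tree nearby").split()
masked, labels = mask_tokens(words)
print(masked)
print(labels)
```

With 20 input tokens, three positions are hidden. The labels dictionary maps each hidden position back to its original word, which is the training target.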

The resulting BERT models need to be fine-tuned for specific NLP tasks. The model architects chose a specific way to convert each benchmark dataset into the model's input format, and then trained the model for 3 epochs on each individual task. The paper authors reported these results:

Benchmark	Value Type	BERT base Score	BERT large Score
GLUE	AVG	79.6	82.1
SQuAD 1.1	F1	88.5	90.9
SQuAD 2.0	F1	n.a.	83.1
SWAG	ACC	n.a.	86.3
---------	-----	---------	----------

The BERT model was released as open source. Its inception heralded the large-scale application of transformers to NLP tasks. Developers and researchers used the model in several ways, and it became clear that transformer models broke the records on NLP benchmarks that had until then been led by complex, human-crafted rule systems. This showed that unsupervised learning on huge amounts of text produces models with surprising capabilities and self-learned feature representations of natural language.

RoBERTa

RoBERTa
Date	2019-09
Producer	Facebook AI
Source	RoBERTa: A Robustly Optimized BERT Pretraining Approach
---------	-----

This model enhanced the original BERT model by changing its hyperparameters and by providing different inputs during training.

Two versions of this model exist: RoBERTa base, with 12 layers (768 dimensions), 12 attention heads, and a feed-forward network with 3072 dimensions, and RoBERTa large, with 24 layers (1024 dimensions), 16 attention heads, and a feed-forward network with 4096 dimensions.

For training, the following datasets were used:

Book Corpus and all English Wikipedia articles, totaling 16GB
CC News, a filtered subset of the Common Crawl dataset (76GB)
Open Web Text, an archive of web pages linked from Reddit posts with at least three upvotes (38GB)
Stories, another Common Crawl subset, filtered to resemble the story-like style of Winograd schemas (31GB). A Winograd schema is a question about two noun phrases and an ambiguous pronoun, where the pronoun's referent needs to be determined.

As in the original BERT model, the pre-training goals are masked language modelling and next sentence prediction. For fine-tuning on GLUE and SQuAD, the same approach as in BERT was used. For SQuAD 2.0, an additional binary classifier was trained. The model was additionally fine-tuned for RACE. Its benchmark results are as follows:

Benchmark	Value Type	RoBERTa base Score
SQuAD 1.1	F1	90.4
SQuAD 2.0	F1	78.7
MNLI-m	ACC	84.0
SST-2	ACC	92.9
RACE	ACC	64.2
---------	-----	------------

DistilBERT

DistilBERT
Date	2019-10
Producer	HuggingFace
Source	DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
---------	-----

This variant of the BERT model aimed to retain the language understanding performance of the original BERT model while reducing the size of the model by 40%. To reach this goal, the technique of knowledge distillation was used: a student model is trained on a teacher model's output, learning the teacher's output probabilities and thereby approximating its results.
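
The core of knowledge distillation can be sketched as a loss that compares the student's output distribution with the teacher's. The example below shows only this soft-target part, with made-up logits, an assumed temperature of 2.0, and function names chosen for illustration. It is not the complete DistilBERT training objective.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

teacher_logits = [4.0, 1.0, 0.5]  # made-up scores over a three-token vocabulary
close_student = [3.5, 1.2, 0.4]   # nearly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # prefers a different token

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

The student that agrees with the teacher gets the lower loss. Because the targets are full probability distributions and not single labels, the student also learns which wrong answers the teacher considers plausible.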

Its architecture is a reduced version of BERT: 6 layers, 12 attention heads with 768 dimensions, and a feed-forward network with 3072 dimensions. The resulting model has 66M parameters.
The model was trained on the same corpus as BERT: Toronto Book Corpus (800M words) and English Wikipedia (2,500M words). It was then fine-tuned on GLUE tasks and datasets.

The following benchmark scores are reported:

Benchmark	Value Type	DistilBERT Score
GLUE	AVG	77.0
CoLA	ACC	51.3
MNLI	ACC	82.2
MRPC	ACC	87.5
QNLI	ACC	89.2
QQP	ACC	88.5
RTE	ACC	59.9
SST-2	ACC	91.3
STS-B	ACC	86.9
WNLI	ACC	56.3
SQuAD 1.1	F1	85.8
---------	-----	-----------

BART

BART
Date	2019-10
Producer	Facebook AI
Source	BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
---------	-----

The BART model combines an encoder and a decoder neural network. Specifically, it consists of a bidirectional encoder, which reads the complete input stream in one go to create an interconnected representation of all tokens, and an autoregressive decoder, which repeatedly produces probabilities for the best-fitting next token by considering all previous tokens, including the token that was just predicted.

The BART base model consists of 6 encoder layers and 6 decoder layers with 768 dimensions and embedded feed-forward layers. The BART large model variant doubles the encoder and decoder layers to 12 each with 1024 dimensions.

The pre-training process is unique. Random noise functions were applied to all input text, including token masking, token deletion, text infilling (replacing spans of text with a single [MASK] token), sentence permutation (shuffling sentences around), and document rotation (rotating the document so that it starts at a different token). The model then has to reconstruct the original document. There is no indication which corpus was used.
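
The five noise functions are simple to express on token lists. The sketch below takes the affected positions as explicit arguments to keep the behaviour easy to follow; in actual pre-training they would be sampled at random. All function names are made up for this illustration.

```python
import random

MASK = "[MASK]"

def token_masking(tokens, positions):
    """Replace the chosen tokens with the mask token."""
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

def token_deletion(tokens, positions):
    """Drop the chosen tokens; the model must work out where text is missing."""
    return [tok for i, tok in enumerate(tokens) if i not in positions]

def text_infilling(tokens, start, length):
    """Replace a whole span with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, rng):
    """Shuffle the order of the sentences."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, start):
    """Rotate the document so that it begins at the chosen token."""
    return tokens[start:] + tokens[:start]

tokens = "the cat sat on the mat".split()
print(text_infilling(tokens, 1, 3))   # ['the', '[MASK]', 'the', 'mat']
print(document_rotation(tokens, 2))   # ['sat', 'on', 'the', 'mat', 'the', 'cat']
```

In each case the training target is the untouched token list, so the decoder learns to undo the corruption.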

The fine-tuning stage of this model was also unique. First, a large number of different benchmarks and their corresponding datasets were used: MNLI (textual entailment), SQuAD (question answering), ELI5 (long-form question answering), ConvAI2 (dialogue response generation), and CNN/DM (text summarization). Second, different training approaches were compared that reflect a combination of noise function and training objective: language model, permuted language model, masked language model, multitask masked language model, and masked sequence-to-sequence.

The fine-tuned version of BART, trained on multitask masked language modelling, achieved the following scores. For the generation tasks, the reported value is perplexity (PPL), where lower is better:

Benchmark	Value Type	BART base Score
SQuAD 1.1	F1	89.2
MNLI	ACC	82.4
ELI5	PPL	23.73
XSum	PPL	7.5
ConvAI2	PPL	12.39
CNN/DM	PPL	6.74
---------	-----	-----

XLNet

XLNet
Date	2020-01
Producer	Google AI
Source	XLNet: Generalized Autoregressive Pretraining for Language Understanding
---------	-----

With XLNet, a new approach to pre-training was created: generalized autoregression. By training over many permutations of the order in which the input tokens are predicted, the model learns to use context from both directions while remaining autoregressive. This leads to impressive results and further enhances the understanding of transformers, paving the way for future advances.
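
One way to picture this, following the permutation language modelling idea of the XLNet paper: the text itself stays in its original order, but the order in which tokens are predicted is sampled at random. The sketch below lists, for one sampled order, which positions each prediction may look at. The function name and the toy sentence are made up, and the real model implements this with attention masks and not with explicit lists.

```python
import random

def permutation_contexts(tokens, seed=0):
    """For one sampled prediction order, list what each target may see."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    examples = []
    for step, position in enumerate(order):
        visible = sorted(order[:step])  # positions predicted earlier in this order
        context = [(p, tokens[p]) for p in visible]
        examples.append((position, tokens[position], context))
    return order, examples

tokens = ["New", "York", "is", "a", "city"]
order, examples = permutation_contexts(tokens, seed=1)
for position, token, context in examples:
    print(f"predict {token!r} at position {position} from {context}")
```

Depending on the sampled order, a token may be predicted from words to its left, to its right, or both. Averaged over many orders, every token learns from its full surrounding context.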

The base model defines 12 layers and 12 attention heads with 768 dimensions, and 2 layers of feed-forward networks with 768 and 3072 dimensions each. XLNet large has 24 layers and 16 attention heads with 1024 dimensions, followed by 2 layers of feed-forward networks with 1024 and 4096 dimensions each.

The pre-training stage used content from 5 different sources: the Book Corpus and English Wikipedia (13GB), Giga5 (16GB), ClueWeb 2012-B (19GB), and Common Crawl (110GB). The model was then fine-tuned for several NLP benchmarks with their corresponding datasets.

The model scored as follows:

Benchmark	Value Type	XLNet Score
SQuAD 1.1	F1	90.6
SQuAD 2.0	F1	89.7
CoLA	ACC	69.0
MNLI	ACC	90.8
MRPC	ACC	90.8
QNLI	ACC	94.9
QQP	ACC	92.3
RTE	ACC	85.9
SST-2	ACC	97.0
STS-B	ACC	92.5
---------	-----	-----

T5

T5
Date	2020-06
Producer	Google AI
Source	Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
---------	-----

T5 is an acronym for Text-To-Text Transfer Transformer, which describes the paradigm under which this LLM operates. By phrasing all training tasks as a general text-to-text mapping, several important improvements could be evaluated. First, datasets traditionally included in the fine-tuning stage only can be applied in the pre-training phase as well. Second, the model's hyperparameters, its loss function, etc. are the same for all training data, facilitating the creation of comparable models. Third, the model's capability to perform tasks that it has not been trained for is greatly enhanced.
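
The text-to-text idea can be illustrated with a small sketch that turns examples from different tasks into plain (input text, target text) pairs. The task prefixes follow the illustrative examples in the T5 paper; the function name, the task keys, and the field names are assumptions made for this example.

```python
def to_text_pair(task, **fields):
    """Map a task example to an (input text, target text) pair."""
    if task == "translate_en_de":
        return f"translate English to German: {fields['source']}", fields["target"]
    if task == "cola":
        # Classification labels become literal target strings.
        label = "acceptable" if fields["acceptable"] else "not acceptable"
        return f"cola sentence: {fields['sentence']}", label
    if task == "stsb":
        # Even a regression score is written out as text.
        text = f"stsb sentence1: {fields['sentence1']} sentence2: {fields['sentence2']}"
        return text, f"{fields['score']:.1f}"
    raise ValueError(f"unknown task: {task}")

print(to_text_pair("cola", sentence="The course is jumping well.", acceptable=False))
```

Because every task ends up as string in, string out, the same model, loss function, and decoding procedure can be used for translation, classification, and regression alike.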

The model's architecture is an encoder-decoder transformer whose base version is sized like BERT base: 12 encoder blocks (mirrored by 12 decoder blocks) and 12 attention heads with 768 dimensions, followed by a feed-forward network with 3072 dimensions. As with other models, several versions exist:

T5 small with 6 encoder blocks, 8 attention heads, 512 dimensions, and a feed-forward network with 2048 dimensions. This model has 60M parameters.
T5 base with 12 encoder blocks, 12 attention heads, 768 dimensions, and a feed-forward network with 3072 dimensions. This model has 220M parameters.
T5 large with 24 encoder blocks, 16 attention heads, 1024 dimensions, and a feed-forward network with 4096 dimensions. This model has 770M parameters.
T5 "3B": This variant has 32 attention heads and a feed-forward network with 16384 dimensions, resulting in 3 billion parameters.
T5 "11B": This variant has 128 attention heads and a feed-forward network with 65536 dimensions, resulting in 11 billion parameters.

The non-task specific corpus used by the model is called C4, the "Colossal Clean Crawled Corpus". It includes texts from Wikipedia as well as from Common Crawl. Input texts are aggressively filtered to remove any unwanted content.

To evaluate the model, the paper also provided different approaches: it can be used "as-is" because the typical fine-tuning training sets were already included, or it can be further fine-tuned with the derived datasets. To make the results comparable to other models presented in this article, the following table lists the performance of the T5 large model only:

Benchmark	Value Type	T5 large Score
SQuAD 1.1	F1	90.6
SQuAD 2.0	F1	93.70
CoLA	ACC	61.2
MNLI	ACC	89.9
MRPC	ACC	89.9
QNLI	ACC	94.8
QQP	ACC	89.9
RTE	ACC	87.2
SST-2	ACC	96.3
STS-B	ACC	89.2
---------	-----	--------

Conclusion

This article showed you the very start of LLM evolution. You learned the details of these models: GPT-1, BERT, RoBERTa, DistilBERT, BART, XLNet and T5, covering a timespan from 2018-02 to 2020-06. A steady evolution of architectures, training approaches and fine-tuning tasks started, yielding several trends that future models would inherit. These trends are a) generative pre-training on increasing amounts of text, which leads to a rich stochastic model of language and token meanings, b) an increasing number of encoder layers and attention heads, and c) adopting the once separate fine-tuning tasks into the pre-training step to yield models that perform well on other tasks. The next articles continue the LLM overview with newer models.
