
Gajesh Naik

Posted on October 4, 2020

GPT-3 Explained

Hello, Readers! OpenAI, the California-based AI research lab co-founded by Elon Musk, Sam Altman, Greg Brockman, and a few other leaders in ML, recently released an API and website that allow people to access its new language model, GPT-3.

For comparison, the previous version, GPT-2, had 1.5 billion parameters, while GPT-3 has 175 billion. The largest previous Transformer-based language model, released by Microsoft earlier in 2020, had 17 billion parameters.

“GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic,” the researchers stated in their paper. “We find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans,” they add.

Natural language processing tasks range from generating news articles, to language translation, to answering standardized test questions.

OpenAI trains all of their AI models on the cuDNN-accelerated PyTorch deep learning framework.

Recall also that earlier in 2020, Microsoft and OpenAI announced a new GPU-accelerated supercomputer built exclusively for OpenAI.

The original GPT, and GPT-2, are both adaptations of what’s known as a Transformer, an invention pioneered at Google in 2017. The Transformer uses a function called attention to calculate the probability that a word will appear given surrounding words. OpenAI caused controversy a year ago when it said it would not release the biggest version of GPT-2, arguing that the model could fall into the wrong hands and be abused to mislead people with things such as fake news.

GPT-3 is essentially a context-based generative AI. This means that when the AI is given some sort of context, it then tries to fill in the rest. If you give it the first half of an essay, it will generate the rest of the essay.
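
If you have API access, the interaction is straightforward. Here is a minimal sketch using the openai Python client that shipped with the API; the engine name, prompt, and sampling parameters are just illustrative choices, not recommendations.

```python
import os
import openai

# Context-based completion through the OpenAI API (requires an API key).
openai.api_key = os.environ["OPENAI_API_KEY"]

# The context we give the model: the opening of an essay.
prompt = (
    "The Industrial Revolution reshaped how people worked and lived. "
    "In this essay, I will argue that"
)

# GPT-3 treats the prompt as context and generates a likely continuation.
response = openai.Completion.create(
    engine="davinci",   # largest base engine exposed by the 2020 API
    prompt=prompt,
    max_tokens=128,     # length of the continuation to generate
    temperature=0.7,    # sampling randomness
)

print(response.choices[0].text)
```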

“The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server,” the companies stated in a blog post.

Let’s see the Key Takeaways of GPT-3
GPT-3 shows that language model performance scales as a power-law of model size, dataset size, and the amount of computation.
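
To make that concrete, here is a small illustration of what a power-law relationship implies: each 10x increase in scale improves the loss by the same constant factor. The constants below are made up for the example and are not the fitted values from the paper.

```python
# What "scales as a power law" means: with loss = a * N**(-b), every 10x
# increase in model size N cuts the loss by the same constant factor.
# a and b here are made-up constants for illustration, not fitted values.
a, b = 10.0, 0.08

for n_params in (1e8, 1e9, 1e10, 1e11):
    loss = a * n_params ** (-b)
    print(f"{n_params:.0e} params -> loss {loss:.2f}")
```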

GPT-3 demonstrates that a language model trained on enough data can solve NLP tasks it has never explicitly been trained on. That is, the paper studies the model as a general-purpose solution for many downstream tasks, without fine-tuning.

The cost of AI is increasing exponentially. Training GPT-3 would cost over $4.6M using Tesla V100 cloud instances.

The size of state-of-the-art (SOTA) language models is growing by at least a factor of 10 every year, outpacing the growth of GPU memory. For NLP, the days of “embarrassingly parallel” training are coming to an end; model parallelism will become indispensable.

Although there is a clear performance gain from increasing model capacity, it is not clear what is really going on under the hood. In particular, it remains an open question whether the model has learned to reason or simply memorizes training examples in a more sophisticated way.

In terms of performance, GPT-3 achieves near-SOTA results on the COPA and ReCoRD subtasks of the SuperGLUE benchmark, which was introduced last year to test reasoning and other advanced NLP tasks, but it falls short on word-in-context analysis (WiC) and on RACE, a set of middle and high school exam questions.

“Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems,” the organization said.

The simplest way to explain how it works is that it analyzes a massive sample of text from the internet and learns to predict what word comes next in a sentence given the prior context. Based on the context you give it, it responds with what it believes is statistically the most likely continuation, based on everything it learned from all that text. Language models have steadily built up to the point where a model like GPT-3 can complete several paragraphs or more.
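
Here is a toy sketch of that next-word (next-token) prediction objective in PyTorch, with the Transformer layers omitted. It only illustrates the idea; it is not OpenAI's training code.

```python
import torch
import torch.nn.functional as F

# Tiny stand-in model: an embedding table and an output projection.
vocab_size, d_model = 1000, 64
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

# A batch of token ids, e.g. an encoded sentence (values here are arbitrary).
tokens = torch.randint(0, vocab_size, (1, 16))

# Inputs are all tokens except the last; targets are the same sequence
# shifted left by one, so position t must predict token t+1.
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden = embed(inputs)    # a real model would apply Transformer layers here
logits = lm_head(hidden)  # shape: (batch, seq_len - 1, vocab_size)

# Standard language-modeling loss: cross-entropy on the next token.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(f"next-token prediction loss: {loss.item():.3f}")
```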

Like GPT-2 and other Transformer-based programs, GPT-3 is trained on the Common Crawl data set, a corpus of almost a trillion words of text scraped from the Web. “The dataset and model size is about two orders of magnitude larger than those used for GPT-2,” the authors write.

GPT-3, with its 175 billion parameters, is able to achieve what the authors describe as “meta-learning.” Meta-learning means that the GPT neural net is not re-trained to perform a task such as sentence completion. Given an example of the task in its prompt, such as an incomplete sentence together with its completion, GPT-3 will proceed to complete any new incomplete sentence it’s given.
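
In practice, this meta-learning takes the form of a few-shot prompt: the task is specified entirely inside the context, with no weight updates. The word-unscrambling examples below are my own, modeled on the tasks described in the paper.

```python
# Few-shot prompting: the model is shown the task pattern in the prompt and
# is expected to continue it. These example pairs are made up for illustration.
few_shot_prompt = (
    "Unscramble the letters into a word.\n"
    "pplea -> apple\n"
    "nabana -> banana\n"
    "ryrche -> cherry\n"
    "kys -> "
)

# Sending this prompt to GPT-3 (e.g. via the completion call sketched earlier)
# should yield "sky": the model infers the task from the in-context examples.
print(few_shot_prompt)
```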

GPT-3 comes in eight sizes, ranging from 125M to 175B parameters. The largest GPT-3 model is an order of magnitude larger than the previous record-holder, T5-11B. The smallest GPT-3 model is roughly the size of BERT-Base and RoBERTa-Base.

All GPT-3 models use the same attention-based architecture as their GPT-2 predecessor. The smallest GPT-3 model (125M) has 12 attention layers, each with 12 heads of 64 dimensions. The largest GPT-3 model (175B) uses 96 attention layers, each with 96 heads of 128 dimensions.
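
Those numbers are internally consistent: the hidden size is heads times head dimension, and the common rule of thumb of roughly 12 × layers × hidden² weights per Transformer (ignoring embeddings) lands close to the quoted model sizes. The check below is an approximation, not an official parameter breakdown.

```python
# Rough sanity check of the quoted sizes. The 12 * n_layers * d_model**2
# rule of thumb counts only the Transformer block weights (attention + MLP)
# and ignores the embedding matrix, so the small model comes out a bit low.
configs = {
    "GPT-3 Small (125M)": dict(n_layers=12, n_heads=12, head_dim=64),
    "GPT-3 175B":         dict(n_layers=96, n_heads=96, head_dim=128),
}

for name, cfg in configs.items():
    d_model = cfg["n_heads"] * cfg["head_dim"]        # hidden size = heads * head dim
    approx_params = 12 * cfg["n_layers"] * d_model ** 2
    print(f"{name}: d_model={d_model}, ~{approx_params / 1e9:.1f}B parameters")
```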

GPT-3 expanded the capacity of GPT-2 by two orders of magnitude without significant modification of the model architecture: just more layers, wider layers, and more data to train on.

Since neural networks are, in effect, compressed and compiled versions of their training data, the size of the dataset has to scale with the size of the model. GPT-3 175B is trained on 499 billion tokens.

We will have to wait for OpenAI to reveal more details about the training infrastructure and model implementation. But to put things into perspective, the GPT-3 175B model required 3.14E23 FLOPs of compute for training. Even at a theoretical 28 TFLOPS for a V100 and the lowest three-year reserved cloud pricing we could find, a single training run would take 355 GPU-years and cost $4.6M. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take 665 years.
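
Those figures are easy to verify with back-of-the-envelope arithmetic; the implied hourly V100 price at the end is my own inference, not a number from the paper.

```python
# Back-of-the-envelope check of the GPU-time figures quoted above.
total_flops = 3.14e23                  # training compute quoted for GPT-3 175B
seconds_per_year = 365 * 24 * 3600

v100_flops = 28e12                     # theoretical V100 throughput
rtx8000_flops = 15e12                  # assumed RTX 8000 throughput

v100_years = total_flops / v100_flops / seconds_per_year
rtx8000_years = total_flops / rtx8000_flops / seconds_per_year
print(f"V100:     ~{v100_years:.0f} GPU-years")    # matches the ~355 above
print(f"RTX 8000: ~{rtx8000_years:.0f} years")     # matches the ~665 above

# The $4.6M estimate then implies an effective reserved V100 price of about:
print(f"~${4.6e6 / (v100_years * 365 * 24):.2f} per GPU-hour")
```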

Time is not the only constraint. The 175 billion parameters require 700 GB of memory to store in FP32, one order of magnitude more than the maximum memory of a single GPU. To train the larger models without running out of memory, the OpenAI team uses a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs as part of a high-bandwidth cluster provided by Microsoft.
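
The 700 GB figure follows directly from the parameter count; the comparison with a 32 GB V100 below is my own addition.

```python
# Why the weights alone do not fit on one GPU: FP32 uses 4 bytes per parameter.
n_params = 175e9
weight_bytes = n_params * 4
print(f"weights in FP32: {weight_bytes / 1e9:.0f} GB")   # 700 GB

# For comparison (my own addition): a 32 GB V100 holds only a fraction of that,
# and that is before optimizer state and activations are even considered.
print(f"that is ~{weight_bytes / 1e9 / 32:.0f}x the memory of a 32 GB V100")
```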

GPT-3 is able to learn how to do a task from a single prompt, in some cases better than versions of the Transformer that have been fine-tuned to perform only that task. GPT-3 is thus a triumph of overarching generality: feed it an enormous amount of text until its weights are tuned, and it can go on to perform well on a number of specific tasks with no further development.

That’s where the story comes to a striking denouement in the new paper. After listing off the impressive results of GPT-3 on language tasks ranging from completing sentences to inferring the logical entailment of statements to translating between languages, the authors note the shortcomings.

“Despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses,” the authors say.

Those weaknesses include an inability to achieve significant accuracy on what’s called Adversarial NLI. NLI, or natural language inference, is a test where the program must determine the relationship between two sentences. Researchers from Facebook and the University of North Carolina have introduced an adversarial version, where humans create sentence pairs that are hard for the computer to solve.

GPT-3 does “little better than chance” on tasks like Adversarial NLI, the authors write. Worse, having amped up the capacity of their system to 175 billion weights, the authors are not exactly sure why it comes up short on some tasks.

That’s when they come to the conclusion, cited above, that perhaps simply feeding an enormous corpus of text to a gigantic machine is not the ultimate answer.

Even more startling is the next observation. The whole practice of trying to predict what’s going to happen with language may be the wrong approach, the authors write. They may be aiming in the wrong place.

“With self-supervised objectives, task specification relies on forcing the desired task into a prediction problem,” they write, “whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions.”

The authors leave it for another time to specify how they’ll take on this rather fascinating potential new direction.

Thanks, readers, for reading this article on GPT-3. Do check out the demos of GPT-3 on YouTube, and also check out my other blogs on Tech With Gajesh.
