OpenAI GPT-3 vs Specialized Models [Benchmark] - Should AI companies really be worried?

OpenAI’s objective is to create Artificial General Intelligence: “highly autonomous systems that outperform humans at most economically valuable work”.

They are planning on combining text, image and speech models.

In this article, we focus on one of their actual milestones, GPT-3. Can it really achieve state-of-the-art performance on any language task when compared to specialized models?

OpenAI’s GPT-3: OpenAI is challenging companies that sell specialized AI models (NLP models in particular) with its GPT-3 model.

Let’s ask ChatGPT (GPT-3 optimized for dialogue) what GPT-3 is.

Specialized AI Models: Many AI companies train specialized models for specific tasks and provide access to them through APIs. Among them are big tech companies (Google, Amazon, Microsoft and IBM) as well as smaller companies focused on specific tasks, such as DeepL (translation), Deepgram (speech) and Clarifai (vision).

Large Language Models like GPT-3 should be able to compete with specialized models on many Natural Language Processing (NLP) tasks without fine-tuning. This is referred to as zero-shot learning. Let's verify that!
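To make "zero shot" concrete: the model receives only a task description and the input, with no labelled examples and no fine-tuning. Here is a minimal sketch of how such a prompt could be assembled for sentiment analysis. The prompt wording and the helper names `build_zero_shot_prompt` and `parse_label` are illustrative assumptions, not the prompts used in this benchmark.

```python
from typing import Optional

# Hypothetical helpers: the prompt wording is an assumption, not the benchmark's prompt.
LABELS = ("Positive", "Negative", "Neutral")


def build_zero_shot_prompt(text: str) -> str:
    """Describe the task and give the input, with no worked examples."""
    options = ", ".join(LABELS)
    return (
        f"Classify the sentiment of the following text as one of: {options}.\n"
        f"Text: {text}\n"
        "Sentiment:"
    )


def parse_label(completion: str) -> Optional[str]:
    """Map a free-text completion back to one of the expected labels."""
    answer = completion.strip().lower()
    for label in LABELS:
        if answer.startswith(label.lower()):
            return label
    return None  # the model answered something unexpected


if __name__ == "__main__":
    print(build_zero_shot_prompt("The delivery was late again."))
    print(parse_label(" Negative.\n"))
```

A few-shot variant would simply prepend some solved examples to the same prompt, which is exactly what zero-shot evaluation avoids.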

Benchmark

To verify that, we chose four tasks (keyword extraction, sentiment analysis, language detection and translation) and benchmarked GPT-3 against state-of-the-art proprietary models from different companies.

We did this through a single API: Eden AI. Code snippets are provided for each task at the end of this article so that you can reproduce the predictions yourself on your own data.

There is an open-source version of Eden AI that you can ⭐ find on GitHub ⭐ as a Python module!

Please refer to it for more information on the different APIs used in the benchmark.

Language Detection:

Language detection is simply the task of identifying the language in which a text is written.

1/ Dataset: We used a dataset from Hugging Face covering 20 languages: Arabic (ar), Bulgarian (bg), German (de), Modern Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi) and Chinese (zh).

2/ Evaluation
We compare OpenAI to Google (GCP), Amazon (AWS) and IBM.
We took a few hundred examples and evaluated each model's performance on them using the accuracy metric.
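Accuracy is the share of examples for which the predicted label equals the reference label. The sketch below shows the computation on language codes; the function name and the case-insensitive comparison are our own assumptions for illustration, not necessarily the exact normalization used in the benchmark.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty dataset")
    # Assumption: labels are compared case-insensitively, ignoring surrounding spaces.
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    predicted = ["en", "fr", "de", "es"]
    expected = ["en", "fr", "nl", "es"]
    print(accuracy(predicted, expected))  # 3 correct out of 4 -> 0.75
```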

3/ Results
The results are shown below, with OpenAI ranking 3rd out of the four AI providers we chose.

Amazon 96%
Google 95%
OpenAI 89%
IBM 87%

Sentiment Analysis:

This task is about understanding the sentiment of the writer of a specific piece of text. It can be Positive, Negative or Neutral.

1/ Dataset: Most of the datasets we found did not include a “neutral” sentiment, except for the Twitter Sentiment Analysis dataset from Kaggle.

2/ Evaluation:
We compare OpenAI to the APIs of Google (GCP), Amazon (AWS) and IBM, using the accuracy metric.

3/ Results:
As shown below, OpenAI once again takes 3rd place.

Amazon 76%
Google 66%
OpenAI 61%
IBM 56%

Keyword Extraction:

Keyword or Keyphrase Extraction is about being able to extract the words or phrases that most represent a given text.

1/ Dataset: We selected our dataset from the public GitHub repository AutomaticKeyphraseExtraction.
Most of the datasets listed there contain documents too long for OpenAI's 4k token limit, so we went with the Hulth2003 abstracts dataset.
Since the different providers are trained to return keywords and keyphrases present in the original text, we cleaned the data by removing all reference keywords that were not present in the abstracts. We ended up with 470 abstracts.
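The cleaning step can be sketched as follows. We assume here that a keyword counts as "present" when it appears as a case-insensitive substring of the abstract; the article does not specify the matching rule, and the function name is hypothetical.

```python
def keep_present_keywords(abstract, keywords):
    """Keep only the reference keywords that literally occur in the abstract."""
    haystack = abstract.lower()
    # Assumption: presence means case-insensitive substring match.
    return [
        kw for kw in keywords
        if kw.strip() and kw.strip().lower() in haystack
    ]


if __name__ == "__main__":
    abstract = "Neural networks for image classification"
    keywords = ["neural networks", "Image classification", "deep learning"]
    print(keep_present_keywords(abstract, keywords))
    # ['neural networks', 'Image classification']
```

A stricter variant could match on word boundaries or on stemmed tokens, which would change which keywords survive the cleaning.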

2/ Evaluation
We compare OpenAI to Microsoft, Amazon and IBM.
We measured their performance using the average precision metric.
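One plausible reading of this metric, shown here as an assumption rather than the exact benchmark code, is the precision of each document's predicted keywords (the share of predictions found in the reference keywords), averaged over all documents. The function names are hypothetical.

```python
def precision(predicted, gold):
    """Share of predicted keywords that appear in the reference keywords."""
    predicted_set = {kw.strip().lower() for kw in predicted if kw.strip()}
    gold_set = {kw.strip().lower() for kw in gold if kw.strip()}
    if not predicted_set:
        return 0.0  # no prediction: nothing correct was returned
    return len(predicted_set & gold_set) / len(predicted_set)


def mean_precision(all_predicted, all_gold):
    """Average the per-document precision over the whole dataset."""
    scores = [precision(p, g) for p, g in zip(all_predicted, all_gold)]
    if not scores:
        raise ValueError("cannot average over an empty dataset")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    predicted = [["a", "b"], ["x"]]
    gold = [["a"], ["x"]]
    print(mean_precision(predicted, gold))  # (0.5 + 1.0) / 2 -> 0.75
```

Under this reading, a provider that returns many loosely related phrases is penalized even if it also finds every reference keyword.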

3/ Results: This time, OpenAI's GPT-3 ranked last.

1/ Microsoft 0.6513
2/ IBM 0.6022
3/ Amazon 0.4955
4/ OpenAI 0.2599

Translation:

Translation is the task of automatically translating a text from a language A to a language B.

1/ Dataset: We chose a dataset from the Tatoeba Translation Challenge, published by the Language Technology Research Group at the University of Helsinki.
We took 100 examples from each of five European language pairs (deu-fra, eng-fra, fra-ita, deu-spa, deu-swe), which makes a 500-example test dataset.
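Assembling such a test set could look like the sketch below. It assumes the examples are already loaded in memory as (source, reference) tuples per language pair and that the 100 examples are drawn at random with a fixed seed; the article does not say how they were selected, and `build_test_set` is a hypothetical helper.

```python
import random

PAIRS = ["deu-fra", "eng-fra", "fra-ita", "deu-spa", "deu-swe"]


def build_test_set(examples_by_pair, n_per_pair=100, seed=0):
    """Draw n_per_pair (source, reference) examples from every language pair."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    test_set = []
    for pair in PAIRS:
        source_language, target_language = pair.split("-")
        for source, reference in rng.sample(examples_by_pair[pair], n_per_pair):
            test_set.append({
                "pair": pair,
                "source_language": source_language,
                "target_language": target_language,
                "source": source,
                "reference": reference,
            })
    return test_set


if __name__ == "__main__":
    # Placeholder data standing in for the real Tatoeba examples.
    fake = {p: [(f"src {i}", f"ref {i}") for i in range(150)] for p in PAIRS}
    print(len(build_test_set(fake)))  # 5 pairs x 100 examples -> 500
```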

2/ Evaluation
We compare OpenAI to DeepL, ModernMT, NeuralSpace, Amazon (AWS) and Google (GCP).
Many metrics exist for automatic machine translation evaluation. We chose COMET by Unbabel (wmt21-comet-da), which is based on a machine learning model trained to reach state-of-the-art levels of correlation with human judgements (read more in their paper).

3/ Results:
The scores are not interpretable on their own but can be used to rank machine translation models. Here again, OpenAI ranks last.

DeepL: 0.1900
ModernMT: 0.1779
Amazon: 0.1648
NeuralSpace: 0.1631
Google: 0.1628
OpenAI: 0.1593
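A ranking like this one can be derived from per-sentence scores by averaging them per provider and sorting in descending order. The sketch below assumes the per-sentence scores have already been computed by the metric; the provider names and values in the example are toy data, not benchmark results, and `rank_providers` is a hypothetical helper.

```python
from statistics import mean


def rank_providers(segment_scores):
    """Average each provider's per-sentence scores and sort best first."""
    system_scores = {
        provider: mean(scores) for provider, scores in segment_scores.items()
    }
    return sorted(system_scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy scores for illustration only.
    toy_scores = {
        "provider_a": [0.1, 0.3],
        "provider_b": [0.5, 0.1],
        "provider_c": [0.0, 0.1],
    }
    for provider, score in rank_providers(toy_scores):
        print(f"{provider}: {score:.4f}")
```

Because the gaps between providers are small, it is worth checking that a ranking stays stable across different samples before drawing strong conclusions from it.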

Conclusion

OpenAI's GPT-3 results are quite impressive. We’re getting closer to general NLP models that can handle multiple tasks zero-shot, without fine-tuning.

But in practice, for the moment, if you need an API for one of the tasks presented in this benchmark, OpenAI's GPT-3 would not be the best choice.
Besides not being the best-performing model, its 4k input token limit can be problematic for reasonably long texts.

We still need to closely watch the new models OpenAI is working on. As Sam Altman mentioned in an interview, they are implementing a continuous learning approach, which would let their model constantly improve by feeding on the internet.

They are also planning on unifying their models to deal with multiple input types, which would result in a single model capable of analyzing any type of data.

Code Snippets:

Language Detection:

import requests

# Replace YOUR_API_KEY with your own Eden AI API key.
headers = {"Authorization": "Bearer YOUR_API_KEY"}

url = "https://api.edenai.run/v2/translation/language_detection"
payload = {"providers": "google,amazon,openai,ibm", "text": "this is a test"}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop early on HTTP errors such as an invalid key

# The JSON response is keyed by provider name.
result = response.json()
print(result["google"])
print(result["amazon"])
print(result["ibm"])
print(result["openai"])

Sentiment Analysis:

import requests

# Replace YOUR_API_KEY with your own Eden AI API key.
headers = {"Authorization": "Bearer YOUR_API_KEY"}

url = "https://api.edenai.run/v2/text/sentiment_analysis"
payload = {"providers": "google,amazon,ibm,openai", "language": "en", "text": "this is a test"}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop early on HTTP errors such as an invalid key

# The JSON response is keyed by provider name.
result = response.json()
print(result["google"])
print(result["amazon"])
# ...

Keyword Extraction:

import requests

# Replace YOUR_API_KEY with your own Eden AI API key.
headers = {"Authorization": "Bearer YOUR_API_KEY"}

url = "https://api.edenai.run/v2/text/keyword_extraction"
payload = {"providers": "microsoft,amazon,ibm,openai", "language": "en", "text": "this is a test of Eden AI"}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop early on HTTP errors such as an invalid key

# The JSON response is keyed by provider name.
result = response.json()
print(result["microsoft"])
print(result["amazon"])
# ...

Translation:

import requests

# Replace YOUR_API_KEY with your own Eden AI API key.
headers = {"Authorization": "Bearer YOUR_API_KEY"}

url = "https://api.edenai.run/v2/translation/automatic_translation"
payload = {
    "providers": "deepl,modernmt,neuralspace,amazon,google,openai",
    "source_language": "en",
    "target_language": "fr",
    "text": "this is a test",
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop early on HTTP errors such as an invalid key

# Each provider entry holds the translated string under the "text" key.
result = response.json()
print(result["deepl"]["text"])
print(result["google"]["text"])
# ...
