Yash Jivani
Posted on June 25, 2024
Welcome to the era of Large Language Models (LLMs). If you've heard terms like "benchmarks" or "evaluations" and wondered what they mean and how to interpret them in the context of LLMs, you're in the right place. Let's break these concepts down.
Benchmarks
Benchmarks refer to standardized tests or tasks in the form of datasets used to evaluate and compare the performance of different models. These benchmarks often include various language understanding and generation tasks, such as text completion, question answering, and summarization. Benchmarks provide a measure of how well an LLM performs compared to others. Some of the benchmarks used to evaluate LLMs are -
- The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks
- MMLU (Massive Multitask Language Understanding) is a benchmark to measure knowledge acquired during pretraining. It covers 57 subjects across STEM, the humanities, the social sciences, and more
- HumanEval focuses on whether the LLM's generated code works as intended

Detailed information can be found in [2] and [3].
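To make the idea concrete, here is a minimal sketch of how a multiple-choice benchmark such as MMLU is typically scored: the model picks an option for each question, and the benchmark score is simply the percentage of correct answers. The sample questions and the `ask_model` function below are hypothetical placeholders, not part of any official benchmark harness.

```python
# Minimal sketch of benchmark-style scoring: accuracy on MMLU-like
# multiple-choice questions. `ask_model` is a hypothetical stand-in
# for whatever LLM API you actually call.
mmlu_sample = [
    {"question": "What is the time complexity of binary search?",
     "choices": ["A. O(n)", "B. O(log n)", "C. O(n log n)", "D. O(1)"],
     "answer": "B"},
    {"question": "Which planet is known as the Red Planet?",
     "choices": ["A. Venus", "B. Jupiter", "C. Mars", "D. Saturn"],
     "answer": "C"},
]

def ask_model(question: str, choices: list[str]) -> str:
    """Hypothetical LLM call: returns the letter of the chosen option."""
    return "B"  # placeholder answer for illustration only

correct = sum(
    ask_model(item["question"], item["choices"]) == item["answer"]
    for item in mmlu_sample
)
accuracy = 100 * correct / len(mmlu_sample)
print(f"MMLU-style accuracy: {accuracy:.1f}%")
```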
Evaluations
Evaluation refers to measuring and assessing a model's performance and effectiveness: how accurately the model can predict or generate the next word in a sentence, understand context, summarize data, and respond to queries. Evaluation is crucial because it helps determine the model's strengths and weaknesses and provides insight into areas for improvement. There are two different ways to compute metric scores.
Statistical Scorer
These are purely number-based scorers, i.e. they don't take semantics into account. Some of these are -
a. BLEU (BiLingual Evaluation Understudy) evaluates the output of an LLM application against annotated ground truths by calculating precision for each matching n-gram between the actual and predicted outputs.
b. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is primarily used for evaluating text summaries from NLP models and calculates recall by comparing the overlap of n-grams between LLM outputs and expected outputs (a short example of both scorers follows below).
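Here is a rough sketch of computing both scores. It assumes the `nltk` and `rouge-score` packages are installed (`pip install nltk rouge-score`); these particular libraries are my choice for illustration, not something prescribed by the metrics themselves.

```python
# Sketch of BLEU and ROUGE scoring for a single prediction/reference pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"          # annotated ground truth
prediction = "the cat is sitting on the mat"  # LLM output

# BLEU: n-gram precision of the prediction against the reference.
bleu = sentence_bleu(
    [reference.split()],
    prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: n-gram overlap, reported here as recall against the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 recall: {rouge['rouge1'].recall:.3f}")
```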
Model-Based Scorer
These are NLP-based scorers that take semantics into account. Some of these are -
a. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers), often used for machine translation, uses pre-trained models like BERT to score LLM outputs against expected outputs.
b. NLI (Natural Language Inference) uses an NLP classification model to classify whether an LLM output is logically consistent (entailment), contradictory (contradiction), or unrelated (neutral) with respect to a given reference text (see the sketch below).
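Below is a hedged sketch of an NLI-style check using Hugging Face's `transformers` library; `roberta-large-mnli` is one commonly used NLI model, chosen here purely for illustration, and the exact output format can vary between library versions.

```python
# Sketch of NLI-based consistency checking between a reference text
# and an LLM output, using a pre-trained NLI classifier.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

reference = "The meeting was moved to Friday afternoon."
llm_output = "The meeting now takes place on Friday."

# The classifier sees the reference as the premise and the LLM output as
# the hypothesis, and predicts ENTAILMENT, NEUTRAL, or CONTRADICTION.
result = nli({"text": reference, "text_pair": llm_output})
print(result)  # e.g. a label such as ENTAILMENT with a confidence score
```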
Detailed information can be found in [4].
After undergoing a benchmark's evaluation, models are usually awarded a score from 0 to 100. These are the numbers companies typically publish alongside their LLM to compare it against other models evaluated on the same benchmark.
References