A problem we commonly face at Axyom is deciding which LLM to use for a given downstream task. While there is a wide range of evaluations, it’s not always clear which ones to look at or which will be relevant to your task.
In this blog post, I will cover a range of methods by which LLMs and downstream applications can be evaluated. The goal is not to catalogue specific benchmarks or metrics but to discuss the common methods undergirding them, and to provide the information you need to make an informed decision rather than to draw conclusions for you.
Human Evaluation
Human evaluation passes language model output to human evaluators to rate. Essentially, they take a small survey and decide how much they like the output. Sometimes this judgment relies on common sense, such as when rating fluency or deciding whether text is offensive. Other times a reference answer is provided to the evaluators as well, especially when they may not know the correct answer themselves.
Human evaluation is often done through crowdsourcing microwork platforms like Amazon’s Mechanical Turk, and it requires no special machine setup. Raters are typically asked to score outputs on Likert-like scales, answer binary questions, or identify specific parts of a response.
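As a concrete illustration, here is a minimal sketch of how ratings collected from such a survey might be aggregated. The field names, the 1-to-5 scale, and the example data are hypothetical choices for illustration, not the output format of any particular platform.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw ratings: (response_id, rater_id, score on a 1-5 Likert scale).
ratings = [
    ("resp_1", "rater_a", 4), ("resp_1", "rater_b", 5), ("resp_1", "rater_c", 4),
    ("resp_2", "rater_a", 2), ("resp_2", "rater_b", 3), ("resp_2", "rater_c", 2),
]

# Group scores by the model response being judged.
by_item = defaultdict(list)
for item_id, _, score in ratings:
    by_item[item_id].append(score)

for item_id, scores in by_item.items():
    # Mean Likert score plus the spread, a crude check on rater consistency.
    print(f"{item_id}: mean={mean(scores):.2f}, min={min(scores)}, max={max(scores)}")
```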
Pros:
- Highest accuracy: Humans are adept at understanding context and nuances, ensuring alignment with desired objectives.
- Diverse metrics: Human evaluation allows for a wide range of qualitative and quantitative metrics.
Cons:
- Subjectivity: Different evaluators may have varying preferences and opinions, which can affect consistency, particularly in assessing potentially offensive content.
- Cost: Paying human evaluators is significantly more expensive than using automated methods.
- Speed: Manual reviews are slower compared to machine-based evaluations.
Programmatic Evaluation/Unit Testing (HumanEval Benchmark)
For evaluating code, programmatic methods such as unit testing check whether a generated program correctly performs the specified task. The popular HumanEval benchmark runs LLM-generated code in a sandbox against a specified set of unit tests and/or expected answers. If the generated code passes, we count it as a win for the LLM; if not, more work needs to be done. Code is usually sampled several times per problem: in deployment the highest-scoring sample can be kept, while benchmarks such as HumanEval report pass@k, the chance that at least one of k samples passes all the tests.
More broadly, similar approaches can be applied to test AI’s ability to interface with tools or systems. We can create small automated programs that the LLM can interface with and check whether it is able to complete specific tasks.
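To make the pass/fail loop concrete, here is a minimal sketch. It assumes the model’s output arrives as a string defining a single function and that the test cases are plain input/output pairs; `passes_tests` and the `add` example are hypothetical, and a real harness (HumanEval’s included) additionally isolates execution with sandboxing and timeouts.

```python
# Minimal sketch of unit-test-based grading for generated code.
# WARNING: exec() on untrusted model output is unsafe; real harnesses run
# candidates in an isolated sandbox with timeouts and process isolation.

def passes_tests(generated_code: str, func_name: str, test_cases: list[tuple]) -> bool:
    namespace: dict = {}
    try:
        exec(generated_code, namespace)          # Define the candidate function.
        candidate = namespace[func_name]
        return all(candidate(*args) == expected  # Every test case must pass.
                   for args, expected in test_cases)
    except Exception:
        return False                             # Any crash counts as a failure.

# Hypothetical model output and tests for illustration.
generated = "def add(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((-1, 1), 0)]
print(passes_tests(generated, "add", tests))     # True if all tests pass.
```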
Pros:
- Naturalistic Setting: Provides a realistic evaluation environment where the AI must perform tasks similar to real-world applications.
- Automation: Allows for fully automated testing, reducing the need for human intervention.
- Task-Specific Efficiency: Works exceptionally well for evaluating specific tasks where outcomes are clear-cut and measurable.
Cons:
- Task Dependency: This method is highly dependent on the type of task and works best for well-defined, specific tasks such as coding.
- Specialized Requirements: Non-coding evaluations may require specialized programs or tools, adding complexity to the evaluation setup.
- Scope Limitations: May not be suitable for more general or open-ended tasks where outputs are not easily validated through unit tests.
Elo Rating System (Chatbot Arena)
Regardless of the testing program, some common flaws remain. Any fixed dataset we choose can be gamed, perhaps not deliberately but implicitly. To account for this, a separate evaluation dataset is ideally held secret and not used until we are ready to ship the product. Public leaderboards face the same problem: beyond explicit training on the test set, they let LLM authors cherry-pick models that happen to perform well on it, setting up an unfortunate regression to the mean in practice.
Because of these weaknesses, Chatbot Arena was created to formalize side-by-side comparison of chat-based language models. Users of the site pose a question to two anonymized LLMs and then vote on which one answered better. From these head-to-head results an Elo score can be computed and used to compare chatbots across a variety of contexts.
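As a sketch of how pairwise votes become ratings, the snippet below applies the standard Elo update rule. The K-factor of 32, the 1000 starting rating, and the model names are illustrative choices, and production leaderboards may fit ratings differently.

```python
# Minimal online Elo update from head-to-head votes.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    # Winner gains, loser loses, in proportion to how surprising the result was.
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

ratings = {"model_x": 1000.0, "model_y": 1000.0}
# Hypothetical votes: (winner, loser).
votes = [("model_x", "model_y"), ("model_x", "model_y"), ("model_y", "model_x")]
for winner, loser in votes:
    ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser], a_won=True)
print(ratings)
```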
Pros:
- Hard to Game: Model developers have no idea what Chatbot Arena users will ask and so cannot tune their models for a fixed test set.
- Captures subtleties: Subtleties of what makes a chatbot “good” may not be captured by any single test but can still be felt. A number of chatbots do well on benchmarks yet lack creativity or basic appeal in their responses.
Cons:
- Lack of Task Specificity: Chatbot Arena users are not evaluating any specific task. While this can be mitigated with an in-house system, the default setup lets users pose just about any question, which makes it hard to distinguish between LLMs that each excel at one specific task.
- Volunteer Labor / Cost: To evaluate an LLM system, you can rely on Chatbot Arena’s volunteers or try to set up your own arena. In the former case you are selecting for a very specialized group of people; in the latter you will likely need to spend a fair amount of money to overcome network effects.
- Subjectivity: Ultimately, Chatbot Arena scores are subjective. While we can hope for wisdom of the crowd, Elo systems can exacerbate the subjectivity found in Likert-type scales by spreading relatively few votes across many match-ups and giving users free rein over what they ask. How much this matters is ultimately unknown.
Multiple Choice Questions (MMLU, various)
Another way LLM-based models can be evaluated is with multiple-choice tests, which are easy to grade. Examples of this format appear in a number of BIG-bench tasks and in the popular MMLU benchmark.
These questions allow for automated testing whenever a new model build is created, and the approach is not limited to multiple choice: any question with a similarly definite answer works well. A sampled answer can simply be compared to the reference, but multiple-choice questions offer a further option. The next-token logits give an effective probability for each answer option, and the cross entropy of this probability vector against the one-hot encoded correct answer can serve as a score.
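Here is a minimal sketch of that probability-based scoring, assuming a Hugging Face causal LM and answer choices that tokenize to a single token. The model name, prompt, and choice formatting are illustrative.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # Illustrative; any causal LM works the same way.
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("Question: What is the capital of France?\n"
          "A. Berlin\nB. Paris\nC. Rome\nD. Madrid\nAnswer:")
choices = [" A", " B", " C", " D"]   # Single-token continuations (leading space matters).
correct = 1                          # Index of the right answer ("B").

with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]  # Next-token logits.

choice_ids = [tok.encode(c)[0] for c in choices]   # First token of each choice.
choice_logits = logits[choice_ids]                 # Restrict to the four options.
probs = F.softmax(choice_logits, dim=-1)

# Cross entropy against the one-hot encoded correct answer.
loss = F.cross_entropy(choice_logits.unsqueeze(0), torch.tensor([correct]))
print(probs.tolist(), loss.item())
```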
Pros:
- Accuracy and speed: Combines the precision of human evaluation with the speed of machine processing.
- Automation: Can be quickly executed with each new model release.
Cons:
- Limited scope: Questions must have definite answers, which may not fully test a model’s general capabilities.
- Non-naturalistic problems: Many questions may not reflect real-world usage scenarios.
- Memorization issues: Models might recognize and reproduce answers from their training data.
NLP Similarity Metrics
For questions with definite answers that are more complex than multiple choice, n-gram similarity metrics such as BLEU can be used to check whether a generated answer shares the same words, pairs of adjacent words, or triples of adjacent words as a reference answer. These metrics are sensitive to phrasing, but they penalize wrong word orders while still giving partial credit. Similarity metrics of this sort, chiefly ROUGE and BLEU, appear in a number of BIG-bench tasks and are widely used to evaluate machine translation.
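To illustrate the core idea, the snippet below computes clipped n-gram precision, the main ingredient of BLEU, from scratch. In practice you would likely reach for an established implementation, but the toy version shows why phrasing differences cost points while partial overlap still earns credit.

```python
from collections import Counter

def ngram_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped n-gram precision: the fraction of candidate n-grams found in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())

reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
for n in (1, 2, 3):
    print(f"{n}-gram precision: {ngram_precision(candidate, reference, n):.2f}")
```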
Pros:
- Speed: Allows rapid machine-based testing.
- Natural language compatibility: Suitable for evaluating natural language question/answer pairs.
- Interpretable and deterministic: Produces clear, repeatable results.
Cons:
- Sensitivity to phrasing: Lacks semantic understanding and may penalize correct answers due to synonym usage.
- Potential for rejecting valid answers: May overlook correct responses that are phrased differently from the reference.
Neural Evaluation Models (BLEURT)
BLEURT is a neural model trained to replicate human ratings of text quality. It leverages transfer learning to evaluate outputs on novel datasets, striking a balance between human evaluation and traditional n-gram metrics. Similar learned metrics have been built for a number of other datasets. These models add complexity compared to n-gram approaches, but the hope is that their semantic understanding yields higher accuracy.
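A minimal usage sketch, assuming the `bleurt` package from the google-research repository is installed and a checkpoint has been downloaded; the checkpoint name and example sentences are illustrative.

```python
# Sketch of scoring candidate answers with BLEURT.
from bleurt import score

checkpoint = "BLEURT-20"  # Path to a downloaded checkpoint; name is illustrative.
scorer = score.BleurtScorer(checkpoint)

references = ["The Eiffel Tower is in Paris."]
candidates = ["The Eiffel Tower is located in Paris, France."]

# Higher scores indicate closer agreement with the reference as judged by the
# learned model, rather than by raw n-gram overlap.
scores = scorer.score(references=references, candidates=candidates)
print(scores)
```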
Pros:
- Speed: Faster than human evaluation, allowing for quicker assessments.
- Efficiency: Converts a small amount of human evaluation data into a more robust model that can generalize to new datasets.
- Accuracy: Can potentially achieve high correlation with human judgments by learning from human-rated examples.
Cons:
- Interpretability: The model’s decisions can be opaque, making it hard to understand why a particular rating was given.
- Overfitting: There is a risk that the model might overfit to the training data, reducing its effectiveness on new, unseen data.
- Technical Complexity: Implementing and fine-tuning neural models like BLEURT can be technically challenging and resource-intensive.
LLM Machine Grading (GPT-4/Claude 3)
This approach uses general-purpose language models such as GPT-4 or Claude 3 to evaluate answers against a reference answer through a prompt-based system. Both GPT-4 and Claude 3 have been found to agree well with human graders, which makes them an appealing option for many use cases. However, if you are also using GPT-4 or Claude for the task itself, note that both are known to prefer their own outputs over others. That may not be a huge issue on its own, but it is likely just the tip of the iceberg of subtle biases, given the closed-source and uninterpretable nature of these tools. There is a lot we simply do not know about this method yet, so it should be applied with appropriate caution.
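A sketch of prompt-based grading using the OpenAI Python client; the model name, rubric wording, and 1-to-5 scale are illustrative choices rather than any standard.

```python
# Sketch of LLM-as-judge grading with the OpenAI Python client (openai>=1.0).
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment.

GRADER_PROMPT = """You are grading an answer against a reference answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rate the candidate from 1 (wrong) to 5 (fully correct). Reply with only the number."""

def grade(question: str, reference: str, candidate: str, model: str = "gpt-4o") -> int:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,  # Reduce run-to-run variation in grades.
    )
    return int(response.choices[0].message.content.strip())

print(grade("What is 2 + 2?", "4", "The answer is 4."))
```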
Pros:
- Ease of Implementation: APIs for GPT-4 and Claude 3 make it straightforward to set up and use.
- Accuracy: In some test cases, these models have shown superior grading accuracy compared to human evaluators. (For full transparency, this is somewhat speculative and other studies suggest there may not be as close of an alignment with human graders as was previously thought.)
- Speed: Automated grading is much faster than manual human evaluation, enabling quicker turnaround times.
Cons:
- Compounding Errors: On problems where GPT-4 or Claude 3 cannot themselves produce a satisfactory answer, their grades can paint a misleadingly rosy picture.
- Lack of Interpretability: The reasoning behind the model’s grading decisions can be unclear.
- Cost: Using advanced models like GPT-4 or Claude 3 can be expensive, particularly at scale.
- Inconsistent Score Ranges: Likert-type scores output by GPT-4 do not always correlate well with measurable performance metrics (for example, when grading texts containing spelling mistakes).
- Bias Toward Own Outputs: These models may show a preference for responses similar to their own generated text, which can introduce bias.
- Novelty: These methods are relatively new and may still have undiscovered limitations or require further validation.
Baseline Metrics
This section is intended as a brief catch-all. LLM applications can also be assessed with a number of intrinsic metrics such as perplexity or consistency. These metrics apply mostly to base models and have some correlation with accuracy, as seen in the way benchmark performance tends to improve as perplexity falls.
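As a sketch, perplexity is just the exponential of the mean token-level cross entropy. The snippet below computes it with a Hugging Face causal LM; the model name and example text are illustrative, and long documents would normally be scored with a sliding window.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # Illustrative base model.
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "Perplexity measures how surprised the model is by a piece of text."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean token-level cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")  # exp(mean negative log-likelihood)
```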
Pros:
- Unlabelled Data: No labelled data is required to determine metrics such as perplexity.
Cons:
- Lack of applicability: While perplexity is an important metric to judge various LLMs, the relationship between perplexity and downstream tasks is far from straightforward.
Conclusion
We hope this blog post has given you some more insight into the world of LLM benchmarking. Each of the methods covered can shed light on the performance of an LLM or a downstream LLM application. Ideally, by experimenting with different models and datasets using these methodologies, you can draw your own well-informed conclusions.