TLDR: Evals make sense for ranking different base models in arbitrary units, and have their place in testing, but the premise of using them to guarantee software performance is flawed.
What are evals?
Evals (evaluations) are test-based performance measurements of AI systems. For example, in a customer service chatbot, an eval could entail prompting the chatbot with a set of customer queries, scoring the results (based on helpfulness, accuracy, etc.), and aggregating the scores to give a picture of how well the chatbot performs.
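As a minimal sketch of that loop, assuming a hypothetical `ask_chatbot` call and a crude keyword rubric standing in for a real scorer (nothing here is a real API):

```python
# A minimal eval loop: run test queries through the chatbot, score each
# response, and aggregate the scores into a single number.
# `ask_chatbot` and `score_response` are hypothetical stand-ins for a real
# deployed system and a real scoring rubric.

test_cases = [
    {"query": "How do I reset my password?", "keyword": "reset link"},
    {"query": "What is your refund policy?", "keyword": "30 days"},
]

def ask_chatbot(query: str) -> str:
    # Stand-in: in practice this would call the deployed chatbot or its API.
    return "You can request a reset link from the login page."

def score_response(response: str, keyword: str) -> float:
    # Stand-in rubric: a crude keyword check returning 0 or 1.
    return float(keyword.lower() in response.lower())

scores = [
    score_response(ask_chatbot(case["query"]), case["keyword"])
    for case in test_cases
]
print(f"Eval score: {sum(scores) / len(scores):.2f}")  # the aggregate number the eval reports
```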
There's been a proliferation of evaluation tools, both standalone and incorporated into observability platforms or app-building tools. My impression is that the dominant mode of building with AI is still ad-hoc development without systematic testing ("prompt and pray"); however, the conventional wisdom is that evals are essential to responsible AI development.
Why are evals being used?
Like all things AI, the focus with evals seems to be on the how, not the why. Some business reasons one might run evals are:
- To comply with a standard: Various standards and regulations (the EU AI Act, for example) mandate testing in some instances. This doesn't really get at the root question of why testing is useful, but compliance is a legitimate business need, so organizations test their models.
- Performance testing: Like the earlier chatbot example, evaluation can be seen as a proxy for testing the performance of an AI system. For base models like GPT and Claude, performance benchmarks are used for comparison with previous offerings and with competitors. For chat systems, there are common evaluators like RAGAS that claim to test how good the system is. Arguably, performance tests act as a stand-in for more costly user research; they have long been known to be a poor substitute.
- Red teaming: Red teaming can be thought of as stress testing to look for weaknesses or problems. It is distinct from performance measurement in that the goal should be to find root causes, rather than to evaluate statistical performance. Red teaming is often seen in the context of security testing, but it can also apply to performance, for example checking whether a chatbot suffers from common known failure modes.
- Prompt engineering and A/B testing: Evals provide a standard way to test the effect of changes to system design, such as prompts, retrieval, etc., to optimize performance (a sketch of this workflow follows the list). A risk is that the system can be overfit to the eval set.
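The prompt engineering and A/B testing use case usually boils down to a comparison loop over the same eval set. A minimal sketch, where `run_system` and `score` are hypothetical stand-ins rather than real APIs:

```python
# Compare two candidate system prompts on the same eval set and keep the
# higher-scoring one. Iterating on this against a fixed eval set is exactly
# how a system ends up overfit to that set.
# `run_system` and `score` are hypothetical stand-ins, not real APIs.

from statistics import mean

PROMPT_A = "You are a helpful assistant for a dog sitting service."
PROMPT_B = "Only answer questions covered by the dog sitting FAQ. Be concise."

eval_queries = [
    "Do you accept Mastiffs?",
    "What areas do you cover?",
    "Can you also watch my cat?",
]

def run_system(system_prompt: str, query: str) -> str:
    # Stand-in: in practice this would call the full chatbot system.
    return f"(answer produced under: {system_prompt!r}) for: {query}"

def score(response: str) -> float:
    # Stand-in scorer: in practice often an LLM judge or a keyword rubric.
    return float("dog sitting" in response.lower())

results = {
    name: mean(score(run_system(prompt, q)) for q in eval_queries)
    for name, prompt in [("prompt_a", PROMPT_A), ("prompt_b", PROMPT_B)]
}
print(results)  # pick the winner, then resist tuning until the score hits 1.0
```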
What is the problem with evals?
Evals alone do not provide useful performance information, let alone guarantees. It's hard or impossible to do them well.
Key Issues:
- Data: Comprehensive test data is hard to come by for real use cases, so evals typically use either a synthetic dataset (often generated with AI) or a smaller hand-crafted dataset. While it's not impossible to build comprehensive datasets for a given task, it's often cost prohibitive (and requires stepping away from the keyboard into the real world), so the industry favors synthetic data that is often unrealistic, or generic benchmarks that aren't representative of the specific task being judged.
- Scoring: Evaluating a nontrivial task (like the answer quality of a chatbot) is hard and expensive, so automated methods are favored (e.g. LLM-as-a-judge; see the first sketch after this list). This places a limit on the quality of the scores and the types of issues they can catch, and it recurses the evaluation problem into "judging the judge" without really solving it.
- Systems: Every AI tool deployed in a business application is a system that includes an AI model (like GPT-4) and additional architecture like databases, prompt chains, etc. Evaluating the model alone provides limited information about whether the system will be fit for purpose. Yet the majority of evals target the base model. This is understandable because it allows standardization and comparison between base models, and the evals are designed for those building base models. But it's irrelevant to end uses. So when we see model benchmarks for legal, financial, or medical knowledge, these are red herrings that don't tell us how a system built on top of the model will perform.
- Failure Analysis: Evals overwhelmingly aggregate results into some kind of average that ignores the nature and severity of the errors or failures that are present. If a chatbot answers with 95% accuracy, that could mean that every response is substantially correct, or that 19 out of 20 are great and 1 in 20 is a blatant lie (a toy example follows the list). On evals related to security or appropriate language, we often see minor issues that would have little commercial importance aggregated with major ones that could present an existential problem.
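To make the Scoring point concrete, this is roughly what LLM-as-a-judge looks like in practice. The sketch assumes the OpenAI Python SDK and the gpt-4o-mini model purely for illustration; the rubric wording and the brittle integer parsing are where the "judging the judge" problem lives, and nothing here validates that the judge itself is any good.

```python
# LLM-as-a-judge: ask one model to grade another model's answer.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set;
# the rubric and parsing below are illustrative, not a recommended design.

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the following chatbot answer for helpfulness and accuracy
on a scale of 1-5. Reply with a single integer.

Question: {question}
Answer: {answer}"""

def judge(question: str, answer: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    # Fragile by design: any deviation from "a single integer" breaks scoring.
    return int(completion.choices[0].message.content.strip())

print(judge("Do you accept Mastiffs?", "Sorry, we only accept dogs under 100kg."))
```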
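And to make the Failure Analysis point concrete: two eval runs can report the identical aggregate score while describing very different systems. A toy example with hypothetical failure labels:

```python
# Two runs with the same aggregate accuracy but very different failure profiles.
# The severity labels are made up; the point is that the average hides them.

from collections import Counter

run_a = ["correct"] * 95 + ["minor_wording_issue"] * 5
run_b = ["correct"] * 95 + ["fabricated_refund_policy"] * 5

for name, run in [("Run A", run_a), ("Run B", run_b)]:
    accuracy = run.count("correct") / len(run)
    failures = Counter(r for r in run if r != "correct")
    print(name, f"accuracy={accuracy:.0%}", dict(failures))
```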
The long tail
AI is a long tail problem: the distribution of possible inputs (such as questions a chatbot can be asked) contains regularly occurring outliers that can never be captured in an evaluation dataset. New circumstances will always arise in production that have not been tested for. So, no matter how thoroughly a system has been evaluated, evals alone cannot demonstrate the absence of problems. This comes up regularly when new models are released and Reddit instantly finds some embarrassing example of model behavior.
The wrong construct
The long tail problem leads naturally to the biggest problem with evals. With LLMs, we build these intractably complex, ersatz sentiences that we don't understand, stand them up in some domain-specific task, then try to test them to see if they behave. I'd speculate that this way of doing things arose organically out of academic machine learning practices. Simpler predictive ML models, like those you might encounter in banking or advertising, are essentially curve fitting: we fit some data and then run an eval to see how good the fit is.
But we've taken this idea and applied it to validating the performance of all-powerful LLMs that are capable of basically any language construction, hoping that a prompt that says "You're a helpful assistant, only talk about the FAQ on our dog sitting service website" is enough to keep them on task. It's in these low-breadth applications (like the dog sitting FAQ) that the eval approach to AI system development seems most egregious: a chatbot that should be saying things like "sorry, we don't accept Mastiffs as our service is only available for dogs under 100kg" is, under the hood, capable of writing Python code or a short essay about how the moon landing was faked.
No other software is designed this way – although, ironically, it may be with LLM-based coding tools. The fact that LLMs are nondeterministic (if that's true) is no excuse: software is regularly built around human and environmental factors that add randomness, without discarding development practices in favor of evals.
Conclusion
As typically implemented, evals aren't great. This is largely due to their false appeal as a substitute for proper testing with users. In reality, every shortcut taken in data generation, automatic scoring, etc. chips away at their usefulness. Even a good implementation of evals suffers from the long tail problem, which is itself a symptom of the root issue: the test-and-patch style of software development being used with LLMs.
There are cases where none of this matters: where unpredictable behavior can be tolerated and the stakes are low. And evals have their place, as discussed earlier. But for anything where consistency and predictability are critical, using evaluation as any kind of guarantee of behavior simply doesn't work.