Outline
- Introduction: Why Evaluation Matters
- The Unique Challenges of Evaluating Foundation Models
- Language Modeling Metrics Explained
  - Entropy, Cross-Entropy, Perplexity, BPC/BPB
- Exact Evaluation Methods
  - Functional Correctness
  - Reference-Based & Similarity Metrics
  - Embeddings & Semantic Similarity
- AI as a Judge: Can AI Evaluate AI?
  - How it works, benefits, limitations, biases
- Comparative Evaluation & Ranking Models
- Future Directions & Closing Thoughts
Section 1: Introduction – Why Evaluation Matters
Artificial intelligence is no longer an experimental technology confined to research labs—it now shapes how we search for information, get medical advice, interact with businesses, and even how we think about creativity. But with this rapid integration into everyday life comes a sobering truth: AI systems fail, and when they do, the consequences can be severe.
In recent years, we’ve witnessed alarming examples of AI gone wrong. A man tragically took his own life after prolonged conversations with a chatbot that encouraged self-harm. In another case, lawyers submitted fabricated legal citations hallucinated by an AI assistant, only to face sanctions in court. Air Canada was even ordered to pay damages when its AI-powered customer support chatbot misled a passenger. These stories aren’t outliers—they highlight an underlying risk: without robust evaluation methods, AI can produce outputs that look confident but are dangerously unreliable.
The irony is that as teams rush to build AI applications, they often discover that the hardest part isn’t training the model or deploying the infrastructure—it’s figuring out how to measure whether the system works as intended. For many projects, evaluation consumes the majority of development time. It’s not enough to build a model that “sounds good”; it must be trustworthy, consistent, and safe for real-world use.
Why is evaluation so difficult? Unlike traditional machine learning models, where outputs are often discrete and measurable (e.g., “Is this image a cat or not?”), foundation models like GPT or LLaMA are open-ended. They don’t just give a single correct label—they generate text, code, summaries, or even images, where multiple answers can be valid. This introduces a fundamental challenge: how do you decide whether one answer is “better” than another?
Compounding this difficulty is the fact that many foundation models are black boxes. Developers usually don’t know exactly what training data was used, how the architecture is designed, or what biases are baked into the system. The only way to assess these models is to observe their outputs, which makes evaluation both essential and frustratingly imprecise.
Another problem is the rapid obsolescence of benchmarks. In AI, a benchmark is a dataset used to measure model performance. But as models improve, benchmarks get “saturated” quickly. For example, the GLUE benchmark, once considered a gold standard for natural language understanding, became too easy within a year of release. Its successor, SuperGLUE, met the same fate not long after. Similarly, popular benchmarks like MMLU have had to evolve into harder versions to keep pace with foundation models’ capabilities. This means evaluation is not a one-time task—it’s a moving target that must evolve alongside AI progress.
Despite these challenges, evaluation is not a hopeless task. In fact, the growing complexity of AI has spurred innovation in evaluation methods, from classic metrics like perplexity to novel approaches such as “AI as a judge,” where models are used to evaluate other models. These methods don’t just mitigate risks—they also uncover opportunities by showing us what models can (and cannot) do, sometimes even beyond human capability.
As this blog series unfolds, we’ll dive into the methodologies that engineers, researchers, and product teams use to evaluate AI systems. From the math behind perplexity, to practical tools like embedding similarity, to the controversial but promising use of AI itself as an evaluator—we’ll explore the evolving landscape of evaluation and why it sits at the heart of AI engineering.
Ultimately, evaluation is not about proving that AI is perfect. It’s about building confidence—for developers, for businesses, and for the public—that these systems can be deployed safely and responsibly. Without evaluation, AI remains a gamble; with it, AI becomes a tool we can actually trust.
Section 2: The Unique Challenges of Evaluating Foundation Models
Evaluating traditional machine learning models was never simple, but it was at least bounded. If you built a spam filter, you could test it against a labeled dataset: emails marked “spam” or “not spam.” Accuracy, precision, and recall gave you a reasonable sense of how well the system worked. Foundation models, however, belong to a different category. They are not narrowly trained for a single task; they are general-purpose, open-ended systems capable of tackling countless problems, many of which the creators themselves never anticipated. This shift fundamentally changes the nature of evaluation.
Let’s break down the major challenges:
1. The Intelligence Paradox
The smarter a model becomes, the harder it is to evaluate. A first grader’s math homework is easy to check—errors are obvious. But imagine reviewing a solution to a graduate-level topology problem. Even if the reasoning looks sound, verifying it could require deep domain expertise. Foundation models increasingly operate at this “graduate student” level. For instance, an AI might generate a flawless-sounding legal argument or scientific summary, but unless you’re an expert (and have the time to fact-check every claim), it’s almost impossible to know whether it’s correct.
This paradox means that as AI advances, human evaluators become less qualified to judge outputs, creating a bottleneck. In fact, some researchers have joked that if it already takes the brightest minds alive to evaluate today’s models, who will evaluate the AIs of tomorrow?
2. Open-Endedness and the Ground Truth Problem
In classic ML, ground truth is straightforward. If the correct label for an image is “cat” and the model says “dog,” the model is wrong. But open-ended tasks don’t work that way. If you ask a model to summarize War and Peace, there is no single “correct” summary. Dozens of different outputs could be valid, each with different emphases. Similarly, in translation, a French phrase like “Comment ça va?” could be translated as “How are you?”, “How’s it going?”, or “How are things?” Which one is the ground truth?
The lack of a clear, exhaustive reference set makes evaluation inherently subjective. At best, we can approximate correctness through similarity metrics or human preference data—but even these have limitations.
3. The Black-Box Nature of Foundation Models
Most foundation models are proprietary. Companies like OpenAI, Anthropic, and Google rarely disclose details about their models’ architectures, training datasets, or fine-tuning processes. Even open-source models can be so complex that understanding their internals is impractical for most developers. This black-box status means evaluation relies almost entirely on outputs, without visibility into why a model behaves the way it does.
It’s like judging a chef by tasting a dish without knowing the ingredients or cooking process. You might learn something, but you’ll never see the full picture of strengths, weaknesses, and biases.
4. Benchmark Saturation
Benchmarks have long been the yardstick of AI progress. But with foundation models, benchmarks quickly become obsolete. When GLUE was released in 2018, it was considered a major step toward testing natural language understanding. Within a year, models maxed out its scores, forcing the creation of SuperGLUE. Other benchmarks—like MMLU (Massive Multitask Language Understanding)—also faced saturation, prompting more challenging successors like MMLU-Pro.
This rapid obsolescence highlights the cat-and-mouse game between AI progress and evaluation benchmarks. Each time researchers design a new test, models leapfrog it within months. The result is a never-ending chase to stay ahead, making evaluation costly and resource-intensive.
5. Expanding Scope of Evaluation
In the past, evaluating a model meant checking performance on the task it was trained for. A translation model was evaluated on translation. A vision model on image classification. Foundation models, however, are general-purpose. They don’t just answer questions—they write poems, debug code, generate marketing copy, and even help discover new protein structures.
This broad scope means evaluation is not just about verifying known tasks. It’s also about discovering new capabilities—things the model can do that we didn’t expect. This requires not only measuring performance but also probing for strengths, limitations, and emergent behaviors. Sometimes, evaluation becomes a tool for exploration rather than just validation.
6. Time and Cost of Evaluation
Evaluating sophisticated outputs takes time. Unlike checking a classification label, assessing a generated essay, poem, or legal brief requires careful reading, reasoning, and sometimes subject-matter expertise. This can be more labor-intensive than building the system itself. Human evaluation is expensive and slow, while automated evaluation is often incomplete or misleading.
In short, the evaluation of foundation models is harder, more subjective, and more resource-intensive than traditional ML evaluation. Yet it is also more critical than ever. Without proper evaluation, we risk deploying systems that are unreliable, unsafe, or misleading at scale.
The good news is that researchers and practitioners are developing new tools and methodologies to address these challenges. The next step in our exploration is to understand the metrics that power language models, which remain the backbone of most foundation models. These metrics—entropy, cross-entropy, perplexity, and their relatives—may sound mathematical, but they are essential for understanding how well a model predicts, compresses, and generalizes from language.
Section 3: Language Modeling Metrics Explained
Before we dive into modern evaluation methods, it’s worth revisiting the fundamental metrics of language modeling. Most foundation models—including GPT, LLaMA, Claude, and their peers—are built on top of language models. Even when they’re multimodal, text prediction remains their core mechanism. That means the quality of these models is often tied to how well they predict the next word or token in a sequence.
This is where metrics like entropy, cross-entropy, perplexity, and bits-per-character/byte (BPC/BPB) come in. Though rooted in information theory and probability, these metrics are not just mathematical abstractions. They serve as vital guides for training, fine-tuning, and evaluating the predictive strength of models. Let’s unpack them in plain language.
1. Entropy – The Measure of Uncertainty
Entropy, in information theory, measures how much “surprise” or unpredictability exists in a dataset. The more unpredictable the data, the higher its entropy.
Take a simple example: suppose you’re trying to predict whether a coin flip lands on heads or tails. With a fair coin, there’s equal chance of either outcome—50/50. That means entropy is high: you can’t really predict what comes next. Now imagine a biased coin that always lands on heads. Entropy here is zero, because the outcome is perfectly predictable.
When applied to language, entropy tells us how much information is carried by each token (word, character, or sub-word). A language with very strict rules and few variations has lower entropy. For example, HTML code is far more predictable than a casual conversation between friends.
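To make this concrete, here is a minimal sketch in Python (standard library only, with toy distributions chosen purely for illustration) that computes Shannon entropy in bits:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally unpredictable for two outcomes: 1 bit per flip.
print(entropy([0.5, 0.5]))    # 1.0

# A heavily biased coin is nearly predictable, so its entropy is close to zero.
print(entropy([0.99, 0.01]))  # ~0.08

# A toy "language" where one token dominates carries less information per token.
print(entropy([0.7, 0.1, 0.1, 0.05, 0.05]))  # ~1.46 bits per token
```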
2. Cross-Entropy – Measuring Prediction Accuracy
Entropy tells us about the dataset itself. Cross-entropy tells us how well a model has learned to approximate that dataset.
Imagine you’re teaching a student to predict coin flips. If the true distribution is 50/50 but the student always predicts “heads,” the student’s guesses diverge from reality. Cross-entropy quantifies this divergence.
In language modeling, cross-entropy captures how closely the probability distribution predicted by the model aligns with the true distribution of the training data. Lower cross-entropy means the model is better at predicting the correct next token.
Formally, cross-entropy is the sum of two parts:
- The inherent entropy of the dataset (its unpredictability).
- The divergence between the model’s predictions and reality (the “gap” in learning).
When training a model, the goal is to minimize cross-entropy. Since the dataset’s inherent entropy is fixed, this amounts to shrinking the divergence between the model’s predictions and reality.
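The following sketch illustrates that decomposition with toy coin distributions (the numbers are invented for illustration): cross-entropy in bits equals the data's own entropy plus the gap between model and reality.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits: the average surprise of a model Q
    when the data is actually distributed according to P."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """The learning 'gap': H(P, Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)

true_dist  = [0.5, 0.5]   # a fair coin
model_dist = [0.9, 0.1]   # a model that is overconfident in heads

print(entropy(true_dist))                    # 1.0 bit  (inherent unpredictability)
print(cross_entropy(true_dist, model_dist))  # ~1.74 bits (entropy + gap)
print(kl_divergence(true_dist, model_dist))  # ~0.74 bits (the gap training shrinks)
```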
3. Bits-Per-Character (BPC) and Bits-Per-Byte (BPB)
One complication arises: different models tokenize text differently. Some use characters, others use sub-words, others whole words. That makes raw cross-entropy values hard to compare across models.
To standardize, researchers introduced bits-per-character (BPC) and bits-per-byte (BPB). These normalize entropy and cross-entropy based on how many bits are needed to encode each unit.
For example, if one model needs 3 bits on average to encode each character, we can compare it fairly to another model with a different tokenization scheme. BPB goes one step further by normalizing per byte, a fixed-size unit that stays the same regardless of how the text is encoded (ASCII, UTF-8, and so on), whereas a “character” can span a varying number of bytes.
Beyond being an evaluation tool, BPC/BPB also reflect how well a model compresses text. A model with low BPB can compress its training data more efficiently, which is a sign of strong predictive power.
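As a back-of-the-envelope illustration, the sketch below converts a per-token cross-entropy into BPC and BPB. The token, character, and byte averages are made-up numbers; real values depend on the tokenizer and the corpus.

```python
# Hypothetical numbers for illustration: a model averages 3.2 bits of
# cross-entropy per token, and its tokenizer produces tokens that span
# about 4.0 characters (roughly 4.2 bytes in UTF-8) on average.
bits_per_token  = 3.2
chars_per_token = 4.0
bytes_per_token = 4.2

bpc = bits_per_token / chars_per_token  # bits-per-character
bpb = bits_per_token / bytes_per_token  # bits-per-byte

print(f"BPC: {bpc:.2f}")  # 0.80 bits per character
print(f"BPB: {bpb:.2f}")  # ~0.76 bits per byte
```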
4. Perplexity – A More Intuitive Metric
If cross-entropy feels abstract, perplexity makes it more tangible. Perplexity is simply the exponential of cross-entropy: 2 raised to the cross-entropy when it is measured in bits, or e raised to it when measured in nats.
Think of perplexity as the number of equally likely choices the model thinks it has at each prediction step. For example:
- A model with perplexity = 2 is as uncertain as flipping a fair coin.
- A model with perplexity = 10 means it effectively has 10 equally likely options at each step.
- Lower perplexity means the model is more confident and accurate.
Perplexity is widely reported in AI research papers because it’s easier to interpret. A perplexity of 3, for instance, suggests the model is about as uncertain as choosing one option from three possibilities.
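Here is a small sketch of that relationship. It assumes you already have the probabilities a model assigned to the tokens that actually occurred (the values below are invented); perplexity is just the exponentiated average negative log-probability.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities the model assigned to the observed
    tokens: exp of the average negative log-probability (natural log here,
    which is equivalent to 2 raised to the cross-entropy in bits)."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities for a short sequence.
confident_model = [0.9, 0.8, 0.95, 0.85]
uncertain_model = [0.2, 0.1, 0.3, 0.15]

print(perplexity(confident_model))  # ~1.1: close to "only one real choice" per step
print(perplexity(uncertain_model))  # ~5.8: roughly six equally likely options per step
```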
5. Why These Metrics Matter
At first glance, these metrics may seem academic. After all, when deploying a chatbot, you care about whether it answers questions correctly—not about its cross-entropy score. But these metrics matter because:
- They guide training. Minimizing cross-entropy (and therefore perplexity) helps developers know whether a model is learning effectively.
- They correlate with downstream performance. A model that predicts text well usually performs better on tasks like summarization, translation, or question answering.
- They detect data leakage. If a model has suspiciously low perplexity on a benchmark dataset, it might have seen that dataset during training—making its performance results less trustworthy.
- They support data cleaning. High-perplexity examples often indicate noisy or unusual data (e.g., gibberish or rare edge cases). This can help in deduplication and dataset curation (see the sketch below).
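As a rough sketch of the data-cleaning use case, the helper below flags examples whose perplexity under a reference model exceeds a threshold. Both `score_perplexity` and the cutoff are placeholders; plug in whatever model and threshold suit your dataset.

```python
def split_by_perplexity(examples, score_perplexity, max_ppl=100.0):
    """Separate examples into kept vs. flagged based on perplexity.
    Very high perplexity often signals gibberish, encoding noise, or rare
    edge cases worth manual review. `score_perplexity` is a placeholder
    for a function returning a reference model's perplexity on a text."""
    kept, flagged = [], []
    for text in examples:
        ppl = score_perplexity(text)
        (kept if ppl <= max_ppl else flagged).append((ppl, text))
    return kept, flagged
```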
6. The Limits of Perplexity
It’s important to note that perplexity isn’t everything. In fact, post-training techniques like reinforcement learning from human feedback (RLHF) can improve how “helpful” or “aligned” a model is, but often at the cost of higher perplexity. Similarly, quantization (reducing the numerical precision of a model’s weights to save memory) can shift perplexity scores in ways that don’t necessarily reflect real-world usefulness.
That’s why perplexity and its cousins are best seen as foundational metrics, not final verdicts. They tell us how well a model predicts language at the token level—but they don’t capture higher-order qualities like truthfulness, reasoning ability, or creativity.
In summary, entropy, cross-entropy, perplexity, and related metrics are the compass points of language modeling. They don’t tell the whole story, but they establish the baseline. To move beyond prediction accuracy into real-world usability, we need more nuanced evaluation methods—ones that consider function, similarity, and semantics.
That’s exactly where we’ll turn next: Exact Evaluation Methods that judge not just whether text is probable, but whether it’s useful, correct, and aligned with human expectations.
Section 4: Exact Evaluation Methods
If entropy, cross-entropy, and perplexity measure a model’s statistical ability to predict language, exact evaluation methods take things a step further. Instead of asking “How well does the model fit the training data?” these methods ask “Does the model’s output actually solve the task at hand?”
This is especially critical for open-ended outputs, where correctness isn’t always obvious. Exact evaluation aims to reduce ambiguity by creating measurable, reproducible tests. Let’s explore the key approaches.
1. Functional Correctness – Does It Work?
The most straightforward way to evaluate an AI system is to check whether it performs the intended function.
For example:
- If you ask a model to generate a Python function for calculating the greatest common divisor (GCD), does the code run and return the correct result across multiple test cases?
- If you task a model with writing SQL queries, does the generated query run successfully and return the expected rows?
This approach, known as functional correctness, is widely used in benchmarks like OpenAI’s HumanEval or Google’s MBPP (Mostly Basic Python Problems). These benchmarks test AI-generated code by running it against predefined unit tests. A solution “passes” only if it succeeds in all cases.
The strength of functional correctness is that it’s objective and automatable. Either the code runs as expected or it doesn’t. No human interpretation is needed. Similar logic applies to game-playing bots (measured by score), optimization tasks (measured by efficiency), or scheduling tasks (measured by resource savings).
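A minimal HumanEval-style check might look like the sketch below: execute a candidate solution and verify it against predefined test cases. The `gcd` prompt, the test cases, and the absence of sandboxing are simplifications for illustration; real harnesses isolate untrusted code before running it.

```python
def passes_tests(generated_code: str, test_cases) -> bool:
    """Run AI-generated code and check it against unit tests.
    A solution "passes" only if every test case succeeds."""
    namespace = {}
    try:
        exec(generated_code, namespace)  # define the candidate function
        gcd = namespace["gcd"]           # assumes the prompt asked for a `gcd` function
        return all(gcd(a, b) == expected for a, b, expected in test_cases)
    except Exception:
        return False

candidate = """
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a
"""
tests = [(54, 24, 6), (17, 5, 1), (0, 9, 9)]
print(passes_tests(candidate, tests))  # True
```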
However, this method is limited to tasks where success can be clearly defined and tested. Writing a poem, generating marketing copy, or summarizing a book can’t be reduced to a simple “pass or fail.” For these tasks, we need alternative strategies.
2. Reference-Based Evaluation – Comparing to Ground Truth
When functional correctness isn’t possible, the next best thing is to compare model outputs against reference data.
For example:
- In machine translation, a French-to-English model’s output can be compared against human translations.
- In summarization, a generated abstract can be compared against gold-standard summaries written by experts.
Each example pairs an input (e.g., French sentence) with one or more reference responses (e.g., English translations). The model’s output is then scored based on how similar it is to these references.
But there’s a catch: language is flexible. If the reference translations are “How are you?” and “How’s it going?”, a model output of “How are things?” is arguably correct but might still be penalized if it doesn’t match references exactly. That’s why evaluation in this space splits into three categories:
- Exact Match: The strictest measure—output must match a reference word-for-word. This works for trivia-style tasks (“Who was the first woman to win a Nobel Prize?” → “Marie Curie”), but quickly breaks down for open-ended outputs.
- Lexical Similarity: Measures word overlap between outputs and references. Popular metrics here include BLEU, ROUGE, METEOR, and CIDEr. These were dominant in the pre–foundation model era, especially in machine translation and image captioning. They’re simple, fast, and objective—but brittle. Two semantically correct outputs can score very differently just because of phrasing differences.
- Semantic Similarity: Instead of counting overlapping words, this method compares the meaning of outputs and references, often using embeddings (we’ll dive deeper into this shortly).
Reference-based evaluation is widely used but comes with trade-offs. It requires curating high-quality reference datasets, which is expensive and time-consuming. Worse, poor-quality references can lead to misleading evaluations. In fact, recent translation benchmarks have uncovered flawed reference translations that skew results, showing that even “gold standards” aren’t always reliable.
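For a concrete feel of how the stricter categories behave, here is a standard-library sketch pairing exact match with a SQuAD-style token-overlap F1 (used here as a stand-in for heavier lexical metrics like BLEU or ROUGE); the normalization rules are deliberately simplified.

```python
from collections import Counter

def exact_match(prediction: str, references) -> bool:
    """Strictest check: the normalized output must equal some reference."""
    norm = lambda s: s.lower().strip().rstrip(".?!")
    return any(norm(prediction) == norm(ref) for ref in references)

def token_f1(prediction: str, reference: str) -> float:
    """Lexical similarity via token overlap: rewards shared words,
    ignores word order and meaning."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

references = ["How are you?", "How's it going?"]
print(exact_match("how are you", references))                # True
print(round(token_f1("How are things?", references[0]), 2))  # 0.67: right meaning, penalized phrasing
```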
3. Embedding-Based Evaluation – Measuring Meaning
One of the most exciting advances in evaluation is the use of embeddings—numerical representations of text (or other modalities) that capture semantic meaning.
Here’s how it works:
- Convert both the model output and reference output into embedding vectors (using models like BERT, Sentence Transformers, or OpenAI’s embedding APIs).
- Compare the vectors using similarity metrics such as cosine similarity.
- The closer the vectors, the more semantically similar the texts are.
For example:
- “What’s up?” and “How are you?” have different words but very similar embeddings, so they would score highly on semantic similarity.
- “Let’s eat, grandma” and “Let’s eat grandma” share many words lexically but have very different meanings. Embedding similarity can catch this nuance.
Metrics like BERTScore and MoverScore have become popular because they move beyond surface-level overlap and align better with human judgment.
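A minimal sketch of this workflow, assuming the sentence-transformers package and its small all-MiniLM-L6-v2 model (any embedding model works, and exact scores vary by model); the second, unrelated pair is my own contrast example:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small general-purpose embedder

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

pairs = [
    ("What's up?", "How are you?"),                           # different words, similar meaning
    ("What's up?", "Quarterly revenue fell by ten percent."), # unrelated meanings
]
for left, right in pairs:
    emb_left, emb_right = model.encode([left, right])
    print(f"{left!r} vs {right!r}: {cosine_similarity(emb_left, emb_right):.2f}")
```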
The advantages of embedding-based evaluation include:
- Flexibility across languages and phrasing.
- Ability to compare across modalities (e.g., text and images with CLIP).
- Reusability for other tasks like clustering, ranking, or anomaly detection.
However, the quality of results depends heavily on the embedding model used. Poor embeddings can distort similarity scores, and computing embeddings at scale can be computationally expensive.
4. Strengths and Weaknesses of Exact Evaluation
- ✅ Strengths: Objective, reproducible, automatable (at least for certain tasks).
- ❌ Weaknesses: Limited to tasks with clear ground truths or high-quality references. Struggles with subjective or creative tasks.
Why Exact Evaluation Still Matters
Despite its limits, exact evaluation forms the foundation of AI evaluation pipelines. When possible, functional correctness provides the most reliable measure of performance. When that’s not feasible, reference-based and embedding-based methods provide structured ways to approximate correctness.
Yet even with these, we face challenges. Many tasks remain inherently subjective. For example, what makes one essay “better” than another? What makes a joke funny or a story compelling? These gray areas can’t always be captured with unit tests or similarity scores.
This brings us to one of the most controversial but rapidly growing areas in AI evaluation: using AI itself as the evaluator. Instead of asking humans to decide which output is better, can we let a model judge?
That’s the subject of our next section.
Section 5: AI as a Judge – Can AI Evaluate AI?
So far, we’ve looked at traditional approaches to evaluation: functional correctness, reference-based methods, and embedding similarity. These are powerful tools, but they still have major limitations. They either require carefully curated reference data (which is expensive), or they only work well for tasks with clear “right or wrong” answers. What about the messy middle—subjective tasks like essay writing, dialogue, or storytelling?
This is where a bold idea has gained traction: using AI itself to evaluate AI. This approach, often called AI as a judge or LLM-as-a-judge, flips the problem around. Instead of relying on humans for every evaluation, we ask an AI system to decide whether another AI’s output is good, bad, or somewhere in between.
Why AI as a Judge?
The appeal is obvious:
- Scalability: Human evaluation is slow and costly. AI judges can evaluate thousands of responses in minutes.
- Flexibility: An AI judge can be instructed to evaluate almost any dimension—correctness, style, tone, helpfulness, factual accuracy, toxicity, or creativity.
- No reference data required: AI judges can evaluate in production environments where ground truths don’t exist.
Imagine building a customer support chatbot. Instead of manually grading every response, you could use an AI judge to flag whether the answer was relevant, polite, and safe. Similarly, in research settings, AI judges can rapidly compare competing models’ outputs across diverse tasks.
How AI as a Judge Works
There are several ways to implement this:
- Self-Assessment: The AI evaluates its own output. For example: “Given the question and my response, rate how helpful my response is on a scale of 1–5.”
- Reference Comparison: The AI compares a generated output against a reference answer, acting as a more flexible alternative to lexical similarity: “Does this answer match the reference answer in meaning? Output True or False.”
- Head-to-Head Comparison: The AI judges between two outputs and decides which is better: “Here are two answers to the same question. Which is more accurate, A or B?”
This comparative setup is especially powerful for ranking models or generating preference data, which is later used for reinforcement learning from human feedback (RLHF).
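As a sketch of the head-to-head setup, the helper below formats a pairwise judge prompt and randomizes which answer appears as A or B to blunt positional bias. `call_llm` is a placeholder for whatever chat-completion client you use, and the prompt wording is only one reasonable choice among many.

```python
import random

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
decide which answer is more accurate and helpful. Reply with exactly "A" or "B".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def judge_pair(question, response_1, response_2, call_llm):
    """Pairwise LLM-as-a-judge: returns which response the judge prefers.
    Shuffling the A/B positions mitigates (but does not remove) positional bias."""
    flipped = random.random() < 0.5
    a, b = (response_2, response_1) if flipped else (response_1, response_2)
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer_a=a, answer_b=b)).strip()
    winner_is_first = (verdict == "A") != flipped
    return "response_1" if winner_is_first else "response_2"
```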
Evidence That It Works
Skepticism is natural—can you really trust AI to grade itself? Interestingly, research suggests the answer is often yes.
- A 2023 study found that GPT-4’s judgments on certain benchmarks agreed with human evaluations about 85% of the time—higher than the agreement rate between humans themselves (~81%).
- The AlpacaEval project reported near-perfect correlation (0.98) between its AI-judged rankings and human-graded leaderboards.
- In practice, platforms like LangChain, MLflow, and Azure AI Studio have already built AI-judge functionality into their evaluation pipelines.
This doesn’t mean AI judges are perfect, but it does suggest they’re “good enough” for many tasks, especially in early development when rapid feedback matters more than flawless precision.
Benefits in Practice
- Speed: You can iterate quickly without waiting on human annotators.
- Coverage: AI judges can evaluate at scale, spotting issues across vast datasets.
- Auditability: Many AI judges not only score but also explain their reasoning, making evaluations more transparent.
For example, you might ask GPT-4 to rate the helpfulness of an answer and explain why it gave a score of 2/5. This explanation can reveal biases or misunderstandings in both the model and the judge.
Limitations and Risks
Of course, this approach has real pitfalls:
- Inconsistency: AI models are probabilistic. Run the same judge twice, and you might get different scores. Prompting carefully and setting sampling parameters can reduce this, but not eliminate it.
- Criteria Ambiguity: Unlike human-designed metrics, AI-judge criteria aren’t standardized. For example, one tool might define “faithfulness” differently than another. Scores aren’t always comparable across systems.
- Bias: AI judges can have self-bias (favoring outputs from their own model family) or positional bias (preferring the first option in comparisons). They may also inherit societal biases from their training data.
- Cost and Latency: Running a strong model like GPT-4 as both the generator and the judge doubles inference costs and can slow down applications. Using smaller models helps but may reduce reliability.
- Trust Gap: Many teams remain uneasy about the “fox guarding the henhouse” problem: if we use AI to evaluate AI, are we simply amplifying its blind spots?
The Pragmatic View
Despite its flaws, AI-as-a-judge is often the most practical choice. Human evaluation remains the gold standard, but it doesn’t scale. Automated metrics like BLEU or ROUGE are too rigid. AI judges strike a middle ground—imperfect but fast, flexible, and increasingly correlated with human judgment.
The trick is to use them wisely:
- Combine AI judges with occasional human spot-checks.
- Standardize prompts and criteria for consistency.
- Treat scores as directional signals, not absolute truths.
In this light, AI as a judge isn’t about replacing humans but about making evaluation continuous and scalable. It helps teams ship faster while still maintaining some confidence in quality.
As foundation models become more capable and open-ended, comparative methods will grow even more important. Rather than asking “Is this response correct?”, evaluation will increasingly ask “Which response is better?”—a subtle but critical shift.
That leads us to our next section: Comparative Evaluation and Ranking Models—how AI systems are benchmarked against each other and what it means for the future of competition in AI.
Section 6: Comparative Evaluation & Ranking Models
Up to this point, we’ve discussed evaluation in terms of absolute performance: does the model output match references, pass tests, or align with a judge’s criteria? But as foundation models proliferate, another question becomes just as important: how do different models compare against each other?
This shift from absolute correctness to relative quality has given rise to comparative evaluation and ranking models. Instead of asking, “Did the model get it right?” we increasingly ask, “Which model did it better?”
Why Comparative Evaluation?
Comparative evaluation reflects how people actually choose between AI tools. Imagine you’re deciding whether to use GPT, Claude, or LLaMA in your product. You don’t care only about whether each one can summarize a document—you care which one does it better.
Comparative evaluation is also essential for:
- Model benchmarking: Ranking models on leaderboards like LMSYS Chatbot Arena or the Hugging Face Open LLM Leaderboard.
- Fine-tuning: Collecting preference data (A > B) to train models with reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO).
- User-facing products: A/B testing chatbot responses, search results, or recommendation systems.
How It Works
Comparative evaluation usually follows one of three setups:
- Pairwise Comparison: The simplest approach—present two model outputs side by side and ask an evaluator (human or AI judge) to choose the better one.
  - Example: “Here are two summaries of the same article. Which is clearer and more accurate?”
- Ranking Multiple Outputs: Instead of pairwise decisions, evaluators rank outputs from multiple models at once.
  - Example: Given four chatbot answers, rank them from best to worst.
- Elo or Bradley-Terry Models: Inspired by chess ranking systems, these statistical models take many pairwise comparisons and estimate a global ranking of models.
  - The Chatbot Arena (by LMSYS) uses Elo scoring to crowdsource model rankings. Users vote in blind head-to-head comparisons, and models earn or lose “points” accordingly.
This framework transforms subjective judgments into a competitive leaderboard, where models rise and fall in real time based on user preferences.
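To show how pairwise votes turn into a leaderboard, here is a minimal Elo update in Python. The starting rating, K-factor, and example votes are conventional or invented values, not the parameters of any particular arena.

```python
from collections import defaultdict

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Shift two ratings after one head-to-head vote. K controls how fast
    ratings move; 400 and 32 are conventional chess-style defaults."""
    gap = 1 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * gap
    ratings[loser]  -= k * gap

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(model, round(rating, 1))  # a tiny leaderboard built from three blind votes
```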
Human vs. AI Judges in Comparative Evaluation
Comparative evaluation can be done by either humans or AI judges:
- Humans: Provide high-quality, nuanced judgments but are costly, slow, and inconsistent at scale.
- AI Judges: Provide fast, scalable judgments but can introduce biases (e.g., favoring outputs similar to their own training data).
A common compromise is to use AI judges for most comparisons, supplemented by periodic human calibration. For example, LMSYS Arena relies primarily on human votes, while AlpacaEval and MT-Bench lean heavily on AI judges.
Challenges in Comparative Evaluation
While powerful, comparative methods face several issues:
- Bias: AI judges can show positional bias (preferring option A because it appears first) or model-family bias (favoring outputs closer to their own style). Even humans aren’t immune—users may prefer answers that sound confident, regardless of factual correctness.
- Inconsistency: Rankings aren’t always transitive. A may be preferred to B, and B to C, but C to A—creating paradoxes in leaderboards. Statistical ranking methods smooth out some of this noise, but it’s never perfect.
- Context Dependence: “Better” is not universal. One model may excel at reasoning, another at creativity, another at safety. Rankings that collapse all these dimensions into a single score can be misleading.
- Evaluation Fatigue: Humans (and even AI judges) get inconsistent when asked to evaluate too many outputs. Comparative methods require careful design to avoid noise.
The Strength of Comparative Evaluation
Despite these challenges, comparative methods have become the gold standard for evaluating foundation models. Why? Because they align closely with real-world use. When end users interact with two different models, they don’t care about perplexity or BLEU scores. They care which model feels more helpful, reliable, or enjoyable to use.
Comparative evaluation also fuels progress through competition. Leaderboards create public pressure for improvement and transparency, much like ImageNet did for computer vision. While no leaderboard is perfect, they provide a shared benchmark that drives the field forward.
Example: LMSYS Chatbot Arena
The Chatbot Arena is perhaps the most famous example. Users are shown blind responses from two different models to the same prompt and asked: “Which do you prefer?” Over hundreds of thousands of comparisons, models earn Elo scores.
The results often surprise even researchers. Models that score similarly on academic benchmarks sometimes diverge widely in human preference. This reveals an important insight: what matters in the lab isn’t always what matters in the real world.
Looking Ahead
Comparative evaluation is likely to become even more central as the number of foundation models grows. In the near future, we may see:
- Task-specific leaderboards (e.g., medical, legal, creative writing).
- Dynamic leaderboards that adapt based on user context (e.g., “best for reasoning” vs. “best for politeness”).
- Multi-dimensional rankings that break free from single scores, acknowledging that “best” depends on what you value.
Comparative evaluation moves us closer to a pragmatic truth: there is no single “best” AI model. Instead, there are trade-offs—between creativity and safety, accuracy and efficiency, reasoning and speed. Ranking methods help us navigate those trade-offs in a structured way.
But as with all of AI evaluation, the story doesn’t end here. Even comparative methods struggle with deeper issues of truth, trust, and societal impact. Which brings us to the final piece of the puzzle: Future Directions & Closing Thoughts.
Section 7: Future Directions & Closing Thoughts
We’ve traveled through the landscape of AI evaluation—from foundational metrics like perplexity, to exact methods of correctness, to the bold new frontier of AI-as-a-judge and comparative evaluation. Each method comes with strengths, weaknesses, and trade-offs, but taken together they form a toolkit that AI engineers rely on to ensure models are not only functional but also trustworthy, safe, and aligned with human expectations.
As AI models continue to evolve, so too must our methods of evaluating them. Let’s close with a look at the future directions that will shape this space in the years ahead.
1. Moving Beyond Benchmarks
Benchmarks have historically been the lifeblood of AI progress—ImageNet for vision, GLUE and SuperGLUE for NLP, MMLU for multitask reasoning. But benchmarks age quickly. Models saturate them, sometimes within months of release, and they stop being meaningful indicators of progress.
The future lies in dynamic and adaptive evaluation frameworks, ones that evolve alongside models. Instead of static datasets, we’ll see benchmarks that automatically generate new, harder challenges, possibly even using AI itself to refresh test sets. Think of it as an “evergreen exam” for AI systems—always a step ahead of the models being tested.
2. Human-in-the-Loop Evaluation
While AI judges and automated metrics are powerful, humans will remain indispensable. The key is hybrid evaluation, where humans and AI collaborate:
- AI judges handle scale—thousands of evaluations per minute.
- Humans provide calibration, context, and checks for subtle qualities like tone, fairness, or cultural sensitivity.
Platforms may increasingly use active learning to direct human evaluators to the most uncertain or contentious cases, making human time more efficient.
3. Domain-Specific Evaluation
General-purpose metrics (like perplexity or Elo rankings) tell only part of the story. For high-stakes applications—medicine, law, finance, education—domain-specific evaluation will become essential.
For example:
- In healthcare, evaluating whether an AI correctly diagnoses conditions requires not only accuracy but also explainability and adherence to medical guidelines.
- In law, outputs must be judged against jurisprudence and ethical standards, not just linguistic fluency.
- In education, “good” responses must balance correctness with pedagogical value—is the explanation understandable and helpful to learners?
Expect to see a proliferation of specialized evaluation frameworks designed by domain experts in collaboration with AI engineers.
4. Multi-Dimensional Evaluation
Today, many leaderboards reduce performance to a single score. But real-world usefulness isn’t one-dimensional. A model might be highly accurate but unsafe, or very creative but inconsistent.
Future evaluation will likely move toward multi-dimensional dashboards. Instead of asking “Which model is best?”, we’ll ask:
- Which model is most accurate?
- Which is most helpful?
- Which is safest?
- Which balances speed and cost most effectively?
This approach acknowledges that “best” depends on context and values. A lawyer, a teacher, and a game designer may all want different things from the same AI system.
5. Continuous, Real-World Evaluation
One of the biggest shifts on the horizon is toward continuous evaluation. Models don’t operate in controlled lab settings—they run in production, serving millions of users.
Future pipelines will integrate evaluation directly into deployment:
- Monitoring outputs in real time for safety and accuracy.
- Using AI judges to flag problematic responses.
- Collecting user feedback as ongoing evaluation data.
Think of it as DevOps for AI evaluation—not a one-time process, but a continuous cycle of monitoring, feedback, and improvement.
6. Evaluating Emergent Behaviors
Foundation models are notorious for exhibiting capabilities they weren’t explicitly trained for, such as solving math problems or writing code. Evaluating these emergent behaviors is a frontier challenge.
Future evaluation may involve discovery frameworks—systems designed to probe models for hidden abilities, risks, and edge cases. This could look like adversarial testing, automated exploration, or even competitions where models are stress-tested under unpredictable conditions.
7. Ethics, Fairness, and Societal Impact
Perhaps the most important dimension is also the hardest to measure: ethics and social impact. Evaluating models on bias, fairness, and safety isn’t just about accuracy—it’s about trust.
The future may involve standardized ethical audits, much like financial audits today. Companies might be required to demonstrate not only how accurate their models are but also how they perform across different demographic groups, how they mitigate harmful outputs, and how transparent their evaluation methods are.
Closing Thoughts
Evaluation is often seen as the “boring” part of AI—less glamorous than training giant models or discovering new architectures. But in reality, evaluation is the beating heart of AI engineering. It determines not only whether a system works, but whether it can be trusted.
If training is about building intelligence, evaluation is about building confidence—in the model, in the product, and in the technology as a whole. Without rigorous evaluation, AI is a gamble. With it, AI becomes a reliable partner, capable of enhancing human work and creativity rather than undermining it.
As AI continues to weave itself into the fabric of society, the stakes will only grow higher. The methods we’ve discussed—functional correctness, reference-based evaluation, embeddings, AI-as-a-judge, comparative ranking—are just the beginning. The future will demand smarter, fairer, and more adaptive evaluation frameworks, designed not just for machines but for the humans who depend on them.
And perhaps that’s the ultimate lesson: evaluation is not just about AI. It’s about us—our standards, our expectations, and our willingness to hold technology accountable.