Even (very) noisy LLM evaluators are useful for improving AI agents
Summary
- LLM evaluators are often noisy and weakly correlated with real-world outcomes.
- Noisy evaluators have limited value for production decisions that hinge on judging a single output (e.g. guardrails).
- However, even (very) noisy evaluators can reliably tell you which agent is better on average, meaning they can still help you pick the best variant to deploy and improve it over time.
It’s surprisingly hard to develop reliable LLM evaluators: they’re often noisy and poorly correlated with the metrics or outcomes practitioners actually care about. Sometimes the target is directly measurable but evaluators still disagree with experts (e.g. on correctness or faithfulness to a source document). Other times the target is only accessible through a proxy (e.g. whether code that passes tests satisfies user needs). And sometimes the target is hard to observe at all (e.g. whether a customer was actually happy with an interaction).
Why is it so hard to develop reliable LLM evaluators?
Rule-based and classical NLP metrics are often brittle and miss the semantic dimensions that matter.1, 2 Learned reward models are vulnerable to distribution shift3 and reward hacking.4 Studies of LLM-as-a-judge setups have repeatedly documented systematic biases and limitations: judges are heavily swayed by surface-level style,5 prefer longer responses to shorter ones of similar quality,6 are inconsistent across repeated evaluations and minor prompt variations,7 often align poorly with human judgments,8 and may correlate weakly with the downstream outcomes they’re meant to predict.9
An evaluator’s quality can be measured at two granularities:
- Output-level correlation measures how well its score on individual outputs matches real-world outcomes. It governs production workflows (e.g. guardrails), where decisions hinge on individual outputs and noisy evaluators are unreliable. We’ll call an evaluator noisy with respect to a metric or outcome of interest if its output-level correlation is low.
- Agent-level correlation measures how well its average over many outputs matches an agent’s real-world quality. It governs offline variant selection (e.g. picking the best prompt or model), and, unlike output-level correlation, it generally climbs with sample size as per-output noise averages out.
Even very noisy evaluators can be reliable for offline selection: enough to ship better agents today and keep improving them over time.
Why noisy evaluators can still rank agents
The key insight is that even a very noisy evaluator can yield scores that are higher on average for agents that truly are higher quality: the noise washes out over many samples.
To formalize this, suppose we have two agents we want to compare, $A$ and $B$. Let $\mu_A$ and $\mu_B$ represent the mean true scores for $A$ vs $B$ in the problem setting of interest, where true score refers to the thing we’d ideally want to measure, like how well the agent handled a customer’s query or whether it produced runnable code. Suppose that higher scores are better. Then we’d say that $A$ is better than $B$ if $\mu_A > \mu_B$.
Now suppose we have an evaluator whose scores can be regarded as noisy versions of the true scores. Here are three hypothetical samples of true scores and evaluator scores for increasingly noisy evaluators:
The leftmost evaluator is accurate enough to judge individual outputs in production. The rightmost isn’t: its verdict on any single output is too noisy to trust.
However, if we’re using an evaluator offline to choose between $A$ and $B$, then we don’t need every individual value to be accurate. We just need the evaluator to tell us which agent is better overall. All three evaluators will do that, given sufficiently large evaluation samples.
Suppose Agent $A$ has true-score mean $\mu_A$ and Agent $B$ has a mean $\mu_B$ that is 0.30 lower, so $A$ is the better agent. Below are the same scatterplots as in the figure above, but with each output now colored by which agent it came from. Let $\bar{e}_A$ and $\bar{e}_B$ be the average evaluator scores for each agent, shown as horizontal dashed lines on each plot. In all three initial samples, $\bar{e}_A > \bar{e}_B$, meaning the evaluator correctly leads us to choose the better agent.
Sampling details
For each agent, we model the distribution of true scores as a Beta distribution parameterized by its mean $\mu$ and a fixed concentration $\kappa$:

$$y \sim \mathrm{Beta}\big(\mu\kappa,\ (1-\mu)\kappa\big).$$

The mean of the distribution is exactly $\mu$, and increasing $\kappa$ concentrates mass more tightly around $\mu$. We fix $\mu_A$, $\mu_B$, and $\kappa$ at values that give each agent a unimodal distribution with moderate spread while keeping the two visually comparable.

For each evaluator with noise level $\sigma$, the evaluator score for an output with true score $y$ is

$$e = y + \varepsilon,$$

where $\varepsilon$ is zero-mean random noise with standard deviation $\sigma$, set to increasing values for the slightly, moderately, and very noisy evaluators, respectively. Each click of the “Draw new samples” button shows a fresh random realization with $n$ trajectories per agent; the empirical means $\bar{e}_A$ and $\bar{e}_B$ are the average evaluator scores within each agent’s sample.
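For concreteness, here’s a minimal Python sketch of this sampling scheme. The Beta parameterization follows the formula above; the Gaussian noise, the clipping to $[0, 1]$, and the specific values of $\mu_A$, $\mu_B$, $\kappa$, $\sigma$, and $n$ are illustrative assumptions, not necessarily the ones behind the interactive figure:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_scores(mu, kappa, sigma, n, rng):
    """Draw n true scores from Beta(mu*kappa, (1-mu)*kappa), then add
    zero-mean evaluator noise (Gaussian here), clipping to [0, 1]."""
    true = rng.beta(mu * kappa, (1 - mu) * kappa, size=n)
    noisy = np.clip(true + rng.normal(0.0, sigma, size=n), 0.0, 1.0)
    return true, noisy

# Illustrative values: a 0.30 true-score gap, as in the figure.
mu_A, mu_B, kappa, n = 0.65, 0.35, 10.0, 50
for sigma in (0.05, 0.15, 0.40):  # slightly / moderately / very noisy
    _, e_A = draw_scores(mu_A, kappa, sigma, n, rng)
    _, e_B = draw_scores(mu_B, kappa, sigma, n, rng)
    print(f"sigma={sigma:.2f}  mean_A={e_A.mean():.3f}  "
          f"mean_B={e_B.mean():.3f}  correct={e_A.mean() > e_B.mean()}")
```

Even at the largest $\sigma$, the per-agent averages usually land in the right order on a single draw of 50 trajectories.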
Even though the samples are noisier as we move from left to right, they still tend to produce the correct ordering ($\bar{e}_A > \bar{e}_B$) once they’re averaged. Of course, these values are random, so there’s always some chance that the empirical means will mislead us, pointing to the worse agent as the better one. How likely that is depends on a few things:
- How separated the agents are. A bigger gap between $\mu_A$ and $\mu_B$, relative to the variances of the scores, makes it easier to preserve the correct ordering under noise.
- How noisy the evaluator is. A less noisy evaluator narrows the spread (lowers the variances) of $\bar{e}_A$ and $\bar{e}_B$, making the correct ordering more likely at any given sample size.
- How many evaluator samples we have. Empirical means concentrate around their expected values as the sample size grows, so larger evaluation datasets give more reliable comparisons — no matter how well separated the agents are or how noisy the evaluator is.
In general, even noisy evaluators can reliably distinguish stronger from weaker agents, given a sufficiently large evaluation dataset.
How big does an evaluation dataset need to be?
The sample size required to reliably distinguish two agents scales inversely with the square of the performance gap between them — halving the gap roughly quadruples the number of samples you need. This squared scaling comes from how the sampling distribution of a mean tightens with the sample size $n$: the variance of a sample mean shrinks as $1/n$, so its standard error shrinks as $1/\sqrt{n}$, and reliably resolving a gap of size $\Delta$ requires the standard error to be small relative to $\Delta$ — i.e., $n$ must grow as $1/\Delta^2$. The interactive figure above is illustrative: with a 0.30 gap and only a modest number of samples per agent, even the noisiest of the three evaluators gets the ordering right essentially every draw. For agents that differ by 5 to 10 percentage points on an outcome of interest — a typical magnitude in practice — even a fairly noisy evaluator can give the correct ranking with high probability on a few hundred to a few thousand examples.
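As a back-of-the-envelope check: under a normal approximation to the two sample means, the gap $\Delta$ is resolved with probability $p$ once $\Delta \ge z_p \cdot s\sqrt{2/n}$, where $s$ is the per-output standard deviation of evaluator scores. Solving for $n$ gives a rough planning formula. The function below is a sketch under that approximation; `score_sd` is an assumed input you’d estimate from pilot data:

```python
import math
from statistics import NormalDist

def samples_needed(gap, score_sd, p_correct=0.95):
    """Rough per-agent sample size so the empirical evaluator means
    rank two agents correctly with probability >= p_correct, assuming
    approximately normal sample means with equal per-output SDs."""
    z = NormalDist().inv_cdf(p_correct)
    return math.ceil(2 * (z * score_sd / gap) ** 2)

print(samples_needed(0.10, 0.5))  # -> 136
print(samples_needed(0.05, 0.5))  # -> 542: halving the gap ~quadruples n
```

Note the $1/\Delta^2$ scaling in the formula: `gap` enters squared in the denominator.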
The argument above works as long as the evaluator is not biased in a way that causes it to systematically favor the worse variant.
Formal argument and failure modes
This section formalizes the claim that the empirical evaluator means can recover the true ordering of agents given enough samples, and discusses when this can fail.
Let $x$ be an agent output, and let $y(x)$ be the true score for $x$, meaning the thing we’d ideally like to measure. Each agent gives rise to a distribution over trajectories, and these distributions can differ in arbitrary ways. Maybe $A$ tends to produce long, detailed responses, while $B$ tends to be short and crisp. Or maybe $A$ is more prone to hallucination than $B$. The average scores $\mu_A$ and $\mu_B$ are expected values over those distributions, which we’ll denote with $\mathbb{E}_A$ and $\mathbb{E}_B$, so that

$$\mu_A = \mathbb{E}_A[y(x)], \qquad \mu_B = \mathbb{E}_B[y(x)].$$

Now let $e$ be an evaluator, so that $e(x)$ is the score the evaluator assigns to output $x$. Define a noise term $\varepsilon(x) = e(x) - y(x)$. Then we have

$$e(x) = y(x) + \varepsilon(x). \tag{1}$$
The empirical evaluator means $\bar{e}_A$ and $\bar{e}_B$ converge, under mild sampling conditions, to the true population evaluator means:

$$\bar{e}_A \to \mathbb{E}_A[e(x)], \qquad \bar{e}_B \to \mathbb{E}_B[e(x)].$$

Therefore, in order for the empirical means to lead us to choose the better agent with high probability, we need the expected evaluator scores to mirror the true ordering of agents. That is, we desire the following property to hold:

$$\mu_A > \mu_B \;\implies\; \mathbb{E}_A[e(x)] > \mathbb{E}_B[e(x)] \tag{2}$$

for all pairs of candidate agents $A$ and $B$. From Eq. (1) and the linearity of expectation, we have

$$\mathbb{E}_A[e(x)] = \mu_A + \mathbb{E}_A[\varepsilon(x)], \qquad \mathbb{E}_B[e(x)] = \mu_B + \mathbb{E}_B[\varepsilon(x)].$$

From this, we see that Eq. (2) can be satisfied in many different ways. For example, if the noise terms average out to 0 for both distributions (a common assumption), then we’d have

$$\mathbb{E}_A[e(x)] = \mu_A, \qquad \mathbb{E}_B[e(x)] = \mu_B,$$

from which Eq. (2) would immediately follow. Even if the noise doesn’t average out to 0 — meaning the evaluator systematically under- or over-estimates the true scores — we’d still obtain the same result, as long as it averages out to the same value for both agents. More generally, Eq. (2) can hold as long as the per-agent noise gap $\mathbb{E}_A[\varepsilon(x)] - \mathbb{E}_B[\varepsilon(x)]$ doesn’t have the wrong sign and a large enough magnitude to reverse the true-score gap $\mu_A - \mu_B$. Equivalently, the evaluator can have arbitrary per-agent biases, as long as those biases don’t favor the worse agent strongly enough to overturn its true disadvantage.
In practice, of course, we never observe the expectations $\mathbb{E}_A[e(x)]$ and $\mathbb{E}_B[e(x)]$ directly; we only have the empirical means computed over a finite evaluation dataset, and on any given draw those empirical means might disagree with the true ordering. The claim is that, when Eq. (2) holds and trajectories are sampled appropriately, the probability of disagreement vanishes as the evaluation dataset grows:

$$\Pr\big(\bar{e}_A \le \bar{e}_B\big) \to 0 \quad \text{as} \quad n \to \infty, \tag{3}$$

where $n$ is the size of the evaluation dataset, and assuming without loss of generality that $\mu_A > \mu_B$. In plain terms: with a large enough evaluation dataset, we will correctly identify the better agent with arbitrarily high probability — even when the evaluator is noisy on individual trajectories. We omit a formal proof here, but (3) can be shown to hold under typical sampling regimes, such as iid sampling, stationary ergodic processes, or other forms of sufficiently weak dependence.
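A quick Monte Carlo check of claim (3), reusing the Beta-plus-noise toy model from the sampling details above (all parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def p_wrong_order(mu_A, mu_B, sigma, n, kappa=10.0, trials=2000):
    """Estimate P(empirical mean of A <= empirical mean of B) when A is
    truly better, for a noisy evaluator over n trajectories per agent."""
    e_A = rng.beta(mu_A * kappa, (1 - mu_A) * kappa, (trials, n)) \
          + rng.normal(0.0, sigma, (trials, n))
    e_B = rng.beta(mu_B * kappa, (1 - mu_B) * kappa, (trials, n)) \
          + rng.normal(0.0, sigma, (trials, n))
    return float((e_A.mean(axis=1) <= e_B.mean(axis=1)).mean())

# A small true gap (0.05) and a very noisy evaluator (sigma = 0.40).
for n in (5, 20, 80, 320, 1280):
    print(n, p_wrong_order(mu_A=0.55, mu_B=0.50, sigma=0.40, n=n))
```

The disagreement probability shrinks steadily toward 0 as $n$ grows, matching (3).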
Where this breaks down. The conditions above are sufficient for the empirical mean evaluator scores to recover the true ordering with high probability, but they aren’t always satisfied in practice. A few common failure modes (the first is sketched in code after the list):
- Region-specific bias.10 If the evaluator’s bias varies across the score range — say, it gives polished-looking outputs extra credit regardless of correctness — and the agents under comparison concentrate their outputs in different regions, then $\mathbb{E}_A[\varepsilon(x)] \neq \mathbb{E}_B[\varepsilon(x)]$, and the per-agent noise gap can flip the sign of Eq. (2). More data causes the empirical means to converge to biased values rather than the truth.
- Distribution shift between offline and online. The offline test set may not match the distribution the deployed agent encounters in production. If the evaluator’s noise behaves differently on those two distributions, an offline ranking won’t necessarily predict online behavior — even when the offline argument goes through cleanly.
- Strong dependence or non-stationarity. The convergence claim (3) tolerates iid, ergodic, or weakly dependent sampling, but strongly correlated trajectories can prevent the empirical means from converging.
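Here is a tiny simulation of the first failure mode, with entirely hypothetical numbers: agent $B$ is truly worse but produces “polished” outputs that earn a style bonus from the evaluator, and no amount of data fixes the resulting flip:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000  # large n: the empirical means converge, but to biased values

# Agent A: truly better, plain style. Agent B: truly worse, polished style.
true_A = rng.normal(0.70, 0.10, n)
true_B = rng.normal(0.60, 0.10, n)

# Hypothetical evaluator: zero-mean noise everywhere, plus a +0.15
# "polish bonus" that only B's outputs trigger (region-specific bias).
eval_A = true_A + rng.normal(0.0, 0.30, n)
eval_B = true_B + rng.normal(0.0, 0.30, n) + 0.15

print(f"true means:      A={true_A.mean():.3f}  B={true_B.mean():.3f}")
print(f"evaluator means: A={eval_A.mean():.3f}  B={eval_B.mean():.3f}")
# The evaluator now ranks B above A, and more samples only entrench it.
```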
These failure modes aren’t specific to offline agent selection: they affect any use of an evaluator, online or offline.
How this works in real benchmarks
To see this phenomenon in action with real evaluation data, we ran LLM-generated evaluators on five tasks: Gridworld, Wordle, Data Extraction (NER), Data Extraction (NDA), and Business Management. We evaluated 25 agent variants (different prompts and models) per task, with 50 test traces per variant.
Each environment comes with a target metric that is computed programmatically and serves as the ground truth for any given trace: success or failure for Gridworld and Wordle, exact match against gold annotations for Data Extraction (NER), F1 score against gold annotations for Data Extraction (NDA), and number of subtasks completed for Business Management.
For each task we compute both correlations introduced above: the output-level correlation between an evaluator’s score on a single trace and that trace’s ground truth (holding the variant fixed), and the agent-level correlation between an evaluator’s mean over a variant’s traces and the variant’s ground-truth mean. The agent-level correlation exceeds the output-level correlation in every environment, often by a wide margin.
For example, Wordle’s output-level correlation is 0.41 — the evaluator is only modestly better than random at predicting which of two Wordle traces is better, holding the variant fixed. Its agent-level correlation is 0.96 — averaging across many traces per variant compresses the per-output noise into a much stronger signal of agent quality.
To further quantify the evaluator’s alignment with agent quality, we ask: when the evaluator compares two variants, how often does it pick the better one? That’s the evaluator’s pairwise win rate: the fraction of variant pairs where the evaluator’s mean ordering agrees with the ground-truth ordering. A win rate of 1.0 means perfect ranking; 0.5 is no better than random. We computed the win rate across all variant pairs for each environment:
Every evaluator we tested clears 0.5 (random) by a comfortable margin. Gridworld is nearly perfect: the evaluator picks the better of two variants 97% of the time. Wordle and Data Extraction (NER) are also very high, at 0.87 and 0.82, respectively. The same Wordle evaluator that you wouldn’t trust to gate individual outputs in production can reliably tell you which Wordle agent to ship. NDA and Business Management both have a win rate of 0.64 — less reliable than the others, but still meaningfully better than coin-flipping. Used as a selection signal, all of these evaluators will move you in the right direction on average, despite their very low utility for judging individual traces.
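The win-rate computation itself is simple. A sketch (the handling of exact ties is our assumption; the figures above may use a different convention):

```python
import itertools

def pairwise_win_rate(eval_means, truth_means):
    """Fraction of variant pairs where the ordering of evaluator means
    agrees with the ordering of ground-truth means (evaluator ties
    counted as half credit; ground-truth ties skipped)."""
    wins, total = 0.0, 0
    for i, j in itertools.combinations(range(len(eval_means)), 2):
        t_diff = truth_means[i] - truth_means[j]
        if t_diff == 0:
            continue  # no ground-truth ordering to agree with
        e_diff = eval_means[i] - eval_means[j]
        total += 1
        if e_diff * t_diff > 0:
            wins += 1.0
        elif e_diff == 0:
            wins += 0.5
    return wins / total
```

With 25 variants per task, each win rate is computed over up to $\binom{25}{2} = 300$ pairs.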
Benchmark setup and methodology
Agent and evaluator generation. Each task’s 25 agent variants differ in their system prompt and the underlying model they use. Both the agents and evaluators were LLM-generated: the LLM was told that each evaluator should be as highly correlated as possible with the task’s ground-truth metric, and was given access to a training set when generating the evaluators. We used small models and didn’t aim to optimize either the agents or the evaluators — the point isn’t to show how well an LLM can perform at these tasks, but to illustrate how even noisy evaluators can be useful for agent selection.
Pearson vs Spearman. The numbers above are Pearson correlations, which capture linear association — a good fit for the output-level use case (e.g. guardrails), where the magnitudes of evaluator scores feed directly into downstream decisions. Spearman correlations capture rank association and are a more natural fit for the variant selection use case, where what matters is whether the evaluator’s ranking matches the ground-truth ranking. The two metrics agree qualitatively in every environment we measured (both increase from output-level to agent-level, both consistent with the central claim) and are mathematically equivalent in the special case that both variables are binary.
| Task | Output-level: Pearson / Spearman correlation | Agent-level: Pearson / Spearman correlation |
|---|---|---|
| Gridworld | 0.81 / 0.81 | 0.97 / 0.98 |
| Wordle | 0.41 / 0.38 | 0.96 / 0.88 |
| Data Extraction (NER) | 0.08 / 0.27 | 0.75 / 0.79 |
| Data Extraction (NDA) | 0.28 / 0.24 | 0.43 / 0.38 |
| Business Management | 0.22 / 0.19 | 0.50 / 0.45 |
The largest divergence between the two metrics is on the Data Extraction (NER) task’s output-level cell, where ground truth is binary (exact match) and the Pearson value (0.08) is low: the evaluator’s discrete partial-credit scores track the binary outcome monotonically more than linearly, which depresses Pearson relative to Spearman.
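The binary special case is easy to verify: for two binary variables, the ranks are a positive affine transform of the raw values, and Pearson correlation is invariant to such transforms, so the two coefficients coincide. A quick check on toy data:

```python
from scipy.stats import pearsonr, spearmanr

# Toy binary data, made up for illustration.
x = [0, 1, 1, 0, 1, 0, 1, 1]
y = [0, 1, 0, 0, 1, 1, 1, 0]
print(pearsonr(x, y)[0])   # the phi coefficient
print(spearmanr(x, y)[0])  # identical, up to floating point
```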
Within-variant decomposition by de-meaning. The “output-level” bars in the chart above are not raw per-(variant, trace) correlations; those would mix two distinct effects: a between-variant effect (the evaluator agrees with ground truth on which variants are good on average) and a within-variant effect (the evaluator agrees with ground truth on which traces within a single variant are good). To isolate the within-variant effect, for each variant we subtract that variant’s mean from both the evaluator score and the ground-truth score, then compute the correlation across all (variant, trace) cells on the de-meaned values. What’s left is the part of the evaluator-vs-ground-truth signal that is not explained by knowing which variant produced the output — i.e., the signal an evaluator would need to reliably grade individual outputs in production.
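A sketch of that decomposition, assuming scores arranged in hypothetical `(variant, trace)` arrays:

```python
import numpy as np
from scipy.stats import pearsonr

# scores[v, t] / truth[v, t]: evaluator score and ground-truth metric for
# trace t of variant v (e.g. shape (25, 50) in the benchmark above).

def agent_level_r(scores, truth):
    """Between-variant signal: correlate per-variant means."""
    return pearsonr(scores.mean(axis=1), truth.mean(axis=1))[0]

def output_level_r(scores, truth):
    """Within-variant signal: subtract each variant's mean from both
    quantities, then correlate across all (variant, trace) cells."""
    s = scores - scores.mean(axis=1, keepdims=True)
    t = truth - truth.mean(axis=1, keepdims=True)
    return pearsonr(s.ravel(), t.ravel())[0]
```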
The Takeaway
Noisy evaluators can’t reliably judge individual agent outputs, but they can reliably distinguish overall agent performance, because the per-output noise averages out across many samples.
Per-output unreliability is what limits noisy evaluators for typical production tasks (e.g. guardrails), all of which hinge on trusting the verdict on any specific output. Reliability at the aggregate level is what makes them useful offline: on every environment we tested, the evaluator picked the better of two variants substantially more often than not. Used as a selection signal, even noisy evaluators can help you ship better-performing agents today and improve them over time.