Even (very) noisy LLM evaluators are useful for improving AI agents

· Alan Mishler

Summary

  • LLM evaluators are often noisy and weakly correlated with real-world outcomes.

  • Noisy evaluators have limited value for production decisions that hinge on judging a single output (e.g. guardrails).

  • However, even (very) noisy evaluators can reliably tell you which agent is better on average, meaning they can still help you pick the best variant to deploy and improve it over time.

It’s surprisingly hard to develop reliable LLM evaluators: they’re often noisy and poorly correlated with the metrics or outcomes practitioners actually care about. Sometimes the target is directly measurable but evaluators still disagree with experts (e.g. on correctness or faithfulness to a source document). Other times the target is only accessible through a proxy (e.g. whether code that passes tests satisfies user needs). And sometimes the target is hard to observe at all (e.g. whether a customer was actually happy with an interaction).

Why is it so hard to develop reliable LLM evaluators?

Rule-based and classical NLP metrics are often brittle and miss the semantic dimensions that matter.1, 2 Learned reward models are vulnerable to distribution shift3 and reward hacking.4 Studies of LLM-as-a-judge setups have repeatedly documented systematic biases and limitations: judges are heavily swayed by surface-level style,5 prefer longer responses to shorter ones of similar quality,6 are inconsistent across repeated runs and minor prompt variations,7 often align poorly with human judgments,8 and may correlate weakly with the downstream outcomes they’re meant to predict.9

An evaluator’s quality can be measured at two granularities:

  • Output level: how well the evaluator’s score on an individual output tracks that output’s true quality. This is what matters for production decisions like guardrails.

  • Agent level: how well the evaluator’s average score over many outputs tracks an agent’s true average quality. This is what matters for comparing and selecting agents.

Even very noisy evaluators can be reliable for offline selection: enough to ship better agents today and keep improving them over time.

Why noisy evaluators can still rank agents

The key insight is that even a very noisy evaluator can yield scores that are higher on average for agents that truly are higher quality: the noise washes out over many samples.

To formalize this, suppose we have two agents we want to compare, $A$ and $B$. Let $\mu_A$ and $\mu_B$ represent the mean true scores for $A$ vs $B$ in the problem setting of interest, where true score refers to the thing we’d ideally want to measure, like how well the agent handled a customer’s query or whether it produced runnable code. Suppose that higher scores are better. Then we’d say that $A$ is better than $B$ if $\mu_A > \mu_B$.

Now suppose we have an evaluator whose scores can be regarded as noisy versions of the true scores. Here are three hypothetical samples of true scores and evaluator scores for increasingly noisy evaluators:

Hypothetical samples of true scores (x-axis) and evaluator scores (y-axis) for three evaluators with increasing noise (slightly, moderately, and very noisy). The dashed line marks y = x (a perfect evaluator).

The leftmost evaluator is accurate enough to judge individual outputs in production. The rightmost isn’t: its verdict on any single output is too noisy to trust.

However, if we’re using an evaluator offline to choose between $A$ and $B$, then we don’t need every individual value to be accurate. We just need the evaluator to tell us which agent is better overall. All three evaluators will do that, given sufficiently large evaluation samples.

Suppose Agent $A$ has true-score mean $\mu_A = 0.6$ and Agent $B$ has $\mu_B = 0.3$, so $A$ is the better agent. Below are the same scatterplots as in the figure above, but with each output now colored by which agent it came from. Let $\widehat{\mu}_A$ and $\widehat{\mu}_B$ be the average evaluator scores for each agent, shown as horizontal dashed lines on each plot. In all three initial samples, $\widehat{\mu}_A > \widehat{\mu}_B$, meaning the evaluator correctly leads us to choose the better agent.

Samples drawn from the true-score distributions of Figure 2 (Agent A with mean 0.6, Agent B with mean 0.3; 30 samples per agent), with evaluator scores on the y-axis, for the slightly, moderately, and very noisy evaluators. Horizontal dashed lines mark each agent's empirical mean evaluator score ($\hat{\mu}$); the diagonal marks $y = x$.
Sampling details

For each agent, we model the distribution of true scores as a Beta distribution parameterized by its mean $\mu \in (0, 1)$ and a fixed concentration $\kappa > 0$:

$$S \sim \text{Beta}\left(\kappa\mu,\; \kappa(1 - \mu)\right).$$

The mean of the distribution is exactly $\mu$, and increasing $\kappa$ concentrates mass more tightly around $\mu$. We use $\kappa = 5$, $\mu_A = 0.6$, and $\mu_B = 0.3$, which gives each agent a unimodal distribution with moderate spread while keeping the two visually comparable.

True score distributions for Agent A ($\mu_A = 0.6$) and Agent B ($\mu_B = 0.3$), modeled as Beta densities with concentration $\kappa = 5$. Dashed vertical lines mark each agent's mean.

For each evaluator with noise $\sigma$, the evaluator score for an output with true score $S$ is

$$V = \text{clip}(S + \varepsilon,\; 0,\; 1),\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2),$$

with $\sigma = 0.03,\; 0.12,\; 0.25$ for the slightly, moderately, and very noisy evaluators, respectively. Each click of the “Draw new samples” button shows a fresh random realization with $N = 30$ trajectories per agent; the empirical means $\widehat{\mu}_A, \widehat{\mu}_B$ are the average evaluator scores within each agent’s sample.
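The sampling scheme above is easy to reproduce. Here is a minimal sketch in Python with NumPy (the function and variable names are ours, not from the original simulation code):

```python
import numpy as np

def sample_eval_scores(mu, kappa, sigma, n, rng):
    """Draw n true scores S ~ Beta(kappa*mu, kappa*(1-mu)), then return
    evaluator scores V = clip(S + eps, 0, 1) with eps ~ N(0, sigma^2)."""
    s = rng.beta(kappa * mu, kappa * (1 - mu), size=n)
    return np.clip(s + rng.normal(0.0, sigma, size=n), 0.0, 1.0)

rng = np.random.default_rng(0)
# Very noisy evaluator (sigma = 0.25), N = 30 trajectories per agent
v_a = sample_eval_scores(mu=0.6, kappa=5, sigma=0.25, n=30, rng=rng)
v_b = sample_eval_scores(mu=0.3, kappa=5, sigma=0.25, n=30, rng=rng)
print(v_a.mean(), v_b.mean())  # the A mean is usually the larger of the two
```

Re-running with different seeds plays the role of the “Draw new samples” button: on the large majority of draws the empirical mean for agent A exceeds that for B, even at $\sigma = 0.25$.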

Even though the samples are noisier as we move from left to right, they still tend to produce the correct ordering ($\widehat{\mu}_A > \widehat{\mu}_B$) once they’re averaged. Of course, these values are random, so there’s always some chance that the empirical means will mislead us, pointing to the worse agent as the better one. How likely that is depends on a few things:

  • How large the true gap $\mu_A - \mu_B$ is: bigger gaps are harder to obscure.

  • How noisy the evaluator is: noisier scores need more averaging to wash out.

  • How many outputs we evaluate per agent: larger samples tighten the empirical means.

In general, even noisy evaluators can reliably distinguish stronger from weaker agents, given a sufficiently large evaluation dataset.

How big does an evaluation dataset need to be?

The sample size required to reliably distinguish two agents scales inversely with the square of the performance gap between them: halving the gap roughly quadruples the number of samples you need. This squared scaling comes from how the sampling distribution of a mean tightens with $N$: the variance of a sample mean shrinks as $1/N$, so its standard error shrinks as $1/\sqrt{N}$, and reliably resolving a gap of size $\Delta$ requires the standard error to be small relative to $\Delta$, i.e., $N$ to grow as $1/\Delta^2$. The interactive figure above is illustrative: with a 0.30 gap and only $N = 30$ samples per agent, even the noisiest of the three evaluators gets the ordering right essentially every draw. For agents that differ by 5 to 10 percentage points on an outcome of interest (a typical magnitude in practice), even a fairly noisy evaluator can give the correct ranking with high probability on a few hundred to a few thousand examples.

The argument above works as long as the evaluator is not biased in a way that causes it to systematically favor the worse variant.

Formal argument and failure modes

This section formalizes the claim that the empirical evaluator means can recover the true ordering of agents given enough samples, and discusses when this can fail.

Let $x$ be an agent output, and let $S(x)$ be the true score for $x$, meaning the thing we’d ideally like to measure. Each agent gives rise to a distribution over trajectories $x$, which can differ in arbitrary ways. Maybe $A$ tends to produce long, detailed responses, while $B$ tends to be short and crisp. Or maybe $A$ is more prone to hallucination than $B$. The average scores $\mu_A$ and $\mu_B$ are expected values over those distributions, which we’ll denote with $\mathbb{E}_A$ and $\mathbb{E}_B$, so that

$$\mu_A = \mathbb{E}_A[S(x)], \qquad \mu_B = \mathbb{E}_B[S(x)].$$

Now let $V$ be an evaluator, so that $V(x)$ is the score the evaluator assigns to output $x$. Define a noise term $\varepsilon(x) := V(x) - S(x)$. Then we have

$$V(x) = S(x) + \varepsilon(x). \tag{1}$$

Under general sampling conditions, the empirical evaluator means converge to the true population evaluator means:

$$\widehat{\mu}_A \rightarrow \mathbb{E}_A[V(x)], \qquad \widehat{\mu}_B \rightarrow \mathbb{E}_B[V(x)].$$

Therefore, in order for the empirical means to lead us to choose the better agent with high probability, we need the expected evaluator scores to mirror the true ordering of agents. That is, we desire the following property to hold:

$$\mathbb{E}_A[V(x)] > \mathbb{E}_B[V(x)] \iff \mathbb{E}_A[S(x)] > \mathbb{E}_B[S(x)] \tag{2}$$

for all pairs of candidate agents $A$ and $B$. From Eq. (1) and the linearity of expectation, we have

$$\mathbb{E}_A[V(x)] = \mathbb{E}_A[S(x)] + \mathbb{E}_A[\varepsilon(x)], \qquad \mathbb{E}_B[V(x)] = \mathbb{E}_B[S(x)] + \mathbb{E}_B[\varepsilon(x)]$$

$$\implies \left(\mathbb{E}_A[V(x)] - \mathbb{E}_B[V(x)]\right) = \left(\mathbb{E}_A[S(x)] - \mathbb{E}_B[S(x)]\right) + \left(\mathbb{E}_A[\varepsilon(x)] - \mathbb{E}_B[\varepsilon(x)]\right)$$

From this, we see that Eq. (2) can be satisfied in many different ways. For example, if the noise terms average out to 0 for both distributions (a common assumption), then we’d have

$$\left(\mathbb{E}_A[V(x)] - \mathbb{E}_B[V(x)]\right) = \left(\mathbb{E}_A[S(x)] - \mathbb{E}_B[S(x)]\right),$$

from which Eq. (2) would immediately follow. Even if the noise doesn’t average out to 0, meaning the evaluator systematically under- or over-estimates the true scores, we’d still obtain the same result, as long as it averages out to the same value for both agents. More generally, Eq. (2) can hold as long as the per-agent noise gap $\mathbb{E}_A[\varepsilon(x)] - \mathbb{E}_B[\varepsilon(x)]$ doesn’t have both the wrong sign and a large enough magnitude to reverse the true-score gap $\mathbb{E}_A[S(x)] - \mathbb{E}_B[S(x)]$. Equivalently, the evaluator can have arbitrary per-agent biases, as long as those biases don’t favor the worse agent strongly enough to overturn its true disadvantage.

In practice, of course, we never observe the expectations $\mathbb{E}_A[V(x)]$ and $\mathbb{E}_B[V(x)]$ directly; we only have the empirical means $\widehat{\mu}_A, \widehat{\mu}_B$ computed over a finite evaluation dataset, and on any given draw those empirical means might disagree with the true ordering. The claim is that, when Eq. (2) holds and trajectories are sampled appropriately, the probability of disagreement vanishes as the evaluation dataset grows:

$$\mathbb{P}\left(\widehat{\mu}_A > \widehat{\mu}_B\right) \to 1 \quad \text{as} \quad N \to \infty \tag{3}$$

where $N$ is the size of the evaluation dataset, and assuming without loss of generality that $\mu_A > \mu_B$. In plain terms: with a large enough evaluation dataset, we will correctly identify the better agent with arbitrarily high probability, even when the evaluator is noisy on individual trajectories. We omit a formal proof here, but (3) can be shown to hold under typical sampling regimes, such as iid sampling, stationary ergodic processes, or other forms of sufficiently weak dependence.

Where this breaks down. The conditions above are sufficient for the empirical mean evaluator scores to recover the true ordering with high probability, but they aren’t always satisfied in practice. A few common failure modes:

  • Region-specific bias.10 If the evaluator’s bias varies across the score range (say, it gives polished-looking outputs extra credit regardless of correctness) and the agents under comparison concentrate their outputs in different regions, then $\mathbb{E}_A[\varepsilon(x)] \neq \mathbb{E}_B[\varepsilon(x)]$ and the per-agent noise gap can flip the sign of Eq. (2). More data causes the empirical means to converge to biased values rather than the truth.
  • Distribution shift between offline and online. The offline test set may not match the distribution the deployed agent encounters in production. If the evaluator’s noise behaves differently on those two distributions, an offline ranking won’t necessarily predict online behavior — even when the offline argument goes through cleanly.
  • Strong dependence or non-stationarity. The convergence claim (3) tolerates iid, ergodic, or weakly dependent sampling, but strongly correlated trajectories can prevent the empirical means from converging.
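The first failure mode is easy to demonstrate in the simulation setup from earlier: give the evaluator a constant bonus on one agent's outputs (a stand-in for a style bias concentrated where that agent's outputs live), and the evaluator's ranking flips even with effectively unlimited data. A sketch, with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000  # large sample: empirical means sit very close to population means

# Agent A is truly better on average...
s_a = rng.beta(5 * 0.6, 5 * 0.4, size=n)  # true scores, mean 0.6
s_b = rng.beta(5 * 0.3, 5 * 0.7, size=n)  # true scores, mean 0.3

# ...but the evaluator gives B's outputs a flat +0.35 style bonus.
bias_b = 0.35
v_a = np.clip(s_a + rng.normal(0, 0.25, n), 0, 1)
v_b = np.clip(s_b + bias_b + rng.normal(0, 0.25, n), 0, 1)

print(s_a.mean() > s_b.mean())  # True: A really is better
print(v_a.mean() > v_b.mean())  # False: the per-agent bias reverses the ranking
```

More data doesn't help here: both empirical means converge, but to biased targets whose ordering disagrees with the true one.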

These failure modes aren’t specific to offline agent selection: they affect any use of an evaluator, online or offline.

How this works in real benchmarks

To see this phenomenon in action with real evaluation data, we ran LLM-generated evaluators on five tasks: Gridworld, Wordle, Data Extraction (NER), Data Extraction (NDA), and Business Management. We evaluated 25 agent variants (different prompts and models) per task and 50 test traces per variant.

Each environment comes with a target metric that is computed programmatically and serves as the ground truth for any given trace: success or failure for Gridworld and Wordle, exact match against gold annotations for Data Extraction (NER), F1 score against gold annotations for Data Extraction (NDA), and number of subtasks completed for Business Management.

For each task we compute both correlations introduced above: the output-level correlation between an evaluator’s score on a single trace and that trace’s ground truth (holding the variant fixed), and the agent-level correlation between an evaluator’s mean over a variant’s traces and the variant’s ground-truth mean. The agent-level correlation exceeds the output-level correlation in every environment, often by a wide margin.
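Given a flat table of traces, both granularities are a few lines of NumPy. This sketch uses our own helper names and computes a raw trace-level correlation; the output-level numbers reported here additionally de-mean within each variant, as described in the methodology section below:

```python
import numpy as np

def two_level_correlations(eval_scores, truth_scores, variant_ids):
    """Trace-level: Pearson r across all (variant, trace) cells.
    Agent-level: Pearson r between per-variant mean scores."""
    e = np.asarray(eval_scores, dtype=float)
    t = np.asarray(truth_scores, dtype=float)
    v = np.asarray(variant_ids)
    trace_r = np.corrcoef(e, t)[0, 1]
    vids = np.unique(v)
    e_means = np.array([e[v == vid].mean() for vid in vids])
    t_means = np.array([t[v == vid].mean() for vid in vids])
    agent_r = np.corrcoef(e_means, t_means)[0, 1]
    return trace_r, agent_r

# Hypothetical data: 5 variants of increasing quality, 50 traces each,
# scored by a noisy evaluator.
rng = np.random.default_rng(0)
vids = np.repeat(np.arange(5), 50)
truth = rng.beta(5 * (0.3 + 0.1 * vids), 5 * (0.7 - 0.1 * vids))
evals = np.clip(truth + rng.normal(0, 0.25, truth.size), 0, 1)
trace_r, agent_r = two_level_correlations(evals, truth, vids)
print(trace_r, agent_r)  # agent-level is typically much higher
```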

Pearson correlation between evaluator score and ground truth, at two granularities, across five environments. The output-level correlation is consistently weaker than the agent-level correlation — in every environment, the evaluator is more reliable for ranking agent variants than for judging individual outputs.

For example, Wordle’s output-level correlation is 0.41 — the evaluator is only modestly better than random at predicting which of two Wordle traces is better, holding the variant fixed. Its agent-level correlation is 0.96 — averaging across many traces per variant compresses the per-output noise into a much stronger signal of agent quality.

To further quantify the evaluator’s alignment with agent quality, we ask: when the evaluator compares two variants, how often does it pick the better one? That’s the evaluator’s pairwise win rate: the fraction of variant pairs where the evaluator’s mean ordering agrees with the ground-truth ordering. A win rate of 1.0 means perfect ranking; 0.5 is no better than random. We computed the win rate across all $\binom{25}{2} = 300$ variant pairs for each environment:
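The win-rate computation itself is tiny. A sketch (the helper name is ours; exact ties in the means are not treated specially):

```python
from itertools import combinations

def pairwise_win_rate(eval_means, truth_means):
    """Fraction of variant pairs where the ordering by evaluator mean
    agrees with the ordering by ground-truth mean."""
    pairs = list(combinations(range(len(eval_means)), 2))
    wins = sum(
        (eval_means[i] > eval_means[j]) == (truth_means[i] > truth_means[j])
        for i, j in pairs
    )
    return wins / len(pairs)

# Hypothetical means for 4 variants: the evaluator mis-orders one pair
print(pairwise_win_rate([0.7, 0.5, 0.6, 0.4], [0.8, 0.6, 0.5, 0.3]))  # 5/6
```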

Pairwise win rate of the evaluator on each environment: out of all variant pairs, the fraction where the evaluator-mean ordering matches the ground-truth-mean ordering, both computed over all available traces per variant. Equivalent to the area under the curve (AUC) of an "is variant A better than B" classifier; random selection scores 0.5 (dashed line), perfect selection 1.0.

Every evaluator we tested clears 0.5 (random) by a comfortable margin. Gridworld is nearly perfect: the evaluator picks the better of two variants 97% of the time. Wordle and Data Extraction (NER) are also very high, at 0.87 and 0.82, respectively. The same Wordle evaluator that you wouldn’t trust to gate individual outputs in production can reliably tell you which Wordle agent to ship. NDA and Business Management both have a win rate of 0.64 — less reliable than the others, but still meaningfully better than coin-flipping. Used as a selection signal, all of these evaluators will move you in the right direction on average, despite their very low utility for judging individual traces.

Benchmark setup and methodology

Agent and evaluator generation. Each task’s 25 agent variants differ in their system prompt and the underlying model they use. Both the agents and evaluators were LLM-generated: the LLM was told that each evaluator should be as highly correlated as possible with the task’s ground-truth metric, and was given access to a training set when generating the evaluators. We used small models and didn’t aim to optimize either the agents or the evaluators — the point isn’t to show how well an LLM can perform at these tasks, but to illustrate how even noisy evaluators can be useful for agent selection.

Pearson vs Spearman. The numbers above are Pearson correlations, which capture linear association — a good fit for the output-level use case (e.g. guardrails), where the magnitudes of evaluator scores feed directly into downstream decisions. Spearman correlations capture rank association and are a more natural fit for the variant selection use case, where what matters is whether the evaluator’s ranking matches the ground-truth ranking. The two metrics agree qualitatively in every environment we measured (both increase from output-level to agent-level, both consistent with the central claim) and are mathematically equivalent in the special case that both variables are binary.
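The binary-binary equivalence is quick to check numerically: Spearman is just Pearson computed on tie-averaged ranks, and for a binary variable those ranks are an affine function of the values, which leaves Pearson unchanged. A self-contained check (the rank helper is ours):

```python
import numpy as np

def average_ranks(x):
    """Ranks 1..n, with tied values assigned the mean of their ranks."""
    x = np.asarray(x)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for val in np.unique(x):
        ranks[x == val] = ranks[x == val].mean()
    return ranks

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    # Spearman rank correlation = Pearson on tie-averaged ranks
    return pearson(average_ranks(x), average_ranks(y))

x = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])
print(np.isclose(pearson(x, y), spearman(x, y)))  # True for binary inputs
```

For non-binary data the two can diverge: a perfectly monotone but nonlinear relationship gives Spearman 1 with Pearson below 1.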

Task                     Output-level: Pearson / Spearman    Agent-level: Pearson / Spearman
Gridworld                0.81 / 0.81                         0.97 / 0.98
Wordle                   0.41 / 0.38                         0.96 / 0.88
Data Extraction (NER)    0.08 / 0.27                         0.75 / 0.79
Data Extraction (NDA)    0.28 / 0.24                         0.43 / 0.38
Business Management      0.22 / 0.19                         0.50 / 0.45

The largest divergence between the two metrics is on the Data Extraction (NER) task’s output-level cell, where ground truth is binary (exact match): the Pearson value (0.08) is low because the evaluator’s discrete partial-credit scores have little linear association with the binary outcome, even though their rank ordering still carries some signal (Spearman 0.27).

Within-variant decomposition by de-meaning. The “output-level” bars in the chart above are not raw per-(variant, trace) correlations; those would mix two distinct effects: a between-variant effect (the evaluator agrees with ground truth on which variants are good on average) and a within-variant effect (the evaluator agrees with ground truth on which traces within a single variant are good). To isolate the within-variant effect, for each variant we subtract that variant’s mean from both the evaluator score and the ground-truth score, then compute the correlation across all (variant, trace) cells on the de-meaned values. What’s left is the part of the evaluator-vs-ground-truth signal that is not explained by knowing which variant produced the output — i.e., the signal an evaluator would need to reliably grade individual outputs in production.
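In code, the de-meaning step is a few lines. A sketch, assuming flat arrays of scores plus a parallel array of variant ids (names are ours):

```python
import numpy as np

def within_variant_corr(eval_scores, truth_scores, variant_ids):
    """Pearson r between evaluator and ground-truth scores after
    subtracting each variant's mean from both (the within-variant signal)."""
    e = np.array(eval_scores, dtype=float)  # np.array copies, so inputs aren't mutated
    t = np.array(truth_scores, dtype=float)
    v = np.asarray(variant_ids)
    for vid in np.unique(v):
        m = v == vid
        e[m] -= e[m].mean()
        t[m] -= t[m].mean()
    return np.corrcoef(e, t)[0, 1]

# If the evaluator tracks truth perfectly within each variant, r = 1
# even when it assigns each variant an arbitrary offset:
truth = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
evals = [0.4, 0.5, 0.6, 0.2, 0.3, 0.4]  # truth plus a per-variant offset
print(within_variant_corr(evals, truth, [0, 0, 0, 1, 1, 1]))  # ~1.0
```

Conversely, an evaluator that only knows which variant produced an output (constant score per variant) has no within-variant signal left after de-meaning.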

The Takeaway

Noisy evaluators can’t reliably judge individual agent outputs, but they can reliably distinguish overall agent performance, because the per-output noise averages out across many samples.

Per-output unreliability is what limits noisy evaluators for typical production tasks (e.g. guardrails), all of which hinge on trusting the verdict on any specific output. Reliability at the aggregate level is what makes them useful offline: on every environment we tested, the evaluator picked the better of two variants substantially more often than not. Used as a selection signal, even noisy evaluators can help you ship better-performing agents today and improve them over time.

Start building today. Check out our GitHub and Quick Start.
