The configuration for TensorZero Evaluations should go in the same tensorzero.toml file as the rest of your TensorZero configuration.

[evaluations.evaluation_name]

The evaluations sub-section of the config file defines the behavior of an evaluation in TensorZero. You can define multiple evaluations by including multiple [evaluations.evaluation_name] sections. If your evaluation_name is not a basic string, it can be escaped with quotation marks. For example, periods are not allowed in basic strings, so you can define an evaluation named foo.bar as [evaluations."foo.bar"].
tensorzero.toml
[evaluations.email-guardrails]
# ...

type

  • Type: Literal "inference" (we may add other options here later on)
  • Required: yes

function_name

  • Type: string
  • Required: yes
This should be the name of a function defined in the [functions] section of the gateway config. This value sets which function this evaluation should evaluate when run.

description

  • Type: string
  • Required: no
An optional description of the evaluation. It helps TensorZero Autopilot understand the intent behind the evaluation.
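Putting the top-level fields together, a minimal evaluation section might look like the following sketch (the function name draft-email and the description text are illustrative; the function must exist in your [functions] section):

```toml
[evaluations.email-guardrails]
type = "inference"
function_name = "draft-email"  # must match a function defined in [functions]
description = "Checks that generated emails follow our guardrails (e.g. signatures, attachments)."
```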

[evaluations.evaluation_name.evaluators.evaluator_name]

The evaluators sub-section defines the behavior of a particular evaluator that will be run as part of its parent evaluation. You can define multiple evaluators by including multiple [evaluations.evaluation_name.evaluators.evaluator_name] sections. If your evaluator_name is not a basic string, it can be escaped with quotation marks. For example, periods are not allowed in basic strings, so you can define includes.jpg as [evaluations.evaluation_name.evaluators."includes.jpg"].
tensorzero.toml
[evaluations.email-guardrails]
# ...

[evaluations.email-guardrails.evaluators."includes.jpg"]
# ...

[evaluations.email-guardrails.evaluators.check-signature]
# ...

type

  • Type: string
  • Required: yes
Defines the type of the evaluator. See Evaluator Types below.

Evaluator Types

TensorZero supports the following evaluator types:
  • exact_match: Evaluates whether the generated output exactly matches the reference output (skips the datapoint if unavailable).
  • llm_judge: Uses a TensorZero function as a judge to evaluate outputs.
  • regex: Checks whether the generated text output matches (or doesn’t match) a regex pattern.
  • tool_use: Checks whether the inference’s tool calls match the expected behavior.

exact_match

Evaluates whether the generated output exactly matches the reference output. If the datapoint has no reference output, this evaluator is skipped.
  • Metric type: Boolean
  • Optimize: Max
  • Requires reference output: yes
tensorzero.toml
[evaluations.email-guardrails.evaluators.check-exact]
type = "exact_match"

llm_judge

Uses a TensorZero function as a judge to evaluate outputs. This is the most flexible evaluator type, supporting both float and boolean output types with configurable optimization direction.
  • Metric type: Configurable (float or boolean)
  • Optimize: Configurable (max or min)
  • Requires reference output: optional (configurable)
input_format
  • Type: string
  • Required: no (default: serialized)
Defines the format of the input provided to the LLM judge.
  • serialized: Passes the input messages, generated output, and reference output (if included) as a single serialized string.
  • messages: Passes the input messages, generated output, and reference output (if included) as distinct messages in the conversation history.
We only support evaluations with image data when input_format is set to messages.
tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
input_format = "messages"
# ...
output_type
  • Type: string
  • Required: yes
Defines the expected data type of the evaluation result from the LLM judge.
  • float: The judge is expected to return a floating-point number.
  • boolean: The judge is expected to return a boolean value.
tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
output_type = "float"
# ...
include.reference_output
  • Type: boolean
  • Required: no (default: false)
If set to true, the reference output associated with the evaluation datapoint will be included in the input provided to the LLM judge. In these cases, the evaluation run will not run this evaluator for datapoints where there is no reference output.
tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
include = { reference_output = true }
# ...
optimize
  • Type: string
  • Required: yes
Defines whether the metric produced by the LLM judge should be maximized or minimized.
  • max: Higher values are better.
  • min: Lower values are better.
tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"
# ...
description
  • Type: string
  • Required: no
An optional description of the evaluator. It helps TensorZero Autopilot understand the intent behind the evaluator.
[evaluations.evaluation_name.evaluators.evaluator_name.variants.variant_name]
An LLM judge evaluator defines a TensorZero function that is used to judge the output of another TensorZero function. Therefore, all the variant types that are available for a normal TensorZero function are also available for LLM judges, including all of our inference-time optimizations. You can include a standard variant configuration in this block, with two modifications:
  • You must mark a single variant as active.
  • For chat_completion variants, instead of a system_template we require system_instructions as a text file and take no other templates.
Here we list only the configuration for variants that differs from the configuration for a normal TensorZero function. Please refer to the variant configuration reference for the remaining options.
tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"

[evaluations.email-guardrails.evaluators.check-signature.variants.claude_sonnet_4_5]
type = "chat_completion"
model = "anthropic::claude-sonnet-4-5"
temperature = 0.1
system_instructions = "./evaluations/email-guardrails/check-signature/system_instructions.txt"
# ... other chat completion configuration ...

[evaluations.email-guardrails.evaluators.check-signature.variants.mix_of_3]
active = true  # if we run the `email-guardrails` evaluation, this is the variant we'll use for the check-signature evaluator
type = "experimental_mixture_of_n"
candidates = ["claude_sonnet_4_5", "claude_sonnet_4_5", "claude_sonnet_4_5"]
active
  • Type: boolean
  • Required: Defaults to true if there is a single variant configured. Otherwise, this field is required to be set to true for exactly one variant.
Sets which of the variants should be used for evaluation runs.
tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...

[evaluations.email-guardrails.evaluators.check-signature.variants.mix_of_3]
active = true # if we run the `email-guardrails` evaluation, this is the variant we'll use for the check-signature evaluator
type = "experimental_mixture_of_n"
# ...
system_instructions
  • Type: string (path)
  • Required: yes
Defines the path to the system instructions file. The path is relative to the configuration file. The file should be a plain text file containing the system instructions for the LLM judge. These instructions should tell the judge to output a float or boolean value. We use JSON mode to enforce that the judge returns a JSON object of the form {"thinking": "<thinking>", "score": <float or boolean>}, where the type of score matches the output_type of the evaluator.
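For example, with output_type = "boolean", the judge would be expected to return a JSON object shaped like the following (the thinking text is illustrative):

```json
{
  "thinking": "The email ends with a sign-off and the sender's name, so a signature is present.",
  "score": true
}
```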
evaluations/email-guardrails/check-signature/claude_sonnet_4_5/system_instructions.txt
Evaluate whether the email ends with a complete signature block, including a sign-off and the sender's name. Verify only the presence and structure of the signature without making assumptions about the rest of the email's content.
tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
system_instructions = "./evaluations/email-guardrails/check-signature/claude_sonnet_4_5/system_instructions.txt"
# ...

regex

Checks whether the inference’s text output matches (or doesn’t match) a regex pattern. This is useful for content filtering (e.g. profanity detection) and format validation (e.g. ensuring the model includes a specific phrase). For Chat responses, all text content blocks are concatenated; for JSON responses, the raw JSON string is used.
  • Metric type: Boolean
  • Optimize: Max
  • Requires reference output: no
At least one of must_match or must_not_match must be specified. If both are specified, the result is the logical AND: must_match matches AND must_not_match does not match.
must_match
  • Type: string (regex pattern)
  • Required: no (but at least one of must_match or must_not_match must be specified)
A regex pattern that the inference output must match for the evaluation to pass. Use inline flags like (?i) for case-insensitive matching.
must_not_match
  • Type: string (regex pattern)
  • Required: no (but at least one of must_match or must_not_match must be specified)
A regex pattern that the inference output must not match for the evaluation to pass. Use inline flags like (?i) for case-insensitive matching.
tensorzero.toml
# Ensure the model says "please" (case-insensitive)
[evaluations.politeness.evaluators.says-please]
type = "regex"
must_match = "(?i)please"

# Ensure the model doesn't use profanity
[evaluations.content-filter.evaluators.no-profanity]
type = "regex"
must_not_match = "(?i)(profanity|badword)"

# Both conditions: must say "please" and must not use profanity
[evaluations.tone-check.evaluators.polite-and-clean]
type = "regex"
must_match = "(?i)please"
must_not_match = "(?i)(profanity|badword)"

tool_use

Checks whether the inference’s tool calls match the expected behavior. This evaluator only supports Chat inferences (tool calls do not exist in JSON inferences).
  • Metric type: Boolean
  • Optimize: Max
  • Requires reference output: no
behavior
  • Type: string
  • Required: yes
The matching rule to apply to the inference’s tool calls.
  • none: The inference must not contain any tool calls.
  • any: The inference must contain at least one tool call (any tool).
  • none_of: None of the listed tools may appear in the inference’s tool calls.
  • any_of: At least one of the listed tools must appear in the inference’s tool calls.
  • all_of: All of the listed tools must appear in the inference’s tool calls.
tools
  • Type: list of strings
  • Required: Required for none_of, any_of, and all_of behaviors. Must not be specified for none and any.
The tool names to match against.
tensorzero.toml
# Ensure the model doesn't call any tools
[evaluations.no-tools.evaluators.no-tool-calls]
type = "tool_use"
behavior = "none"

# Ensure the model calls a specific tool
[evaluations.tool-check.evaluators.must-use-search]
type = "tool_use"
behavior = "any_of"
tools = ["search", "web_search"]

# Ensure the model calls all required tools
[evaluations.tool-check.evaluators.must-use-all]
type = "tool_use"
behavior = "all_of"
tools = ["search", "summarize"]