[evaluations.evaluation_name]
The evaluations sub-section of the config file defines the behavior of an evaluation in TensorZero.
You can define multiple evaluations by including multiple [evaluations.evaluation_name] sections.
If your evaluation_name is not a valid TOML bare key, wrap it in quotation marks.
For example, periods are not allowed in bare keys, so you can define an evaluation named foo.bar as [evaluations."foo.bar"].
tensorzero.toml
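The fields documented below fit together roughly as follows. This is a minimal sketch: the evaluation name email-guardrails is borrowed from a path that appears later on this page, while the function name draft_email and the description text are hypothetical placeholders.

```toml
# Sketch of an evaluation block; draft_email is a hypothetical function name
# that would need a matching entry in the [functions] section.
[evaluations.email-guardrails]
type = "inference"
function_name = "draft_email"
description = "Checks drafted emails against our guardrails."  # optional

# A name containing a period must be quoted.
[evaluations."foo.bar"]
type = "inference"
function_name = "draft_email"
```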
type
- Type: Literal "inference" (we may add other options here later on)
- Required: yes
function_name
- Type: string
- Required: yes
This must be the name of a function defined in the [functions] section of the gateway config.
This value sets which function this evaluation should evaluate when run.
description
- Type: string
- Required: no
[evaluations.evaluation_name.evaluators.evaluator_name]
The evaluators sub-section defines the behavior of a particular evaluator that will be run as part of its parent evaluation.
You can define multiple evaluators by including multiple [evaluations.evaluation_name.evaluators.evaluator_name] sections.
If your evaluator_name is not a valid TOML bare key, wrap it in quotation marks.
For example, periods are not allowed in bare keys, so you can define an evaluator named includes.jpg as [evaluations.evaluation_name.evaluators."includes.jpg"].
tensorzero.toml
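As an illustration of the naming rules above, the following sketch defines two evaluators under one evaluation. The evaluator name matches-reference and the regex pattern are hypothetical; the evaluator types are described in the sections that follow.

```toml
# Hypothetical evaluator with a plain bare-key name.
[evaluations.email-guardrails.evaluators.matches-reference]
type = "exact_match"

# The period in the name forces the key to be quoted.
[evaluations.email-guardrails.evaluators."includes.jpg"]
type = "regex"
must_match = '\.jpg'  # hypothetical pattern; single quotes avoid escaping the backslash
```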
type
- Type: string
- Required: yes
Evaluator Types
TensorZero supports the following evaluator types:

| Type | Description |
|---|---|
| exact_match | Evaluates whether the generated output exactly matches the reference output (skips the datapoint if unavailable). |
| llm_judge | Uses a TensorZero function as a judge to evaluate outputs. |
| regex | Checks whether the generated text output matches (or doesn’t match) a regex pattern. |
| tool_use | Checks whether the inference’s tool calls match the expected behavior. |
exact_match
Evaluates whether the generated output exactly matches the reference output.
If the datapoint has no reference output, this evaluator is skipped.
- Metric type: Boolean
- Optimize: Max
- Requires reference output: yes
tensorzero.toml
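Since exact_match documents no options beyond type, a configuration for it can be very short. A sketch, assuming the hypothetical evaluator name matches-reference:

```toml
# matches-reference is a hypothetical evaluator name.
[evaluations.email-guardrails.evaluators.matches-reference]
type = "exact_match"
```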
llm_judge
Uses a TensorZero function as a judge to evaluate outputs.
This is the most flexible evaluator type, supporting both float and boolean output types with configurable optimization direction.
- Metric type: Configurable (float or boolean)
- Optimize: Configurable (max or min)
- Requires reference output: optional (configurable)
input_format
- Type: string
- Required: no (default: serialized)

- serialized: Passes the input messages, generated output, and reference output (if included) as a single serialized string.
- messages: Passes the input messages, generated output, and reference output (if included) as distinct messages in the conversation history.
tensorzero.toml
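To show where input_format sits, here is a sketch of an llm_judge evaluator that opts into the messages format. The evaluator name check-signature comes from a path elsewhere on this page; the variant configuration the judge also needs is omitted here.

```toml
[evaluations.email-guardrails.evaluators.check-signature]
type = "llm_judge"
input_format = "messages"  # omit to use the default, "serialized"
output_type = "boolean"
optimize = "max"
# Variant configuration for the judge is omitted in this sketch.
```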
output_type
- Type: string
- Required: yes
- float: The judge is expected to return a floating-point number.
- boolean: The judge is expected to return a boolean value.
tensorzero.toml
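For comparison with a boolean judge, a float-valued judge might be declared as in the sketch below. The evaluator name tone-score is a hypothetical placeholder.

```toml
# tone-score is a hypothetical evaluator name.
[evaluations.email-guardrails.evaluators.tone-score]
type = "llm_judge"
output_type = "float"  # the judge returns a number rather than true/false
optimize = "max"
```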
include.reference_output
- Type: boolean
- Required: no (default: false)
If true, the reference output associated with the evaluation datapoint will be included in the input provided to the LLM judge.
In these cases, the evaluation run will not run this evaluator for datapoints where there is no reference output.
tensorzero.toml
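A sketch of enabling this option, written as a TOML dotted key to mirror the field name above. The equivalent inline-table form, include = { reference_output = true }, parses to the same structure.

```toml
[evaluations.email-guardrails.evaluators.check-signature]
type = "llm_judge"
output_type = "boolean"
optimize = "max"
include.reference_output = true  # datapoints without a reference output are skipped
```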
optimize
- Type: string
- Required: yes
- max: Higher values are better.
- min: Lower values are better.
tensorzero.toml
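The min direction suits judges that count or score something undesirable. A sketch, assuming a hypothetical evaluator named typo-count:

```toml
# typo-count is a hypothetical evaluator name.
[evaluations.email-guardrails.evaluators.typo-count]
type = "llm_judge"
output_type = "float"
optimize = "min"  # fewer typos is better
```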
description
- Type: string
- Required: no
[evaluations.evaluation_name.evaluators.evaluator_name.variants.variant_name]
An LLM Judge evaluator defines a TensorZero function that is used to judge the output of another TensorZero function.
Therefore, all the variant types that are available for a normal TensorZero function are also available for LLMs as judges, including all of our inference-time optimizations.
You can include a standard variant configuration in this block, with two modifications:
- You must mark a single variant as active.
- For chat_completion variants, instead of a system_template we require system_instructions as a text file and take no other templates.
tensorzero.toml
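Putting the two modifications together, a judge variant might look like the sketch below. The variant name and the system_instructions path follow the example path shown later on this page; the model identifier is an assumption and should be replaced with a model available in your own gateway config.

```toml
[evaluations.email-guardrails.evaluators.check-signature.variants.claude_sonnet_4_5]
type = "chat_completion"
model = "anthropic::claude-sonnet-4-5"  # hypothetical model identifier
# Replaces system_template for judge variants; no other templates are accepted.
system_instructions = "evaluations/email-guardrails/check-signature/claude_sonnet_4_5/system_instructions.txt"
active = true
```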
active
- Type: boolean
- Required: Defaults to true if there is a single variant configured. Otherwise, this field is required to be set to true for exactly one variant.
tensorzero.toml
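When more than one judge variant is configured, exactly one carries active = true. In the sketch below, the second variant name, both model identifiers, and the second instructions path are hypothetical, and leaving active unset on the second variant is an assumption about how non-active variants are written.

```toml
[evaluations.email-guardrails.evaluators.check-signature.variants.claude_sonnet_4_5]
type = "chat_completion"
model = "anthropic::claude-sonnet-4-5"  # hypothetical model identifier
system_instructions = "evaluations/email-guardrails/check-signature/claude_sonnet_4_5/system_instructions.txt"
active = true  # the one variant marked active

# backup_judge is a hypothetical second variant; it is not marked active.
[evaluations.email-guardrails.evaluators.check-signature.variants.backup_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"  # hypothetical model identifier
system_instructions = "evaluations/email-guardrails/check-signature/backup_judge/system_instructions.txt"
```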
system_instructions
- Type: string (path)
- Required: yes
The judge is expected to return output of the form {"thinking": "<thinking>", "score": <float or boolean>}, with the score type configured to the output_type of the evaluator.
evaluations/email-guardrails/check-signature/claude_sonnet_4_5/system_instructions.txt
tensorzero.toml
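A sketch of how the instructions file named above might be referenced from the variant block. Only the system_instructions line is shown; the other variant fields are omitted.

```toml
[evaluations.email-guardrails.evaluators.check-signature.variants.claude_sonnet_4_5]
# Other variant fields are omitted in this sketch.
system_instructions = "evaluations/email-guardrails/check-signature/claude_sonnet_4_5/system_instructions.txt"
```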
regex
Checks whether the inference’s text output matches (or doesn’t match) a regex pattern.
This is useful for content filtering (e.g. profanity detection) and format validation (e.g. ensuring the model includes a specific phrase).
For Chat responses, all text content blocks are concatenated.
For JSON responses, the raw JSON string is used.
- Metric type: Boolean
- Optimize: Max
- Requires reference output: no
At least one of must_match or must_not_match must be specified.
If both are specified, the result is the logical AND: must_match matches AND must_not_match does not match.
must_match
- Type: string (regex pattern)
- Required: no (but at least one of must_match or must_not_match must be specified)
Use (?i) for case-insensitive matching.
must_not_match
- Type: string (regex pattern)
- Required: no (but at least one of must_match or must_not_match must be specified)
Use (?i) for case-insensitive matching.
tensorzero.toml
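The two patterns can be combined in one evaluator, as in the sketch below. The evaluator name and both patterns are hypothetical; TOML literal strings (single quotes) are used so that backslashes in the patterns need no escaping.

```toml
# polite-sign-off is a hypothetical evaluator name.
[evaluations.email-guardrails.evaluators.polite-sign-off]
type = "regex"
must_match = '(?i)best regards'  # output must contain this phrase, in any case
must_not_match = '(?i)\bdamn\b'  # output must not contain this word
```

With both fields set, the evaluator passes only when the first pattern matches and the second does not.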
tool_use
Checks whether the inference’s tool calls match the expected behavior.
This evaluator only supports Chat inferences (tool calls do not exist in JSON inferences).
- Metric type: Boolean
- Optimize: Max
- Requires reference output: no
behavior
- Type: string
- Required: yes
| Behavior | Description |
|---|---|
| none | The inference must not contain any tool calls. |
| any | The inference must contain at least one tool call (any tool). |
| none_of | None of the listed tools may appear in the inference’s tool calls. |
| any_of | At least one of the listed tools must appear in the inference’s tool calls. |
| all_of | All of the listed tools must appear in the inference’s tool calls. |
tools
- Type: list of strings
- Required: Required for none_of, any_of, and all_of behaviors. Must not be specified for none and any.
tensorzero.toml
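The sketch below contrasts a behavior that takes a tools list with one that must not. The evaluator names and tool names are hypothetical placeholders.

```toml
# Passes if at least one of the listed (hypothetical) tools is called.
[evaluations.email-guardrails.evaluators.uses-lookup-tool]
type = "tool_use"
behavior = "any_of"
tools = ["lookup_contact", "search_directory"]

# Passes only if the inference makes no tool calls; tools must not be set here.
[evaluations.email-guardrails.evaluators.no-tool-calls]
type = "tool_use"
behavior = "none"
```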