The configuration for TensorZero Evaluations should go in the same tensorzero.toml file as the rest of your TensorZero configuration.
[evaluations.evaluation_name]
The evaluations sub-section of the config file defines the behavior of an evaluation in TensorZero.
You can define multiple evaluations by including multiple [evaluations.evaluation_name] sections.
If your evaluation_name is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define an evaluation named foo.bar as [evaluations."foo.bar"].
type
- Type: Literal "static" (we may add other options here later on)
- Required: yes
function_name
- Type: string
- Required: yes
This should be the name of a function defined in the [functions] section of the gateway config.
This value sets which function this evaluation should evaluate when run.
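As a minimal sketch of the section described above, the fragment below defines two evaluations in tensorzero.toml. The function name draft_email is hypothetical and is assumed to be defined elsewhere under [functions.draft_email]; the evaluation names are illustrative.

```toml
# tensorzero.toml (fragment)

# An evaluation named "email-guardrails" that evaluates the hypothetical
# function "draft_email" (assumed to be defined under [functions.draft_email]).
[evaluations.email-guardrails]
type = "static"
function_name = "draft_email"

# An evaluation whose name contains a period, so the name is quoted.
[evaluations."foo.bar"]
type = "static"
function_name = "draft_email"
```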
[evaluations.evaluation_name.evaluators.evaluator_name]
The evaluators sub-section defines the behavior of a particular evaluator that will be run as part of its parent evaluation.
You can define multiple evaluators by including multiple [evaluations.evaluation_name.evaluators.evaluator_name] sections.
If your evaluator_name is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define an evaluator named includes.jpg as [evaluations.evaluation_name.evaluators."includes.jpg"].
type
- Type: string
- Required: yes
| Type | Description |
|---|---|
| llm_judge | Uses a TensorZero function as a judge. |
| exact_match | Evaluates whether the generated output exactly matches the reference output (skips the datapoint if unavailable). |
type: "exact_match"
cutoff
- Type: float
- Required: no
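A minimal sketch of an exact_match evaluator, assuming an evaluation named email-guardrails is already defined. The evaluator name matches_reference and the cutoff value are illustrative, not recommendations.

```toml
# An exact_match evaluator attached to the "email-guardrails" evaluation.
# "matches_reference" is a hypothetical evaluator name.
[evaluations.email-guardrails.evaluators.matches_reference]
type = "exact_match"
cutoff = 0.6  # optional; illustrative value
```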
type: "llm_judge"
input_format
- Type: string
- Required: no (default: serialized)
- serialized: Passes the input messages, generated output, and reference output (if included) as a single serialized string.
- messages: Passes the input messages, generated output, and reference output (if included) as distinct messages in the conversation history.
We only support evaluations with image data when input_format is set to messages.
output_type
- Type: string
- Required: yes
- float: The judge is expected to return a floating-point number.
- boolean: The judge is expected to return a boolean value.
include.reference_output
- Type: boolean
- Required: no (default: false)
If true, the reference output associated with the evaluation datapoint will be included in the input provided to the LLM judge.
In these cases, the evaluation run will not run this evaluator for datapoints where there is no reference output.
optimize
- Type: string
- Required: yes
- max: Higher values are better.
- min: Lower values are better.
cutoff
- Type: float
- Required: no
If the evaluator's result is below the cutoff (when optimize is max) or above the cutoff (when optimize is min), the evaluations binary will return a nonzero status code.
[evaluations.evaluation_name.evaluators.evaluator_name.variants.variant_name]
An LLM Judge evaluator defines a TensorZero function that is used to judge the output of another TensorZero function.
Therefore, all the variant types that are available for a normal TensorZero function are also available for LLMs as judges, including all of our inference-time optimizations.
You can include a standard variant configuration in this block, with two modifications:
- Instead of assigning weight to each variant, you simply mark a single variant as active.
- For chat_completion variants, instead of a system_template we require system_instructions as a text file and take no other templates.
active
- Type: boolean
- Required: Defaults to true if there is a single variant configured. Otherwise, this field is required to be set to true for exactly one variant.
system_instructions
- Type: string (path)
- Required: yes
The judge's output takes the form {"thinking": "<thinking>", "score": <float or boolean>}, where the type of score is configured to the output_type of the evaluator.
Example file path: evaluations/email-guardrails/check-signature/claude_35_sonnet/system_instructions.txt
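Putting the llm_judge fields together, a configuration might look like the sketch below. The evaluation, evaluator, and variant names are taken from the example path above; the function name draft_email, the model identifier, the cutoff value, and the leading ./ on the path are assumptions for illustration only.

```toml
# Parent evaluation; "draft_email" is a hypothetical function name.
[evaluations.email-guardrails]
type = "static"
function_name = "draft_email"

# LLM judge evaluator returning a boolean, where true is the better outcome.
[evaluations.email-guardrails.evaluators.check-signature]
type = "llm_judge"
input_format = "serialized"       # default; use "messages" for image data
output_type = "boolean"
include = { reference_output = false }
optimize = "max"
cutoff = 0.9                      # optional; illustrative value

# The single judge variant. With only one variant configured, active
# defaults to true; it is set explicitly here for clarity.
[evaluations.email-guardrails.evaluators.check-signature.variants.claude_35_sonnet]
type = "chat_completion"
active = true
model = "anthropic::claude-3-5-sonnet-20241022"  # assumed model identifier
system_instructions = "./evaluations/email-guardrails/check-signature/claude_35_sonnet/system_instructions.txt"
```

Note that the variant uses system_instructions rather than system_template and takes no weight, in line with the two modifications listed for judge variants.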