The configuration for TensorZero Evaluations should go in the same tensorzero.toml file as the rest of your TensorZero configuration.
[evaluations.evaluation_name]
The evaluations sub-section of the config file defines the behavior of an evaluation in TensorZero.
You can define multiple evaluations by including multiple [evaluations.evaluation_name] sections.
If your evaluation_name is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define an evaluation named foo.bar as [evaluations."foo.bar"].
type
- Type: Literal "static" (we may add other options here later on)
- Required: yes
function_name
- Type: string
- Required: yes
The name of a function defined in the [functions] section of the gateway config.
This value sets which function this evaluation should evaluate when run.
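As a minimal sketch of the two fields above, an evaluation block might look like the following. The evaluation name email-guardrails and the function name draft_email are placeholders chosen for illustration; the function is assumed to be defined elsewhere in the same file.

```toml
# Hypothetical names: "email-guardrails" and "draft_email" are placeholders.
[evaluations.email-guardrails]
type = "static"
function_name = "draft_email"  # assumed to be defined under [functions.draft_email]

# An evaluation name containing a period must be quoted.
[evaluations."foo.bar"]
type = "static"
function_name = "draft_email"
```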
[evaluations.evaluation_name.evaluators.evaluator_name]
The evaluators sub-section defines the behavior of a particular evaluator that will be run as part of its parent evaluation.
You can define multiple evaluators by including multiple [evaluations.evaluation_name.evaluators.evaluator_name] sections.
If your evaluator_name is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define an evaluator named includes.jpg as [evaluations.evaluation_name.evaluators."includes.jpg"].
type
- Type: string
- Required: yes
Type | Description |
---|---|
llm_judge | Use a TensorZero function as a judge |
exact_match | Evaluates whether the generated output exactly matches the reference output (skips the datapoint if unavailable). |
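To illustrate how evaluator sections nest under their parent evaluation, here is a small sketch with two evaluators of type exact_match. The evaluation name email-guardrails and the evaluator name matches-reference are hypothetical.

```toml
# Hypothetical names; both evaluators run as part of the same evaluation.
[evaluations.email-guardrails.evaluators.matches-reference]
type = "exact_match"

# An evaluator name containing a period must be quoted.
[evaluations.email-guardrails.evaluators."includes.jpg"]
type = "exact_match"
```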
type: "exact_match"
cutoff
- Type: float
- Required: no
type: "llm_judge"
input_format
- Type: string
- Required: no (default: serialized)
serialized: Passes the input messages, generated output, and reference output (if included) as a single serialized string.
messages: Passes the input messages, generated output, and reference output (if included) as distinct messages in the conversation history.
We only support evaluations with image data when input_format is set to messages.
output_type
- Type: string
- Required: yes
float: The judge is expected to return a floating-point number.
boolean: The judge is expected to return a boolean value.
include.reference_output
- Type: boolean
- Required: no (default: false)
If set to true, the reference output associated with the evaluation datapoint will be included in the input provided to the LLM judge.
In these cases, the evaluation run will not run this evaluator for datapoints where there is no reference output.
optimize
- Type: string
- Required: yes
max: Higher values are better.
min: Lower values are better.
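The llm_judge fields can be combined as in the sketch below. The evaluator name politeness and all values are illustrative assumptions; the inline-table form of include is simply the standard TOML equivalent of the dotted key include.reference_output.

```toml
# Hypothetical evaluation and evaluator names; values are illustrative only.
[evaluations.email-guardrails.evaluators.politeness]
type = "llm_judge"
input_format = "serialized"            # default; "messages" is needed for image data
output_type = "float"                  # or "boolean"
include = { reference_output = true }  # skips datapoints with no reference output
optimize = "max"                       # higher scores are better
cutoff = 0.8                           # optional; with optimize = "max", a result below
                                       # this makes the evaluations binary exit nonzero
```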
cutoff
- Type: float
- Required: no
If the evaluator's result is below the cutoff (when optimize is max) or above the cutoff (when optimize is min), the evaluations binary will return a nonzero status code.
[evaluations.evaluation_name.evaluators.evaluator_name.variants.variant_name]
An LLM Judge evaluator defines a TensorZero function that is used to judge the output of another TensorZero function.
Therefore, all the variant types that are available for a normal TensorZero function are also available for LLMs as judges, including all of our inference-time optimizations.
You can include a standard variant configuration in this block, with two modifications:
- Instead of assigning weight to each variant, you simply mark a single variant as active.
- For chat_completion variants, instead of a system_template we require system_instructions as a text file and take no other templates.
active
- Type: boolean
- Required: Defaults to true if there is a single variant configured. Otherwise, this field is required to be set to true for exactly one variant.
system_instructions
- Type: string (path)
- Required: yes
The judge's output is a JSON object of the form
{"thinking": "<thinking>", "score": <float or boolean>}
with score configured to the output_type of the evaluator.
Example file: evaluations/email-guardrails/check-signature/claude_35_sonnet/system_instructions.txt