This guide shows how to define and run static evaluations for your TensorZero functions.
See our Quickstart to learn how to set up our LLM gateway, observability, and fine-tuning — in just 5 minutes.
You can find the code behind this tutorial and instructions on how to run it on GitHub. Reach out on Slack or Discord if you have any questions. We’d be happy to help!

Status Quo

Imagine we have a TensorZero function for writing haikus about a given topic, and we want to compare the behavior of GPT-4o and GPT-4o Mini on this task. Initially, our configuration for this function might look like this:
[functions.write_haiku]
type = "chat"
user_schema = "functions/write_haiku/user_schema.json"

[functions.write_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
user_template = "functions/write_haiku/user_template.minijinja"

[functions.write_haiku.variants.gpt_4o]
type = "chat_completion"
model = "openai::gpt-4o"
user_template = "functions/write_haiku/user_template.minijinja"
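For context, the user_schema.json and user_template.minijinja files referenced above might look something like the following. This is a minimal sketch assuming the function takes a single topic field; the actual files in the example repository may differ.
{
  "type": "object",
  "properties": {
    "topic": { "type": "string" }
  },
  "required": ["topic"],
  "additionalProperties": false
}
And the user template:
Write a haiku about the following topic: {{ topic }}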
How can we evaluate the behavior of our two variants in a principled way? One option is to build a dataset of “test cases” that we can evaluate them against.

Datasets

To use TensorZero Evaluations, you first need to build a dataset. A dataset is a collection of datapoints, and each datapoint has an input and, optionally, an output. In the context of evaluations, the output in the dataset should be a reference output, i.e. the output you’d have liked to see. You don’t necessarily need to provide a reference output: some evaluators (e.g. LLM judges) can score generated outputs without one, while evaluators that require a reference output simply skip datapoints that lack it. Let’s create a dataset:
  1. Generate many haikus by running inference on your write_haiku function. (On GitHub, we provide a script main.py that generates 100 haikus with write_haiku; a rough sketch of such a loop appears after this list.)
  2. Open the UI, navigate to “Datasets”, and select “Build Dataset” (http://localhost:4000/datasets/builder).
  3. Create a new dataset called haiku_dataset. Select your write_haiku function, “None” as the metric, and “Inference” as the dataset output.
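The generation loop in step 1 might look roughly like the sketch below. It assumes the TensorZero Python client and a gateway running at http://localhost:3000; the actual main.py in the example repository differs, and the exact input format can vary by client version.
from tensorzero import TensorZeroGateway

topics = ["sunrise", "autumn leaves", "city lights"]  # ...and many more

# Connect to a running TensorZero Gateway (adjust the URL for your setup)
with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    for topic in topics:
        response = client.inference(
            function_name="write_haiku",
            input={
                "messages": [
                    {
                        "role": "user",
                        # the `arguments` fields must match user_schema.json
                        "content": [{"type": "text", "arguments": {"topic": topic}}],
                    }
                ]
            },
        )
        print(response.content)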
See the Datasets & Datapoints API Reference to learn how to create and manage datasets programmatically.

Evaluations

Evaluations test the behavior of variants for a TensorZero function. Let’s define an evaluation in our configuration file:
[evaluations.haiku_eval]
type = "static"
function_name = "write_haiku"

Evaluators

Each evaluation has one or more evaluators: a rule or behavior you’d like to test. Today, TensorZero supports two types of evaluators: exact_match and llm_judge.
We’re planning to release other types of evaluators soon (e.g. semantic similarity in an embedding space).

exact_match

The exact_match evaluator compares the generated output with the datapoint’s reference output. If they are identical, it returns true; otherwise, it returns false.
[evaluations.haiku_eval.evaluators.exact_match]
type = "exact_match"

llm_judge

LLM judges are special-purpose TensorZero functions that can be used to evaluate another TensorZero function. For example, our haikus should generally follow a specific format, but it’s hard to define a heuristic to determine if they’re correct. Why not ask an LLM? Let’s do that:
[evaluations.haiku_eval.evaluators.valid_haiku]
type = "llm_judge"
output_type = "boolean"  # the judge returns a boolean (floats are also supported)
optimize = "max"  # higher is better
cutoff = 0.95  # an average score below 0.95 counts as failing

[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
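Here, system_instructions points to a plain-text file containing the judge’s prompt. As a rough sketch of what it might say (the actual instructions in the example repository will differ):
You will be given a haiku written about a given topic.
Determine whether it is a valid haiku: three lines that follow the 5-7-5 syllable structure and address the requested topic.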
Here, we defined an evaluator valid_haiku of type llm_judge, with a variant that uses GPT-4o Mini. Similar to regular TensorZero functions, we can define multiple variants for an LLM judge. But unlike regular functions, only one variant can be active at a time during evaluation; you can denote that with the active property.
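For example, if we later add a stronger judge variant, we can mark it as the one to use during evaluations. The snippet below is a sketch with a hypothetical gpt_4o_judge variant:
[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_judge]
type = "chat_completion"
model = "openai::gpt-4o"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
active = true  # use this variant when running the evaluation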
The LLM judge we showed above generates a boolean, but LLM judges can also generate floats. Let’s define another evaluator that counts the number of metaphors in our haikus.
[evaluations.haiku_eval.evaluators.metaphor_count]
type = "llm_judge"
output_type = "float"  # this judge returns a float
optimize = "max"
cutoff = 1  # an average of fewer than 1 metaphor per haiku counts as failing
We can also use different variant types for evaluators. Let’s use a chain-of-thought variant for our metaphor count evaluator, since it’s a bit more complex.
[evaluations.haiku_eval.evaluators.metaphor_count.variants.gpt_4o_mini_judge]
type = "experimental_chain_of_thought"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/metaphor_count/system_instructions.txt"
json_mode = "strict"
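As before, the system_instructions file is plain text describing what the judge should score. A rough sketch (the actual file in the example repository will differ):
You will be given a haiku. Count the number of metaphors it contains and return that count.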
The LLM judges we’ve defined so far only look at the datapoint’s input and the generated output. But we can also provide the datapoint’s reference output to the judge:
[evaluations.haiku_eval.evaluators.compare_haikus]
type = "llm_judge"
include = { reference_output = true }  # include the reference output in the LLM judge's context
output_type = "boolean"
optimize = "max"

[evaluations.haiku_eval.evaluators.compare_haikus.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/compare_haikus/system_instructions.txt"
json_mode = "strict"
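Because we included the reference output, the judge’s instructions can refer to it directly. A rough sketch (not the actual file):
You will be given two haikus about the same topic: a generated haiku and a reference haiku.
Return true if the generated haiku is at least as good as the reference haiku, and false otherwise.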

Running an Evaluation

Let’s run our evaluations! You can run evaluations using the TensorZero Evaluations CLI tool or the TensorZero UI.
The TensorZero Evaluations CLI tool can be helpful for CI/CD. It exits with code 0 if every evaluator’s average score meets its cutoff, and with code 1 otherwise.
By default, TensorZero Evaluations uses Inference Caching to speed up evaluations and reduce cost.

CLI

To run evaluations in the CLI, you can use the tensorzero/evaluations container:
docker compose run --rm evaluations \
    --evaluation-name haiku_eval \
    --dataset-name haiku_dataset \
    --variant-name gpt_4o \
    --concurrency 5
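Since the exit code reflects whether every evaluator met its cutoff, you can use the same command to gate a CI job. For example, as a shell step:
# Fail the CI job if any evaluator's average score falls below its cutoff
docker compose run --rm evaluations \
    --evaluation-name haiku_eval \
    --dataset-name haiku_dataset \
    --variant-name gpt_4o \
    --concurrency 5 \
    || { echo "Evaluation failed: score below cutoff"; exit 1; }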

UI

To run evaluations in the UI, navigate to “Evaluations” (http://localhost:4000/evaluations) and select “New Run”. You can compare multiple evaluation runs in the TensorZero UI, including runs launched from the CLI.