See our Quickstart to learn how to set up our LLM gateway, observability, and fine-tuning — in just 5 minutes.
Status Quo
Imagine we have a TensorZero function for writing haikus about a given topic, and we want to compare the behavior of GPT-4o and GPT-4o Mini on this task. Initially, our configuration for this function might look like:
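For concreteness, here is a sketch of what that configuration could contain (the variant names are our own, and the model names use the openai:: shorthand; the exact fields may differ in your setup):

```toml
# tensorzero.toml (sketch)

[functions.write_haiku]
type = "chat"
user_schema = "functions/write_haiku/user_schema.json"

# One variant per model we want to compare
[functions.write_haiku.variants.gpt_4o]
type = "chat_completion"
model = "openai::gpt-4o"
user_template = "functions/write_haiku/user_template.minijinja"

[functions.write_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
user_template = "functions/write_haiku/user_template.minijinja"
```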
User Schema & Template
functions/write_haiku/user_schema.json
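For illustration, the user schema might simply require a single topic string (a sketch; the actual file may differ):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "topic": { "type": "string" }
  },
  "required": ["topic"],
  "additionalProperties": false
}
```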
functions/write_haiku/user_template.minijinja
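And the user template might simply interpolate that topic into the prompt (again, an illustrative sketch):

```
Write a haiku about the following topic: {{ topic }}
```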
Datasets
To use TensorZero Evaluations, you first need to build a dataset. A dataset is a collection of datapoints. Each datapoint has an input and optionally an output. In the context of evaluations, the output in the dataset should be a reference output, i.e. the output you’d have liked to see. You don’t necessarily need to provide a reference output: some evaluators (e.g. LLM judges) can score generated outputs without a reference output (otherwise, that datapoint is skipped).

Let’s create a dataset:

- Generate many haikus by running inference on your write_haiku function. (On GitHub, we provide a script main.py that generates 100 haikus with write_haiku; a minimal sketch is shown after this list.)
- Open the UI, navigate to “Datasets”, and select “Build Dataset” (http://localhost:4000/datasets/builder).
- Create a new dataset called haiku_dataset. Select your write_haiku function, “None” as the metric, and “Inference” as the dataset output.
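For reference, here is a minimal sketch of such a script, assuming a recent version of the TensorZero Python client and a gateway running on localhost:3000 (the topics and the structured input format are illustrative; the real main.py on GitHub may differ):

```python
# Sketch of a haiku-generation script (hypothetical; see main.py on GitHub for the real one).
from tensorzero import TensorZeroGateway

topics = ["autumn rain", "city lights", "quiet mountains"]  # the real script uses ~100 topics

# Assumes the TensorZero Gateway is running locally on port 3000
with TensorZeroGateway.build_http_gateway(gateway_url="http://localhost:3000") as client:
    for topic in topics:
        response = client.inference(
            function_name="write_haiku",
            input={
                "messages": [
                    {
                        "role": "user",
                        # The arguments must satisfy functions/write_haiku/user_schema.json
                        "content": [{"type": "text", "arguments": {"topic": topic}}],
                    }
                ]
            },
        )
        print(response.content)
```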
See the Datasets & Datapoints API Reference to learn how to create and manage datasets programmatically.
Evaluations
Evaluations test the behavior of variants for a TensorZero function. Let’s define an evaluation in our configuration file:
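A sketch of what this could look like (the field names here are assumptions based on the pattern used elsewhere in this guide):

```toml
# tensorzero.toml (sketch)

[evaluations.haiku_eval]
type = "static"            # assumption: a static evaluation over a dataset
function_name = "write_haiku"

# The evaluators for haiku_eval are defined in the next section.
```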
Evaluators

Each evaluation has one or more evaluators: a rule or behavior you’d like to test. Today, TensorZero supports two types of evaluators: exact_match and llm_judge.
We’re planning to release other types of evaluators soon (e.g. semantic similarity in an embedding space).
exact_match
The exact_match evaluator compares the generated output with the datapoint’s reference output. If they are identical, it returns true; otherwise, it returns false.
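In configuration, that could look roughly like this (a sketch; the optional cutoff threshold relates to the CLI pass/fail behavior described later):

```toml
[evaluations.haiku_eval.evaluators.exact_match]
type = "exact_match"
cutoff = 0.9  # optional (assumption): minimum average score for a CLI run to pass
```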
llm_judge
LLM Judges are special-purpose TensorZero functions that can be used to evaluate a TensorZero function.
For example, our haikus should generally follow a specific format, but it’s hard to define a heuristic to determine if they’re correct.
Why not ask an LLM?
Let’s do that:
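Here is a sketch of what the evaluator and its judge variant might look like in configuration (field names such as output_type and optimize are assumptions):

```toml
[evaluations.haiku_eval.evaluators.valid_haiku]
type = "llm_judge"
output_type = "boolean"  # the judge answers true/false
optimize = "max"         # higher is better

[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
```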
System Instructions
evaluations/haiku_eval/valid_haiku/system_instructions.txt
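The instructions themselves are plain text; an illustrative (hypothetical) version might read:

```
You will be given a haiku about a particular topic.
Evaluate whether it is a valid haiku: it should have three lines, roughly follow
a 5-7-5 syllable structure, and relate to the given topic.
Return true if it is a valid haiku and false otherwise.
```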
Here, we define an evaluator valid_haiku of type llm_judge, with a variant that uses GPT-4o Mini.
Similar to regular TensorZero functions, we can define multiple variants for an LLM judge.
But unlike regular functions, only one variant can be active at a time during evaluation; you can denote that with the active property.
Example: Multiple Variants for an LLM Judge
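For instance, here is a sketch with two judge variants where only the first is used during evaluation (the variant names and fields are our own assumptions):

```toml
[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
active = true  # this variant is used during evaluation

[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_judge]
type = "chat_completion"
model = "openai::gpt-4o"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
```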
System Instructions
evaluations/haiku_eval/metaphor_count/system_instructions.txt

System Instructions
evaluations/haiku_eval/compare_haikus/system_instructions.txt
Running an Evaluation
Let’s run our evaluations! You can run evaluations using the TensorZero Evaluations CLI tool or the TensorZero UI.

The TensorZero Evaluations CLI tool can be helpful for CI/CD. It’ll exit with code 0 if all evaluations succeed (average score vs. cutoff), or code 1 otherwise.

By default, TensorZero Evaluations uses Inference Caching to improve inference speed and cost.
CLI
To run evaluations in the CLI, you can use the tensorzero/evaluations container:
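For example, an invocation could look like this (the flag names below are assumptions; only the docker compose run evaluations entrypoint is taken from this guide):

```bash
# Hypothetical invocation; the flag names are assumptions.
docker compose run --rm evaluations \
  --evaluation-name haiku_eval \
  --dataset-name haiku_dataset \
  --variant-name gpt_4o_mini \
  --concurrency 5
```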
Docker Compose
Here’s the relevant section of the docker-compose.yml for the evaluations tool. See GitHub for the complete Docker Compose configuration.
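A sketch of what that section might contain (the volume mount, environment variables, and ClickHouse URL are assumptions; adjust them to your deployment):

```yaml
services:
  evaluations:
    image: tensorzero/evaluations
    profiles: [evaluations]  # not started by `docker compose up`; run it explicitly
    volumes:
      - ./config:/app/config:ro  # assumption: mount your TensorZero configuration
    environment:
      # assumption: the tool reads ClickHouse and LLM judge credentials from env vars
      - TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@clickhouse:8123/tensorzero
      - OPENAI_API_KEY=${OPENAI_API_KEY:?}
```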
You should provide credentials for any LLM judges. Alternatively, the evaluations tool can use an external TensorZero Gateway with the --gateway-url http://gateway:3000 flag.

Docker Compose does not start this service with docker compose up since we have profiles: [evaluations]. You need to call it explicitly with docker compose run evaluations when desired.

UI
To run evaluations in the UI, navigate to “Evaluations” (http://localhost:4000/evaluations) and select “New Run”.
You can compare multiple evaluation runs in the TensorZero UI (including evaluation runs from the CLI).
