This guide shows how to define and run static evaluations for your TensorZero functions.
See our Quickstart to learn how to set up our LLM gateway, observability, and fine-tuning — in just 5 minutes.
You can find the code behind this tutorial and instructions on how to run it on GitHub. Reach out on Slack or Discord if you have any questions. We’d be happy to help!

Status Quo

Imagine we have a TensorZero function for writing haikus about a given topic, and we want to compare the behavior of GPT-4o and GPT-4o Mini on this task. Initially, our configuration for this function might look like this:
[functions.write_haiku]
type = "chat"
user_schema = "functions/write_haiku/user_schema.json"

[functions.write_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
user_template = "functions/write_haiku/user_template.minijinja"

[functions.write_haiku.variants.gpt_4o]
type = "chat_completion"
model = "openai::gpt-4o"
user_template = "functions/write_haiku/user_template.minijinja"
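For context, the user_schema.json and user_template.minijinja files referenced above might look something like the following. This is a minimal sketch assuming the function takes a single topic field; the actual files in the example repository may differ.
{
  "type": "object",
  "properties": {
    "topic": { "type": "string" }
  },
  "required": ["topic"],
  "additionalProperties": false
}
And the user template:
Write a haiku about the following topic: {{ topic }}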
How can we evaluate the behavior of our two variants in a principled way? One option is to build a dataset of “test cases” that we can evaluate them against.

Datasets

To use TensorZero Evaluations, you first need to build a dataset. A dataset is a collection of datapoints, and each datapoint has an input and, optionally, an output. In the context of evaluations, the output in the dataset should be a reference output, i.e. the output you’d have liked to see. You don’t necessarily need to provide a reference output: some evaluators (e.g. LLM judges) can score generated outputs without one, while evaluators that require a reference output simply skip datapoints that lack it. Let’s create a dataset:
  1. Generate many haikus by running inference on your write_haiku function. (On GitHub, we provide a script main.py that generates 100 haikus with write_haiku; a rough sketch of such a loop appears after this list.)
  2. Open the UI, navigate to “Datasets”, and select “Build Dataset” (http://localhost:4000/datasets/builder).
  3. Create a new dataset called haiku_dataset. Select your write_haiku function, “None” as the metric, and “Inference” as the dataset output.
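The generation loop in step 1 might look roughly like the sketch below. It assumes the TensorZero Python client and a gateway running at http://localhost:3000; the actual main.py in the example repository differs, and the exact input format can vary by client version.
from tensorzero import TensorZeroGateway

topics = ["sunrise", "autumn leaves", "city lights"]  # ...and many more

# Connect to a running TensorZero Gateway (adjust the URL for your setup)
with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    for topic in topics:
        response = client.inference(
            function_name="write_haiku",
            input={
                "messages": [
                    {
                        "role": "user",
                        # the `arguments` fields must match user_schema.json
                        "content": [{"type": "text", "arguments": {"topic": topic}}],
                    }
                ]
            },
        )
        print(response.content)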
See the Datasets & Datapoints API Reference to learn how to create and manage datasets programmatically.

Evaluations

Evaluations test the behavior of variants for a TensorZero function. Let’s define an evaluation in our configuration file:
[evaluations.haiku_eval]
type = "static"
function_name = "write_haiku"

Evaluators

Each evaluation has one or more evaluators: a rule or behavior you’d like to test. Today, TensorZero supports two types of evaluators: exact_match and llm_judge.
We’re planning to release other types of evaluators soon (e.g. semantic similarity in an embedding space).

exact_match

The exact_match evaluator compares the generated output with the datapoint’s reference output. If they are identical, it returns true; otherwise, it returns false.
[evaluations.haiku_eval.evaluators.exact_match]
type = "exact_match"

llm_judge

LLM judges are special-purpose TensorZero functions that can be used to evaluate another TensorZero function. For example, our haikus should generally follow a specific format, but it’s hard to define a heuristic to determine if they’re correct. Why not ask an LLM? Let’s do that:
[evaluations.haiku_eval.evaluators.valid_haiku]
type = "llm_judge"
output_type = "boolean"  # the judge returns a boolean (floats are also supported)
optimize = "max"  # higher is better
cutoff = 0.95  # an average score below 0.95 counts as failing

[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
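Here, system_instructions points to a plain-text file containing the judge’s prompt. As a rough sketch of what it might say (the actual instructions in the example repository will differ):
You will be given a haiku written about a given topic.
Determine whether it is a valid haiku: three lines that follow the 5-7-5 syllable structure and address the requested topic.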
Here, we defined an evaluator valid_haiku of type llm_judge, with a variant that uses GPT-4o Mini. Similar to regular TensorZero functions, we can define multiple variants for an LLM judge. But unlike regular functions, only one variant can be active at a time during evaluation; you can denote that with the active property.
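For example, if we later add a stronger judge variant, we can mark it as the one to use during evaluations. The snippet below is a sketch with a hypothetical gpt_4o_judge variant:
[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_judge]
type = "chat_completion"
model = "openai::gpt-4o"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
active = true  # use this variant when running the evaluation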
The LLM judge we showed above generates a boolean, but LLM judges can also generate floats. Let’s define another evaluator that counts the number of metaphors in our haikus.
[evaluations.haiku_eval.evaluators.metaphor_count]
type = "llm_judge"
output_type = "float"  # this judge returns a float
optimize = "max"
cutoff = 1  # an average of fewer than 1 metaphor per haiku counts as failing
We can also use different variant types for evaluators. Let’s use a chain-of-thought variant for our metaphor count evaluator, since it’s a bit more complex.
[evaluations.haiku_eval.evaluators.metaphor_count.variants.gpt_4o_mini_judge]
type = "experimental_chain_of_thought"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/metaphor_count/system_instructions.txt"
json_mode = "strict"
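As before, the system_instructions file is plain text describing what the judge should score. A rough sketch (the actual file in the example repository will differ):
You will be given a haiku. Count the number of metaphors it contains and return that count.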
The LLM judges we’ve defined so far only look at the datapoint’s input and the generated output. But we can also provide the datapoint’s reference output to the judge:
[evaluations.haiku_eval.evaluators.compare_haikus]
type = "llm_judge"
include = { reference_output = true }  # include the reference output in the LLM judge's context
output_type = "boolean"
optimize = "max"

[evaluations.haiku_eval.evaluators.compare_haikus.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/compare_haikus/system_instructions.txt"
json_mode = "strict"
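Because we included the reference output, the judge’s instructions can refer to it directly. A rough sketch (not the actual file):
You will be given two haikus about the same topic: a generated haiku and a reference haiku.
Return true if the generated haiku is at least as good as the reference haiku, and false otherwise.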

Running an Evaluation

Let’s run our evaluations! You can run evaluations using the TensorZero Evaluations CLI tool or the TensorZero UI.
The TensorZero Evaluations CLI tool can be helpful for CI/CD. It exits with code 0 if every evaluator’s average score meets its cutoff, and with code 1 otherwise.
By default, TensorZero Evaluations uses Inference Caching to speed up evaluations and reduce cost.

CLI

To run evaluations in the CLI, you can use the tensorzero/evaluations container:
docker compose run --rm evaluations \
    --evaluation-name haiku_eval \
    --dataset-name haiku_dataset \
    --variant-name gpt_4o \
    --concurrency 5
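Since the exit code reflects whether every evaluator met its cutoff, you can use the same command to gate a CI job. For example, as a shell step:
# Fail the CI job if any evaluator's average score falls below its cutoff
docker compose run --rm evaluations \
    --evaluation-name haiku_eval \
    --dataset-name haiku_dataset \
    --variant-name gpt_4o \
    --concurrency 5 \
    || { echo "Evaluation failed: score below cutoff"; exit 1; }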

UI

To run evaluations in the UI, navigate to “Evaluations” (http://localhost:4000/evaluations) and select “New Run”. You can compare multiple evaluation runs in the TensorZero UI, including runs launched from the CLI.