Tutorial: Dynamic Evaluations
Dynamic evaluations enable you to evaluate complex workflows that combine multiple inference calls with arbitrary application logic. Here, we’ll walk through a stylized RAG workflow to illustrate the process of setting up and running a dynamic evaluation, but the same process can be applied to any complex workflow.
Imagine we have the following LLM-powered workflow in response to a natural-language question from a user:
- Inference: Call the `generate_database_query` TensorZero function to generate a database query from the user’s question.
- Custom Logic: Run the database query against a database and retrieve the results (`my_blackbox_search_function`).
- Inference: Call the `generate_final_answer` TensorZero function to generate an answer from the retrieved results.
- Custom Logic: Score the answer using a custom scoring function (`my_blackbox_scoring_function`).
- Feedback: Send feedback using the `task_success` metric.
Evaluating `generate_database_query` and `generate_final_answer` in a vacuum (i.e. using static evaluations) can also be helpful, but ultimately we want to evaluate the entire workflow end-to-end.
This is where dynamic evaluations come in.
Complex LLM applications might need to make multiple LLM calls and execute arbitrary code before giving an overall result. In agentic applications, the workflow might even be defined dynamically at runtime based on the user’s input, the results of the LLM calls, or other factors. Dynamic evaluations in TensorZero provide complete flexibility and enable you to evaluate the entire workflow jointly. You can think of them like integration tests for your LLM applications.
Starting a dynamic evaluation run
Evaluating the workflow above involves tackling and evaluating a collection of tasks (e.g. user queries). Each individual task corresponds to an episode, and the collection of these episodes is a dynamic evaluation run.
First, let’s initialize the TensorZero client (just like you would for typical inference requests):
```python
from tensorzero import TensorZeroGateway

# Initialize the client with `build_http` or `build_embedded`
with TensorZeroGateway.build_http(
    gateway_url="http://localhost:3000",
) as t0:
    # ...
```
Now you can start a dynamic evaluation run.
During a dynamic evaluation run, you specify which variants you want to pin during the run (i.e. the set of variants you want to evaluate). This allows you to see the effects of different combinations of variants on the end-to-end system’s performance.
You can optionally also specify a `project_name` and `display_name` for the run. If you specify a `project_name`, you’ll be able to compare this run against other runs for that project using the TensorZero UI. The `display_name` is a human-readable identifier for the run that you can use to identify the run in the TensorZero UI.
```python
run_info = t0.dynamic_evaluation_run(
    # Assume we have these variants defined in our `tensorzero.toml` configuration file
    variants={
        "generate_database_query": "o4_mini_prompt_baseline",
        "generate_final_answer": "gpt_4o_updated_prompt",
    },
    project_name="simple_rag_project",
    display_name="generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
)
```
If you’re using the asynchronous Python client, the same setup looks like this:

```python
from tensorzero import AsyncTensorZeroGateway

# Initialize the client with `build_http` or `build_embedded`
async with await AsyncTensorZeroGateway.build_http(
    gateway_url="http://localhost:3000",
) as t0:
    # ...
```

```python
run_info = await t0.dynamic_evaluation_run(
    # Assume we have these variants defined in our `tensorzero.toml` configuration file
    variants={
        "generate_database_query": "o4_mini_prompt_baseline",
        "generate_final_answer": "gpt_4o_updated_prompt",
    },
    project_name="simple_rag_project",
    display_name="generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
)
```
If you’re using the HTTP gateway directly, you can start a run with the `POST /dynamic_evaluation_run` endpoint:

```bash
curl -X POST http://localhost:3000/dynamic_evaluation_run \
  -H "Content-Type: application/json" \
  -d '{
    "variants": {
      "generate_database_query": "o4_mini_prompt_baseline",
      "generate_final_answer": "gpt_4o_updated_prompt"
    },
    "project_name": "simple_rag_project",
    "display_name": "generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt"
  }'
```
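For example, if you later want to evaluate a different combination of variants, you can start a second run in the same project and compare the two runs side by side in the TensorZero UI. A minimal sketch (the `gpt_4o_mini_prompt_baseline` variant name is a placeholder; substitute variants defined in your own `tensorzero.toml`):

```python
# Hypothetical second run: same project, different variant combination
run_info_alt = t0.dynamic_evaluation_run(
    variants={
        "generate_database_query": "gpt_4o_mini_prompt_baseline",  # placeholder variant name
        "generate_final_answer": "gpt_4o_updated_prompt",
    },
    project_name="simple_rag_project",  # same project, so the runs are comparable in the UI
    display_name="generate_database_query::gpt_4o_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
)
```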
Starting an episode in a dynamic evaluation run
For each task (e.g. datapoint) we want to include in our dynamic evaluation run, we need to start an episode. For example, in our agentic RAG project, each episode will correspond to a user query from our dataset; each user query requires multiple inference calls and application logic to run.
To initialize an episode, you need to provide the `run_id` of the dynamic evaluation run you want to include the episode in. You can optionally also specify a `task_name` for the episode. If you specify a `task_name`, you’ll be able to compare this episode against episodes for that task from other runs using the TensorZero UI. We encourage you to use the `task_name` to provide a meaningful identifier for the task that the episode is tackling.
```python
episode_info = t0.dynamic_evaluation_run_episode(
    run_id=run_info.run_id,
    task_name="user_query_123",
)
```
Now we can use `episode_info.episode_id` to make inference and feedback calls.
With the async client:

```python
episode_info = await t0.dynamic_evaluation_run_episode(
    run_id=run_info.run_id,
    task_name="user_query_123",
)
```
With the HTTP gateway:

```bash
curl -X POST http://localhost:3000/dynamic_evaluation_run/{run_id}/episode \
  -H "Content-Type: application/json" \
  -d '{
    "task_name": "user_query_123"
  }'
```

The response includes the episode ID to use in subsequent inference and feedback calls.
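In practice, you’ll typically start one episode per datapoint in your evaluation set. A minimal sketch (the query list and ID scheme here are assumptions standing in for your own dataset):

```python
# Hypothetical evaluation set: (task_name, question) pairs from your own dataset
user_queries = [
    ("user_query_123", "What were our top-selling products last quarter?"),
    ("user_query_124", "How many new customers signed up in March?"),
]

for task_name, question in user_queries:
    episode_info = t0.dynamic_evaluation_run_episode(
        run_id=run_info.run_id,
        task_name=task_name,  # lets the UI compare this task across runs
    )
    # ... run the workflow for `question` using `episode_info.episode_id` (next section)
```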
Making inference and feedback calls during a dynamic evaluation run
Within each episode, you make inference and feedback calls just like you normally would, passing the `episode_id` you received when starting the episode:

```python
generate_database_query_response = t0.inference(
    function_name="generate_database_query",
    episode_id=episode_info.episode_id,
    input={ ... },
)

search_result = my_blackbox_search_function(generate_database_query_response)

generate_final_answer_response = t0.inference(
    function_name="generate_final_answer",
    episode_id=episode_info.episode_id,
    input={ ... },
)

task_success_score = my_blackbox_scoring_function(generate_final_answer_response)

t0.feedback(
    metric_name="task_success",
    episode_id=episode_info.episode_id,
    value=task_success_score,
)
```
With the async client:

```python
generate_database_query_response = await t0.inference(
    function_name="generate_database_query",
    episode_id=episode_info.episode_id,
    input={ ... },
)

search_result = my_blackbox_search_function(generate_database_query_response)

generate_final_answer_response = await t0.inference(
    function_name="generate_final_answer",
    episode_id=episode_info.episode_id,
    input={ ... },
)

task_success_score = my_blackbox_scoring_function(generate_final_answer_response)

await t0.feedback(
    metric_name="task_success",
    episode_id=episode_info.episode_id,
    value=task_success_score,
)
```
With the HTTP gateway:

```bash
# First inference call
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_database_query",
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "input": { ... }
  }'

# Run your custom search function with the result...
# my_blackbox_search_function(...)

# Second inference call
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_final_answer",
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "input": { ... }
  }'

# Run your custom scoring function with the result...
# my_blackbox_scoring_function(...)

# Feedback call
curl -X POST http://localhost:3000/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "metric_name": "task_success",
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "value": 0.85
  }'
```
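Putting it all together, a dynamic evaluation for this workflow boils down to one run, one episode per user query, and the usual inference and feedback calls inside each episode. Here’s a minimal end-to-end sketch using the synchronous Python client; the dataset loader, input payloads, and the search/scoring helpers are hypothetical placeholders for your own application logic, and the exact `input` shape depends on your function schemas:

```python
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as t0:
    # 1. Start the dynamic evaluation run, pinning the variants under evaluation
    run_info = t0.dynamic_evaluation_run(
        variants={
            "generate_database_query": "o4_mini_prompt_baseline",
            "generate_final_answer": "gpt_4o_updated_prompt",
        },
        project_name="simple_rag_project",
        display_name="generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
    )

    # 2. One episode per task (hypothetical dataset loader)
    for task_name, question in load_user_queries():  # placeholder for your own dataset
        episode_info = t0.dynamic_evaluation_run_episode(
            run_id=run_info.run_id,
            task_name=task_name,
        )
        episode_id = episode_info.episode_id

        # 3. Run the workflow: inference calls plus custom application logic
        query_response = t0.inference(
            function_name="generate_database_query",
            episode_id=episode_id,
            input={"messages": [{"role": "user", "content": question}]},  # assumed payload
        )
        search_result = my_blackbox_search_function(query_response)  # placeholder

        answer_response = t0.inference(
            function_name="generate_final_answer",
            episode_id=episode_id,
            input={"messages": [{"role": "user", "content": f"{question}\n{search_result}"}]},  # assumed payload
        )

        # 4. Score the answer and send feedback for the episode
        task_success_score = my_blackbox_scoring_function(answer_response)  # placeholder
        t0.feedback(
            metric_name="task_success",
            episode_id=episode_id,
            value=task_success_score,
        )
```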
Visualizing evaluation results in the TensorZero UI
Once you finish running all the relevant episodes for your dynamic evaluation run, you can visualize the results in the TensorZero UI.
In the UI, you can compare metrics across evaluation runs, inspect individual episodes and inferences, and more.