Dynamic evaluations enable you to evaluate complex workflows that combine multiple inference calls with arbitrary application logic. Here, we’ll walk through a stylized RAG workflow to illustrate the process of setting up and running a dynamic evaluation, but the same process can be applied to any complex workflow. Imagine we have the following LLM-powered workflow in response to a natural-language question from a user:
  1. Inference: Call the generate_database_query TensorZero function to generate a database query from the user’s question.
  2. Custom Logic: Run the database query against a database and retrieve the results (my_blackbox_search_function).
  3. Inference: Call the generate_final_answer TensorZero function to generate an answer from the retrieved results.
  4. Custom Logic: Score the answer using a custom scoring function (my_blackbox_scoring_function; see the sketch after this list).
  5. Feedback: Send feedback using the task_success metric.
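For concreteness, the two custom-logic helpers referenced above might look like the following stubs (a minimal sketch; the signatures are assumptions and the bodies are entirely application-specific):

def my_blackbox_search_function(query_response):
    # Run the generated database query against your database and return the retrieved results
    ...

def my_blackbox_scoring_function(answer_response):
    # Score the final answer, e.g. as a float or boolean suitable for the task_success metric
    ...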
Evaluating generate_database_query and generate_final_answer in a vacuum (i.e. using static evaluations) can also be helpful, but ultimately we want to evaluate the entire workflow end-to-end. This is where dynamic evaluations come in. Complex LLM applications might need to make multiple LLM calls and execute arbitrary code before giving an overall result. In agentic applications, the workflow might even be defined dynamically at runtime based on the user’s input, the results of the LLM calls, or other factors. Dynamic evaluations in TensorZero provide complete flexibility and enable you to evaluate the entire workflow jointly. You can think of them like integration tests for your LLM applications.
For a more complex, runnable example, see the Dynamic Evaluations for Agentic RAG Example on GitHub.

Starting a dynamic evaluation run

Evaluating the workflow above involves tackling and evaluating a collection of tasks (e.g. user queries). Each individual task corresponds to an episode, and the collection of these episodes is a dynamic evaluation run.
First, let’s initialize the TensorZero client (just like you would for typical inference requests):
from tensorzero import TensorZeroGateway

# Initialize the client with `build_http` or `build_embedded`
with TensorZeroGateway.build_http(
    gateway_url="http://localhost:3000",
) as t0:
    # ...
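If you run the gateway embedded in your Python process instead, you can initialize the client with build_embedded (a sketch; the configuration path and ClickHouse URL below are placeholders for your own deployment):

from tensorzero import TensorZeroGateway

# Embedded gateway: point the client at your configuration file and ClickHouse instance
with TensorZeroGateway.build_embedded(
    config_file="config/tensorzero.toml",
    clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero",
) as t0:
    ...  # same usage as in the rest of this guide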
Now you can start a dynamic evaluation run. When you start a run, you specify the variants you want to pin for the duration of the run (i.e. the set of variants you want to evaluate). This lets you see how different combinations of variants affect the end-to-end performance of your system.
You don’t have to specify a variant for every function you use; if you don’t specify a variant, the TensorZero Gateway will sample a variant for you as it normally would.
You can optionally also specify a project_name and display_name for the run. If you specify a project_name, you’ll be able to compare this run against other runs for that project using the TensorZero UI. The display_name is a human-readable identifier for the run that you can use to identify the run in the TensorZero UI.
run_info = t0.dynamic_evaluation_run(
    # Assume we have these variants defined in our `tensorzero.toml` configuration file
    variants={
        "generate_database_query": "o4_mini_prompt_baseline",
        "generate_final_answer": "gpt_4o_updated_prompt",
    },
    project_name="simple_rag_project",
    display_name="generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
)
The TensorZero client automatically tags your dynamic evaluation runs with information about your Git repository if available (e.g. branch name, commit hash). This metadata is displayed in the TensorZero UI so that you have a record of the code that was used to run the dynamic evaluation. We recommend that you commit your changes before running a dynamic evaluation so that the Git state is accurately captured.

Starting an episode in a dynamic evaluation run

For each task (e.g. datapoint) we want to include in our dynamic evaluation run, we need to start an episode. For example, in our RAG workflow, each episode corresponds to a user query from our dataset; each user query requires multiple inference calls and some application logic to complete.
To initialize an episode, you need to provide the run_id of the dynamic evaluation run you want to include the episode in. You can optionally also specify a task_name for the episode. If you specify a task_name, you’ll be able to compare this episode against episodes for that task from other runs using the TensorZero UI. We encourage you to use the task_name to provide a meaningful identifier for the task that the episode is tackling.
episode_info = t0.dynamic_evaluation_run_episode(
    run_id=run_info.run_id,
    task_name="user_query_123",
)
Now we can use episode_info.episode_id to make inference and feedback calls.
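Putting the pieces together, a typical dynamic evaluation loops over a dataset of tasks and starts one episode per task. Here is a rough sketch (dataset and run_workflow are hypothetical placeholders for your own data and application logic):

for task in dataset:  # hypothetical: your collection of user queries
    episode_info = t0.dynamic_evaluation_run_episode(
        run_id=run_info.run_id,
        task_name=task["task_name"],  # hypothetical field holding a meaningful task identifier
    )
    # Run the multi-step workflow for this task with episode_info.episode_id (see next section)
    run_workflow(t0, episode_info.episode_id, task)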

Making inference and feedback calls during a dynamic evaluation run

You can make inference and feedback calls during the episode just as you normally would; the only difference is that you pass the episode_id you received when starting the episode.
You can also use the OpenAI SDK for inference calls. See the Quickstart for more details. (Similarly, you can use dynamic evaluations with any framework or agent that is OpenAI-compatible by passing along the episode ID and function name in the request to TensorZero.)
# Step 1 (Inference): generate a database query from the user's question
generate_database_query_response = t0.inference(
    function_name="generate_database_query",
    episode_id=episode_info.episode_id,
    input={ ... },
)

# Step 2 (Custom Logic): run the query against the database and retrieve the results
search_result = my_blackbox_search_function(generate_database_query_response)

# Step 3 (Inference): generate the final answer from the retrieved results
generate_final_answer_response = t0.inference(
    function_name="generate_final_answer",
    episode_id=episode_info.episode_id,
    input={ ... },
)

# Step 4 (Custom Logic): score the final answer
task_success_score = my_blackbox_scoring_function(generate_final_answer_response)

# Step 5 (Feedback): record the task_success metric for this episode
t0.feedback(
    metric_name="task_success",
    episode_id=episode_info.episode_id,
    value=task_success_score,
)
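As noted above, you can route these inference calls through the OpenAI SDK instead. Here is a rough sketch of what the first call might look like (this assumes the OpenAI-compatible endpoint described in the Quickstart; the message content is a placeholder):

from openai import OpenAI

# Point the OpenAI client at the TensorZero Gateway's OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:3000/openai/v1", api_key="dummy")  # key is not used by the gateway

generate_database_query_response = client.chat.completions.create(
    model="tensorzero::function_name::generate_database_query",
    messages=[{"role": "user", "content": "..."}],  # placeholder input
    extra_body={"tensorzero::episode_id": str(episode_info.episode_id)},
)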

Visualizing evaluation results in the TensorZero UI

Once you finish running all the relevant episodes for your dynamic evaluation run, you can visualize the results in the TensorZero UI. In the UI, you can compare metrics across evaluation runs, inspect individual episodes and inferences, and more.

[Screenshot: Dynamic Evaluation Run Results in the TensorZero UI]