Tutorial: Dynamic Evaluations
Dynamic evaluations enable you to evaluate complex workflows that combine multiple inference calls with arbitrary application logic. Here, we’ll walk through a stylized RAG workflow to illustrate the process of setting up and running a dynamic evaluation, but the same process can be applied to any complex workflow.
Imagine we have the following LLM-powered workflow in response to a natural-language question from a user:
- Inference: Call the `generate_database_query` TensorZero function to generate a database query from the user’s question.
- Custom Logic: Run the database query against a database and retrieve the results (`my_blackbox_search_function`).
- Inference: Call the `generate_final_answer` TensorZero function to generate an answer from the retrieved results.
- Custom Logic: Score the answer using a custom scoring function (`my_blackbox_scoring_function`).
- Feedback: Send feedback using the `task_success` metric.
Evaluating `generate_database_query` and `generate_final_answer` in a vacuum (i.e. using static evaluations) can also be helpful, but ultimately we want to evaluate the entire workflow end-to-end.
This is where dynamic evaluations come in.
Complex LLM applications might need to make multiple LLM calls and execute arbitrary code before giving an overall result. In agentic applications, the workflow might even be defined dynamically at runtime based on the user’s input, the results of the LLM calls, or other factors. Dynamic evaluations in TensorZero provide complete flexibility and enable you to evaluate the entire workflow jointly. You can think of them like integration tests for your LLM applications.
Starting a dynamic evaluation run
Evaluating the workflow above involves tackling and evaluating a collection of tasks (e.g. user queries). Each individual task corresponds to an episode, and the collection of these episodes is a dynamic evaluation run.
First, let’s initialize the TensorZero client (just like you would for typical inference requests):
```python
from tensorzero import TensorZeroGateway

# Initialize the client with `build_http` or `build_embedded`
with TensorZeroGateway.build_http(
    gateway_url="http://localhost:3000",
) as t0:
    # ...
```
Now you can start a dynamic evaluation run.
During a dynamic evaluation run, you specify which variants you want to pin during the run (i.e. the set of variants you want to evaluate). This allows you to see the effects of different combinations of variants on the end-to-end system’s performance.
You can optionally also specify a `project_name` and `display_name` for the run. If you specify a `project_name`, you’ll be able to compare this run against other runs for that project using the TensorZero UI. The `display_name` is a human-readable identifier for the run that you can use to identify the run in the TensorZero UI.
```python
run_info = t0.dynamic_evaluation_run(
    # Assume we have these variants defined in our `tensorzero.toml` configuration file
    variants={
        "generate_database_query": "o4_mini_prompt_baseline",
        "generate_final_answer": "gpt_4o_updated_prompt",
    },
    project_name="simple_rag_project",
    display_name="generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
)
```
If you’re using the asynchronous Python client, the same setup looks like this:

```python
from tensorzero import AsyncTensorZeroGateway

# Initialize the client with `build_http` or `build_embedded`
async with await AsyncTensorZeroGateway.build_http(
    gateway_url="http://localhost:3000",
) as t0:
    # ...
```

```python
run_info = await t0.dynamic_evaluation_run(
    # Assume we have these variants defined in our `tensorzero.toml` configuration file
    variants={
        "generate_database_query": "o4_mini_prompt_baseline",
        "generate_final_answer": "gpt_4o_updated_prompt",
    },
    project_name="simple_rag_project",
    display_name="generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
)
```
If you’re using the HTTP gateway directly, you can start a run with the `POST /dynamic_evaluation_run` endpoint:

```bash
curl -X POST http://localhost:3000/dynamic_evaluation_run \
  -H "Content-Type: application/json" \
  -d '{
    "variants": {
      "generate_database_query": "o4_mini_prompt_baseline",
      "generate_final_answer": "gpt_4o_updated_prompt"
    },
    "project_name": "simple_rag_project",
    "display_name": "generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt"
  }'
```
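For example, if you later want to evaluate a different combination of variants, you can start a second run in the same project and compare the two runs side by side in the TensorZero UI. A minimal sketch (the `gpt_4o_mini_prompt_baseline` variant name is a placeholder; substitute variants defined in your own `tensorzero.toml`):

```python
# Hypothetical second run: same project, different variant combination
run_info_alt = t0.dynamic_evaluation_run(
    variants={
        "generate_database_query": "gpt_4o_mini_prompt_baseline",  # placeholder variant name
        "generate_final_answer": "gpt_4o_updated_prompt",
    },
    project_name="simple_rag_project",  # same project, so the runs are comparable in the UI
    display_name="generate_database_query::gpt_4o_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
)
```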
Starting an episode in a dynamic evaluation run
For each task (e.g. datapoint) we want to include in our dynamic evaluation run, we need to start an episode. For example, in our agentic RAG project, each episode will correspond to a user query from our dataset; each user query requires multiple inference calls and application logic to run.
To initialize an episode, you need to provide the `run_id` of the dynamic evaluation run you want to include the episode in. You can optionally also specify a `task_name` for the episode. If you specify a `task_name`, you’ll be able to compare this episode against episodes for that task from other runs using the TensorZero UI. We encourage you to use the `task_name` to provide a meaningful identifier for the task that the episode is tackling.
```python
episode_info = t0.dynamic_evaluation_run_episode(
    run_id=run_info.run_id,
    task_name="user_query_123",
)
```
Now we can use `episode_info.episode_id` to make inference and feedback calls.
With the async client:

```python
episode_info = await t0.dynamic_evaluation_run_episode(
    run_id=run_info.run_id,
    task_name="user_query_123",
)
```
With the HTTP gateway:

```bash
curl -X POST http://localhost:3000/dynamic_evaluation_run/{run_id}/episode \
  -H "Content-Type: application/json" \
  -d '{
    "task_name": "user_query_123"
  }'
```

The response includes the episode ID to use in subsequent inference and feedback calls.
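In practice, you’ll typically start one episode per datapoint in your evaluation set. A minimal sketch (the query list and ID scheme here are assumptions standing in for your own dataset):

```python
# Hypothetical evaluation set: (task_name, question) pairs from your own dataset
user_queries = [
    ("user_query_123", "What were our top-selling products last quarter?"),
    ("user_query_124", "How many new customers signed up in March?"),
]

for task_name, question in user_queries:
    episode_info = t0.dynamic_evaluation_run_episode(
        run_id=run_info.run_id,
        task_name=task_name,  # lets the UI compare this task across runs
    )
    # ... run the workflow for `question` using `episode_info.episode_id` (next section)
```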
Making inference and feedback calls during a dynamic evaluation run
Within each episode, you make inference and feedback calls just like you normally would, passing the `episode_id` you received when starting the episode:

```python
generate_database_query_response = t0.inference(
    function_name="generate_database_query",
    episode_id=episode_info.episode_id,
    input={ ... },
)

search_result = my_blackbox_search_function(generate_database_query_response)

generate_final_answer_response = t0.inference(
    function_name="generate_final_answer",
    episode_id=episode_info.episode_id,
    input={ ... },
)

task_success_score = my_blackbox_scoring_function(generate_final_answer_response)

t0.feedback(
    metric_name="task_success",
    episode_id=episode_info.episode_id,
    value=task_success_score,
)
```
With the async client:

```python
generate_database_query_response = await t0.inference(
    function_name="generate_database_query",
    episode_id=episode_info.episode_id,
    input={ ... },
)

search_result = my_blackbox_search_function(generate_database_query_response)

generate_final_answer_response = await t0.inference(
    function_name="generate_final_answer",
    episode_id=episode_info.episode_id,
    input={ ... },
)

task_success_score = my_blackbox_scoring_function(generate_final_answer_response)

await t0.feedback(
    metric_name="task_success",
    episode_id=episode_info.episode_id,
    value=task_success_score,
)
```
With the HTTP gateway:

```bash
# First inference call
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_database_query",
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "input": { ... }
  }'

# Run your custom search function with the result...
# my_blackbox_search_function(...)

# Second inference call
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_final_answer",
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "input": { ... }
  }'

# Run your custom scoring function with the result...
# my_blackbox_scoring_function(...)

# Feedback call
curl -X POST http://localhost:3000/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "metric_name": "task_success",
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "value": 0.85
  }'
```
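Putting it all together, a dynamic evaluation for this workflow boils down to one run, one episode per user query, and the usual inference and feedback calls inside each episode. Here’s a minimal end-to-end sketch using the synchronous Python client; the dataset loader, input payloads, and the search/scoring helpers are hypothetical placeholders for your own application logic, and the exact `input` shape depends on your function schemas:

```python
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as t0:
    # 1. Start the dynamic evaluation run, pinning the variants under evaluation
    run_info = t0.dynamic_evaluation_run(
        variants={
            "generate_database_query": "o4_mini_prompt_baseline",
            "generate_final_answer": "gpt_4o_updated_prompt",
        },
        project_name="simple_rag_project",
        display_name="generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
    )

    # 2. One episode per task (hypothetical dataset loader)
    for task_name, question in load_user_queries():  # placeholder for your own dataset
        episode_info = t0.dynamic_evaluation_run_episode(
            run_id=run_info.run_id,
            task_name=task_name,
        )
        episode_id = episode_info.episode_id

        # 3. Run the workflow: inference calls plus custom application logic
        query_response = t0.inference(
            function_name="generate_database_query",
            episode_id=episode_id,
            input={"messages": [{"role": "user", "content": question}]},  # assumed payload
        )
        search_result = my_blackbox_search_function(query_response)  # placeholder

        answer_response = t0.inference(
            function_name="generate_final_answer",
            episode_id=episode_id,
            input={"messages": [{"role": "user", "content": f"{question}\n{search_result}"}]},  # assumed payload
        )

        # 4. Score the answer and send feedback for the episode
        task_success_score = my_blackbox_scoring_function(answer_response)  # placeholder
        t0.feedback(
            metric_name="task_success",
            episode_id=episode_id,
            value=task_success_score,
        )
```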
Visualizing evaluation results in the TensorZero UI
Once you finish running all the relevant episodes for your dynamic evaluation run, you can visualize the results in the TensorZero UI.
In the UI, you can compare metrics across evaluation runs, inspect individual episodes and inferences, and more.