See our Quickstart to learn how to set up our LLM gateway, observability, and fine-tuning — in just 5 minutes.
Part I — Simple Chatbot
We’ll start by building a vanilla LLM-powered chatbot, and build up to more complex applications from there.
Functions
A TensorZero Function is an abstract mapping from input variables to output variables. As you onboard to TensorZero, a function should replace each prompt in your system. At a high level, a function will template the inputs to generate a prompt, make an LLM inference call, and return the results. This mapping can be achieved with various choices of model, prompt, decoding strategy, and more; each such combination is called a variant, which we’ll discuss below. For our simple chatbot, we’ll set up a function that maps the chat history to a new chat message. We define functions in the tensorzero.toml configuration file.
The configuration file is written in TOML, which is a simple configuration language.
Read more about the TOML syntax in the TOML documentation.
tensorzero.toml
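As a rough sketch (the names here are placeholders, and the remaining fields depend on the function type), a function entry has this shape:

```toml
# Placeholder names; the other fields depend on the function type.
[functions.my_function_name]
type = "..."
```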
TensorZero supports two types of functions: chat functions, which match the typical chat interface you’d expect from an LLM API, and json functions, which are optimized for generating structured outputs.
We’ll start with a chat function for this example, and later we’ll see how to use json functions.
A chat function takes a chat message history and returns a chat message.
It doesn’t have any required fields (but many optional ones).
Let’s call our function mischievous_chatbot and set its type to chat.
We’ll ignore the optional fields for now.
To include these changes, our tensorzero.toml file should include the following:
tensorzero.toml
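A minimal sketch of that entry, using the name and type we chose above:

```toml
[functions.mischievous_chatbot]
type = "chat"
```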
Before we can use this function, we also need to define one or more variants.
But before we can define a variant, we need to set up a model and a model provider.
Models and Model Providers
Before setting up your first TensorZero variant, you’ll need a model with a model provider. A model specifies a particular LLM (e.g. GPT-4o or your fine-tuned Llama 3), and model providers specify the different ways you can access a given model (e.g. GPT-4o is available through both OpenAI and Azure). A model has an arbitrary name and a list of providers. Let’s start with a single provider for our model. A provider has an arbitrary name, a type, and other fields that depend on the provider type. The skeleton of a model and its provider looks like this:
tensorzero.toml
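As a rough sketch (placeholder names throughout, with the remaining provider fields depending on the provider type):

```toml
[models.my_model_name]
routing = ["my_provider_name"]

[models.my_model_name.providers.my_provider_name]
type = "..."
```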
Let’s call our model my_gpt_4o_mini and our provider my_openai_provider with type openai.
The only required field for the openai provider is model_name.
It’s a best practice to pin the model to a specific version to avoid breaking changes, so we’ll use gpt-4o-mini-2024-07-18.
TensorZero supports proprietary models (e.g. OpenAI, Anthropic), inference services (e.g. Fireworks AI, Together AI), and self-hosted LLMs (e.g. vLLM), including your own fine-tuned models on each of these. See Integrations and Configuration Reference for more details.
After filling in these fields, our tensorzero.toml file should include the following:
tensorzero.toml
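A minimal sketch of that model block, using the names and pinned model version above:

```toml
[models.my_gpt_4o_mini]
routing = ["my_openai_provider"]

[models.my_gpt_4o_mini.providers.my_openai_provider]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
```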
You can add multiple providers for the same model to enable fallbacks.
The gateway will try each provider in the routing field in order until one succeeds.
This is helpful to mitigate the impact of provider downtime and rate limiting.
Variants
Now that we have a model and a provider configured, we can create a variant for our mischievous_chatbot function.
A variant is a particular implementation of a function.
In practice, a variant might specify the particular model, prompt templates, a decoding strategy, hyperparameters, and other settings used for inference.
A variant’s definition includes an arbitrary name, a type, a weight, and other fields that depend on the type.
The skeleton of a TensorZero variant looks like this:
tensorzero.toml
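As a rough sketch (placeholder names, with the remaining fields depending on the variant type):

```toml
[functions.my_function_name.variants.my_variant_name]
type = "..."
weight = 1.0
```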
Let’s call our variant gpt_4o_mini_variant.
The simplest variant type is chat_completion, which is the typical chat completion format used by OpenAI and many other LLM providers.
TensorZero supports other variant types which implement inference-time optimizations.
See Configuration Reference for more details on variant types and their configuration options.
The weight field is used to determine the probability of this variant being chosen.
Since we only have one variant, we’ll give it a weight of 1.0.
We’ll dive deeper into variant weights in a later section.
The only required field for a chat_completion variant is model.
This must be a model defined in the configuration file.
We’ll use the my_gpt_4o_mini model we defined earlier.
After filling in the fields for this variant, our tensorzero.toml file should include the following:
tensorzero.toml
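A minimal sketch of that variant entry, using the values we chose above:

```toml
[functions.mischievous_chatbot.variants.gpt_4o_mini_variant]
type = "chat_completion"
weight = 1.0
model = "my_gpt_4o_mini"
```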
If you don’t require advanced functionality for model providers (e.g. Retries & Fallbacks), you don’t have to define model configuration blocks.
TensorZero supports short-hand model names like openai::gpt-4o-mini or anthropic::claude-3-5-haiku in a variant’s model field.
See Configuration Reference for more details.
Inference API Requests
There’s a lot more to TensorZero than what we’ve covered so far, but this is everything we need to get started! If you launch the TensorZero Gateway with this configuration file, the mischievous_chatbot function will be available on the /inference endpoint.
Let’s make a request to this endpoint.
You can install the TensorZero Python client with pip. Then, you can make a TensorZero API call with the client, or by posting directly to the gateway’s HTTP API, as sketched below:
POST /inference
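As a minimal sketch, here’s a request made directly against the gateway’s HTTP API with the Python requests library; it assumes the gateway is running locally on its default port (3000), and the system prompt and user message are just illustrative:

```python
import requests

# Assumes the TensorZero Gateway is running locally on its default port.
GATEWAY_URL = "http://localhost:3000"

response = requests.post(
    f"{GATEWAY_URL}/inference",
    json={
        "function_name": "mischievous_chatbot",
        "input": {
            "system": "You are a friendly but mischievous AI assistant.",
            "messages": [
                {"role": "user", "content": "What is the capital of Japan?"},
            ],
        },
    },
)
response.raise_for_status()
print(response.json())
```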
Sample Output
The TensorZero Gateway also supports streaming inference. See the API Reference for more details.
Earlier we mentioned that you can add multiple providers for the same model to enable model fallbacks.
TensorZero additionally supports variant fallbacks.
The gateway first tries to fall back to a different provider for the same model.
If all providers for a variant are unavailable, the gateway will keep re-sampling variants (without replacement) until one succeeds.
Part II — Email Copilot
Next, let’s build an LLM-powered copilot for drafting emails. We’ll use this opportunity to show off more of TensorZero’s features.
Templates
In the previous example, we provided a system prompt on every request. Unless the system prompt completely changes between requests, this is not ideal for production applications. Instead, we can use a system template. Using a template allows you to update the prompt without client-side changes. Later, we’ll see how to parametrize templates with schemas and run robust prompt experiments with multiple variants. In particular, setting up schemas will materially help you optimize your models robustly down the road. Let’s start with a simple system template. For this example, the system template is static, so you won’t need a schema. TensorZero uses MiniJinja for templating. Since we’re not using any variables, however, we don’t need any special syntax.
Read more about MiniJinja syntax in the MiniJinja documentation.
MiniJinja is similar to Jinja2 but there are a few differences. See their compatibility guide for more details. MiniJinja also provides a browser playground where you can test your templates.
system.minijinja
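For instance, a static system template is just plain text (the wording here is illustrative):

```
You are a helpful AI assistant that drafts emails on behalf of the user.
```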
Schemas
The system template for this example is static, but often you’ll want to parametrize the prompts. When you define a template with parameters, you need to define a corresponding JSON Schema. The schema defines the structure of the input for that prompt. With it, the gateway can validate the input before running the inference, and later, we’ll see how to use it for robust model optimization. For our email copilot’s user prompt, we’ll want to parametrize the template with three string fields: recipient_name, sender_name, and email_purpose.
We want all fields to be required and don’t want any additional fields.
Ask your favorite LLM to generate the schema for you. Claude generated the schema for this example using this request:
user_schema.json
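A schema that meets those requirements looks like this (one reasonable way to write it):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "recipient_name": { "type": "string" },
    "sender_name": { "type": "string" },
    "email_purpose": { "type": "string" }
  },
  "required": ["recipient_name", "sender_name", "email_purpose"],
  "additionalProperties": false
}
```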
You can export JSON Schemas from Pydantic models and Zod schemas.
user.minijinja
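The user template can then reference those fields with MiniJinja variables; a sketch with illustrative wording:

```jinja
Please draft an email.

Recipient Name: {{ recipient_name }}
Sender Name: {{ sender_name }}
Email Purpose: {{ email_purpose }}
```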
Functions with Templates and Schemas
Let’s finally create our function and variant for the email copilot.
Schemas belong to functions and templates belong to variants. Think of this like a function signature vs. method implementation when programming.
The same schema can be used by multiple templates, but the schema itself should not change over time.
If you’ve already used a function in production and want to change its signature, we recommend simply copying the function and renaming it.
Our roadmap includes better support for schema versioning and migrations.
Compared to our earlier chatbot function, we’ll add a user_schema field to the function and system_template and user_template fields to the variant.
tensorzero.toml
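A sketch of what that configuration could look like; the function name (draft_email), variant name, and file paths are illustrative placeholders:

```toml
[functions.draft_email]
type = "chat"
user_schema = "functions/draft_email/user_schema.json"

[functions.draft_email.variants.gpt_4o_mini_email_variant]
type = "chat_completion"
weight = 1.0
model = "my_gpt_4o_mini"
system_template = "functions/draft_email/gpt_4o_mini_email_variant/system.minijinja"
user_template = "functions/draft_email/gpt_4o_mini_email_variant/user.minijinja"
```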
The inference response will include an inference_id and an episode_id, which we’ll use later to associate feedback with inferences.
POST /inference
Sample Output
You can find the full code to reproduce this example on GitHub.
Inference-Level Metrics
The TensorZero Gateway allows you to assign feedback to inferences or sequences of inferences by defining metrics. Metrics encapsulate the downstream outcomes of your LLM application, and drive the experimentation and optimization workflows in TensorZero. This example covers metrics that apply to individual inference requests. Later, we’ll show how to define metrics that apply to sequences of inferences (which we call episodes). The skeleton of a metric looks like the following configuration entry.
tensorzero.toml
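As a rough sketch (placeholder name; the fields are discussed below):

```toml
[metrics.my_metric_name]
type = "..."
optimize = "..."
level = "..."
```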
For our email copilot, we want to track whether users accept the generated drafts, so let’s call our metric email_draft_accepted.
We should use a metric of type boolean to capture this behavior since we’re optimizing for a binary outcome: whether the email draft is accepted or not.
We currently support the following metric types:

| Metric Type | Description | Examples |
| --- | --- | --- |
| Boolean Metric | A boolean indicating success | Thumbs up; task success |
| Float Metric | A number to be optimized | Mistakes; interactions; resources used |
| Comment | Natural-language feedback | Feedback from users or developers |
| Demonstration | Example of desired output | Edited drafts; labels; human-generated content |

See the Configuration Reference for more details about how to configure metrics.
level = "inference"
.
And finally, we’ll set optimize = "max"
because we want to maximize this metric.
Our metric configuration should look like this:
tensorzero.toml
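Putting those choices together, a minimal sketch of the metric entry:

```toml
[metrics.email_draft_accepted]
type = "boolean"
optimize = "max"
level = "inference"
```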
Feedback API Requests
As our application collects usage data, we can use the /feedback endpoint to keep track of this metric.
Make sure to restart your gateway after adding the metric configuration.
Previously, we saw that every time you call /inference, the Gateway will return an inference_id field in the response.
You’ll want to substitute this inference_id into the command below.
POST /feedback
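As a minimal sketch (assumed default local gateway address and a placeholder inference_id):

```python
import requests

GATEWAY_URL = "http://localhost:3000"  # assumed default gateway address

response = requests.post(
    f"{GATEWAY_URL}/feedback",
    json={
        "metric_name": "email_draft_accepted",
        # Substitute the inference_id returned by your /inference call.
        "inference_id": "00000000-0000-0000-0000-000000000000",
        "value": True,
    },
)
response.raise_for_status()
print(response.json())
```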
Sample Output
Experimentation
So far, we’ve only used one variant of our function. In practice, you’ll want to experiment with different configurations — for example, different prompts, models, and parameters. TensorZero makes this easy with built-in experimentation features. You can define multiple variants of a function, and the gateway will sample from them at inference time.
For now you must manage variant weights yourself, but we’re planning to release an asynchronous multi-armed bandit algorithm we’ve implemented for robust automated experimentation.
For our email copilot, let’s add a second variant that uses a different model and tweaks the temperature parameter to control the creativity of the AI assistant.
Let’s start by adding a new model and provider.
tensorzero.toml
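For instance (the specific second model is just an illustration), we could add GPT-4o alongside the existing model:

```toml
[models.my_gpt_4o]
routing = ["my_openai_provider"]

[models.my_gpt_4o.providers.my_openai_provider]
type = "openai"
model_name = "gpt-4o-2024-08-06"
```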
tensorzero.toml
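Then we can define a second variant and split traffic between the two via their weights; the variant names, temperature value, and 70/30 split below are illustrative:

```toml
[functions.draft_email.variants.gpt_4o_mini_email_variant]
type = "chat_completion"
weight = 0.7
model = "my_gpt_4o_mini"
system_template = "functions/draft_email/gpt_4o_mini_email_variant/system.minijinja"
user_template = "functions/draft_email/gpt_4o_mini_email_variant/user.minijinja"

# The second variant reuses the same templates with a different model and temperature.
[functions.draft_email.variants.gpt_4o_email_variant]
type = "chat_completion"
weight = 0.3
model = "my_gpt_4o"
temperature = 0.9
system_template = "functions/draft_email/gpt_4o_mini_email_variant/system.minijinja"
user_template = "functions/draft_email/gpt_4o_mini_email_variant/user.minijinja"
```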
Weights don’t have to add up to 1.0.
In such a case, the gateway will normalize the weights and sample accordingly.
Part III — Weather RAG
The next example introduces tool use into the mix.
Some providers call this feature “Function Calling”.
But don’t confuse it with TensorZero Functions — those are completely different concepts!
You can also use TensorZero to manage more complex RAG workflows.
We’ll soon release an example featuring an agentic workflow with multi-hop retrieval and reasoning.
We’ll set up two functions: one for query generation (generate_weather_query) and another for response generation (generate_weather_report).
The former will leverage tool use (get_temperature) to generate a weather query.
Here we mock the weather API, but it’ll be easy to see how diverse RAG workflows can be integrated.
Tools
TensorZero has first-class support for tools. You can define a tool in your configuration file, and attach it to a function that should be allowed to call it. Let’s start by defining a tool. A tool has a name, a description, and a set of parameters (described with a JSON schema). The skeleton of a tool configuration looks like this:
tensorzero.toml
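As a rough sketch (placeholder name and path; parameters points at a JSON Schema file):

```toml
[tools.my_tool_name]
description = "..."
parameters = "path/to/parameters_schema.json"
```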
Our get_temperature tool will take two parameters: location (a string) and units (an enum with values fahrenheit and celsius).
Only location is required, and no additional properties should be allowed.
Finally, we’ll add descriptions for each parameter and for the tool itself — this is very important to increase the quality of tool use!
get_temperature.json
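A parameter schema along those lines (the description strings are illustrative):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "description": "Get the current temperature for a given location.",
  "properties": {
    "location": {
      "type": "string",
      "description": "The location to get the temperature for (e.g. \"Tokyo\")."
    },
    "units": {
      "type": "string",
      "enum": ["fahrenheit", "celsius"],
      "description": "The units to return the temperature in."
    }
  },
  "required": ["location"],
  "additionalProperties": false
}
```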
tensorzero.toml
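And a sketch of the corresponding configuration entry (the description wording and schema path are illustrative):

```toml
[tools.get_temperature]
description = "Get the current temperature for a given location."
parameters = "tools/get_temperature.json"
```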
Functions with Tool Use
We can now create our two functions. The query generation function will use the tool we just defined, and the response generation function will be similar to our previous examples. Let’s define the functions, their variants, and any associated templates and schemas.
tensorzero.toml
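A sketch of that configuration, reusing the my_gpt_4o_mini model from earlier and the template and schema paths listed below (the exact values are illustrative):

```toml
[functions.generate_weather_query]
type = "chat"
tools = ["get_temperature"]

[functions.generate_weather_query.variants.simple_variant]
type = "chat_completion"
weight = 1.0
model = "my_gpt_4o_mini"
system_template = "functions/generate_weather_query/simple_variant/system.minijinja"

[functions.generate_weather_report]
type = "chat"
user_schema = "functions/generate_weather_report/user_schema.json"

[functions.generate_weather_report.variants.simple_variant]
type = "chat_completion"
weight = 1.0
model = "my_gpt_4o_mini"
system_template = "functions/generate_weather_report/simple_variant/system.minijinja"
user_template = "functions/generate_weather_report/simple_variant/user.minijinja"
```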
Both functions have a variant called simple_variant, but those are separate variants.
A variant is always specific to a function.
Multiple variants, however, can share the same model.
functions/generate_weather_query/simple_variant/system.minijinja
functions/generate_weather_report/simple_variant/system.minijinja
functions/generate_weather_report/user_schema.json
functions/generate_weather_report/simple_variant/user.minijinja
TensorZero also supports multi-turn tool use, parallel tool calls, tool choice strategies, dynamic tool definition, and more.
See the Configuration Reference for more information.
Notably, another approach to our weather RAG example is to use a single function for both query and response generation (i.e. multi-turn tool use).
As an exercise, why don’t you try implementing it? See Chat Function with Multi-Turn Tool Use for an example.
If the model decides to call a tool, the response will include a tool_call content block.
These content blocks have the fields arguments, name, raw_arguments, and raw_name.
The first two fields are validated against the tool’s configuration (or null if invalid).
The last two fields contain the raw values received from the model.
Episodes
Before we make any inference requests, we must introduce one more concept: episodes. An episode is a sequence of inferences associated with a common downstream outcome. For example, an episode could refer to a sequence of LLM calls associated with:
- Resolving a support ticket
- Preparing an insurance claim
- Completing a phone call
- Extracting data from a document
- Drafting an email
The /inference endpoint accepts an optional episode_id field.
When you make the first inference request, you don’t have to provide an episode_id.
The gateway will create a new episode for you and return the episode_id in the response.
When you make the second inference request, you must provide the episode_id you received in the first response.
The gateway will use the episode_id to associate the two inference requests together.
You shouldn’t generate episode IDs yourself.
The gateway will create a new episode ID for you if you don’t provide one.
Then, you can use it with other inferences you’d like to associate with the episode.
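As a minimal sketch of these mechanics (it reuses the mischievous_chatbot function from Part I purely for brevity, and assumes the gateway’s default local address; the same pattern applies to the weather functions):

```python
import requests

GATEWAY_URL = "http://localhost:3000"  # assumed default gateway address

# First inference: no episode_id, so the gateway starts a new episode.
first = requests.post(
    f"{GATEWAY_URL}/inference",
    json={
        "function_name": "mischievous_chatbot",
        "input": {"messages": [{"role": "user", "content": "Hello!"}]},
    },
).json()

episode_id = first["episode_id"]

# Second inference: pass the episode_id so both calls belong to the same episode.
second = requests.post(
    f"{GATEWAY_URL}/inference",
    json={
        "function_name": "mischievous_chatbot",
        "episode_id": episode_id,
        "input": {"messages": [{"role": "user", "content": "Tell me a joke."}]},
    },
).json()
```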
POST /inference
Sample Output
Episode-Level Metrics
The primary use case for episodes is to enable episode-level metrics. In the previous example, we assigned feedback to individual inferences. TensorZero can also collect episode-level feedback, which can be useful for optimizing entire workflows. To collect episode-level feedback, we need to define a metric with level = "episode".
Let’s add a metric for the weather RAG example.
We’ll use user_rating as the metric name, and we’ll collect it as a float.
tensorzero.toml
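A sketch of that metric entry (assuming higher ratings are better, so optimize = "max"):

```toml
[metrics.user_rating]
type = "float"
optimize = "max"
level = "episode"
```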
Providing an episode_id instead of an inference_id associates the feedback with the entire episode.
POST /feedback
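For example (assumed default local gateway address and a placeholder episode_id):

```python
import requests

response = requests.post(
    "http://localhost:3000/feedback",
    json={
        "metric_name": "user_rating",
        # Substitute the episode_id returned by your /inference calls.
        "episode_id": "00000000-0000-0000-0000-000000000000",
        "value": 5,
    },
)
response.raise_for_status()
```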
Sample Output
Part IV — Email Data Extraction
JSON Functions
Everything we’ve done so far has been with Chat Functions. TensorZero also supports JSON Functions for use cases that require structured outputs. The input is the same, but the function returns a JSON value instead of a chat message.
Depending on the use case, both Chat Functions with Tool Use and JSON Functions can be used. In fact, the TensorZero Gateway will sometimes convert between the two under the hood for model providers that don’t support one of them natively.
As a rule of thumb, we typically recommend using JSON Functions if you have a single, well-defined output schema. If you need more flexibility (e.g. letting the model pick between multiple tools, or whether to pick a tool at all), then Chat Functions with Tool Use is likely a better fit.
That said, try experimenting with both and see which one works best for your use case!
type = "json"
and requires an output_schema
.
Let’s start by defining the schema, a static system template, and the rest of the configuration.
output_schema.json
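For illustration (the exact fields you extract will depend on your application), an output schema might look like this:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "email_address": { "type": "string" }
  },
  "required": ["email_address"],
  "additionalProperties": false
}
```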
functions/extract_email/simple_variant/system.minijinja
tensorzero.toml
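And a sketch of the corresponding configuration (the function name extract_email comes from the file paths above; the schema path and model are assumptions):

```toml
[functions.extract_email]
type = "json"
output_schema = "functions/extract_email/output_schema.json"

[functions.extract_email.variants.simple_variant]
type = "chat_completion"
weight = 1.0
model = "my_gpt_4o_mini"
system_template = "functions/extract_email/simple_variant/system.minijinja"
```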
Like with Chat Functions, you can define multiple variants of a JSON function.
There are additional parameters (e.g. json_mode) that you can use to control the behavior of these variants.
JSON Functions return an output field instead of a content field.
The output field will be a JSON object with the fields parsed and raw.
The parsed field contains the output parsed as JSON and validated against your schema (null if the model didn’t generate a JSON that matches your schema), and the raw field contains the raw output from the model as a string.
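A sketch of calling this function and reading the structured output (assumed default local gateway address; the email text is a placeholder):

```python
import requests

response = requests.post(
    "http://localhost:3000/inference",
    json={
        "function_name": "extract_email",
        "input": {
            "messages": [
                {"role": "user", "content": "...the raw email text to extract data from..."},
            ],
        },
    },
)
response.raise_for_status()
output = response.json()["output"]
print(output["parsed"])  # None if the model's output didn't match the schema
print(output["raw"])     # the raw string generated by the model
```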
POST /inference
Sample Output
Conclusion
This tutorial only scratches the surface of what you can do with TensorZero. TensorZero especially shines when it comes to optimizing complex LLM workflows using the data collected by the gateway. For example, the structured data collected by the gateway can be used to better fine-tune models compared to using historical prompts and generations alone. We are working on a series of examples covering the entire “data flywheel in a box” that TensorZero provides. Here are some of our favorites:
- Optimizing Data Extraction (NER) with TensorZero
- Agentic RAG — Multi-Hop Question Answering with LLMs
- Writing Haikus to Satisfy a Judge with Hidden Preferences
- Improving LLM Chess Ability with Best/Mixture-of-N Sampling
We’re working on many more examples, especially for advanced use cases.
Stay tuned!
Exploring TensorZero at work? We’d be happy to set up a Slack or Teams Connect channel with your team (free).
Email us at [email protected].