# Comparison: TensorZero vs. DSPy
Source: https://www.tensorzero.com/docs/comparison/dspy
TensorZero is an open-source alternative to DSPy featuring an LLM gateway, observability, optimization, evaluations, and experimentation.
TensorZero and DSPy serve **different but complementary** purposes in the LLM ecosystem.
TensorZero is a full-stack LLM engineering platform focused on production applications and optimization, while DSPy is a framework for programming with language models through modular prompting.
**You can get the best of both worlds by using DSPy and TensorZero together!**
## Similarities
* **LLM Optimization.**
Both TensorZero and DSPy focus on LLM optimization, but in different ways.
DSPy focuses on automated prompt engineering, while TensorZero provides a complete set of tools for optimizing LLM systems (including prompts, models, and inference strategies).
* **LLM Programming Abstractions.**
Both TensorZero and DSPy provide abstractions for working with LLMs in a structured way, moving beyond raw prompting to more maintainable approaches.
[→ Prompt Templates & Schemas with TensorZero](/gateway/create-a-prompt-template)
* **Automated Prompt Engineering.**
TensorZero implements GEPA and MIPROv2, the leading automated prompt engineering algorithms recommended by DSPy.
GEPA iteratively refines your prompt templates based on an inference evaluation, and MIPROv2 jointly optimizes instructions and in-context examples in prompts.
[→ Guide: Optimize your prompts with GEPA](/optimization/gepa)
[→ Recipe: Automated Prompt Engineering with MIPRO](https://github.com/tensorzero/tensorzero/tree/main/recipes/mipro)
## Key Differences
### TensorZero
* **Production Infrastructure.**
TensorZero provides complete production infrastructure including **observability, optimization, evaluations, and experimentation** capabilities.
DSPy focuses on the development phase and prompt programming patterns.
* **Model Optimization.**
TensorZero provides tools for optimizing models, including fine-tuning and RLHF.
DSPy primarily focuses on automated prompt engineering.
[→ Optimization Recipes with TensorZero](/recipes/)
* **Inference-Time Optimization.**
TensorZero provides inference-time optimizations like dynamic in-context learning.
DSPy focuses on offline optimization strategies (e.g. static in-context learning).
[→ Inference-Time Optimizations with TensorZero](/gateway/guides/inference-time-optimizations/)
### DSPy
* **Advanced Automated Prompt Engineering.**
DSPy provides sophisticated automated prompt engineering tools for LLMs like teleprompters, recursive reasoning, and self-improvement loops.
TensorZero has some built-in prompt optimization features (more on the way) and integrates with DSPy for additional capabilities.
* **Lightweight Design.**
DSPy is a lightweight framework focused solely on LLM programming patterns, particularly during the R\&D stage.
TensorZero is a more comprehensive platform with additional infrastructure components covering end-to-end LLM engineering workflows.
Is TensorZero missing any features that are really important to you? Let us know on [GitHub Discussions](https://github.com/tensorzero/tensorzero/discussions), [Slack](https://www.tensorzero.com/slack), or [Discord](https://www.tensorzero.com/discord).
## Combining TensorZero and DSPy
You can get the best of both worlds by using DSPy and TensorZero together!
TensorZero provides a number of pre-built optimization recipes covering common LLM engineering workflows like supervised fine-tuning and RLHF.
But you can also easily export observability data for your own recipes and workflows.
# Comparison: TensorZero vs. LangChain
Source: https://www.tensorzero.com/docs/comparison/langchain
TensorZero is an open-source alternative to LangChain featuring an LLM gateway, observability, optimization, evaluations, and experimentation.
TensorZero and LangChain both provide tools for LLM orchestration, but they serve different purposes in the ecosystem.
While LangChain focuses on rapid prototyping with a large ecosystem of integrations, TensorZero is designed for production-grade deployments with built-in observability, optimization, evaluations, and experimentation capabilities.
We provide a minimal example [integrating TensorZero with LangGraph](https://github.com/tensorzero/tensorzero/tree/main/examples/integrations/langgraph).
## Similarities
* **LLM Orchestration.**
Both TensorZero and LangChain are developer tools that streamline LLM engineering workflows.
TensorZero focuses on production-grade deployments and end-to-end LLM engineering workflows (inference, observability, optimization, evaluations, experimentation).
LangChain focuses on rapid prototyping and offers complementary commercial products for features like observability.
* **Open Source.**
Both TensorZero (Apache 2.0) and LangChain (MIT) are open-source.
TensorZero is fully open-source (including TensorZero UI for observability), whereas LangChain requires a commercial offering for certain features (e.g. LangSmith for observability).
* **Unified Interface.**
Both TensorZero and LangChain offer a unified interface that allows you to access LLMs from most major model providers with a single integration, with support for structured outputs, tool use, streaming, and more.
[→ TensorZero Gateway Quickstart](/quickstart/)
* **Inference-Time Optimizations.**
Both TensorZero and LangChain offer inference-time optimizations like dynamic in-context learning.
[→ Inference-Time Optimizations with TensorZero](/gateway/guides/inference-time-optimizations/)
* **Inference Caching.**
Both TensorZero and LangChain allow you to cache requests to improve latency and reduce costs.
[→ Inference Caching with TensorZero](/gateway/guides/inference-caching/)
## Key Differences
### TensorZero
* **Separation of Concerns: Application Engineering vs. LLM Optimization.**
TensorZero enables a clear separation between application logic and LLM implementation details.
By treating LLM functions as interfaces with structured inputs and outputs, TensorZero allows you to swap implementations without changing application code.
This approach makes it easier to manage complex LLM applications, enables GitOps for prompt and configuration management, and streamlines optimization and experimentation workflows.
LangChain blends application logic with LLM implementation details, streamlining rapid prototyping but making it harder to maintain and optimize complex applications.
[→ Prompt Templates & Schemas with TensorZero](/gateway/create-a-prompt-template)
[→ Advanced: Think of LLM Applications as POMDPs — Not Agents](https://www.tensorzero.com/blog/think-of-llm-applications-as-pomdps-not-agents/)
* **Open-Source Observability.**
TensorZero offers built-in observability features (including UI), collecting inference and feedback data in your own database.
LangChain requires a separate commercial service (LangSmith) for observability.
* **Built-in Optimization.**
TensorZero offers built-in optimization features, including supervised fine-tuning, RLHF, and automated prompt engineering recipes.
With the TensorZero UI, you can fine-tune models using your inference and feedback data in just a few clicks.
LangChain doesn't offer any built-in optimization features.
[→ Optimization Recipes with TensorZero](/recipes/)
* **Built-in Evaluations.**
TensorZero offers built-in evaluation functionality, including heuristics and LLM judges.
LangChain requires a separate commercial service (LangSmith) for evaluations.
[→ TensorZero Evaluations Overview](/evaluations/)
* **Automated Experimentation (A/B Testing).**
TensorZero offers built-in experimentation features, allowing you to run experiments on your prompts, models, and inference strategies.
LangChain doesn't offer any experimentation features.
[→ Run adaptive A/B tests with TensorZero](/experimentation/run-adaptive-ab-tests/)
* **Performance & Scalability.**
TensorZero is built from the ground up for high performance, with a focus on low latency and high throughput.
LangChain introduces substantial latency and memory overhead to your application.
[→ TensorZero Gateway Benchmarks](/gateway/benchmarks/)
* **Language and Platform Agnostic.**
TensorZero is language and platform agnostic; in addition to its Python client, it supports any language that can make HTTP requests.
LangChain only supports applications built in Python and JavaScript.
[→ TensorZero Gateway API Reference](/gateway/api-reference/inference/)
* **Batch Inference.**
TensorZero supports batch inference with certain model providers, which significantly reduces inference costs.
LangChain doesn't support batch inference.
[→ Batch Inference with TensorZero](/gateway/guides/batch-inference/)
* **Credential Management.**
TensorZero streamlines credential management for your model providers, allowing you to manage your API keys in a single place and set up advanced workflows like load balancing between API keys.
LangChain only offers basic credential management features.
[→ Credential Management with TensorZero](/operations/manage-credentials/)
* **Automatic Fallbacks for Higher Reliability.**
TensorZero allows you to very easily set up retries, fallbacks, load balancing, and routing to increase reliability.
LangChain only offers basic, cumbersome fallback functionality.
[→ Retries & Fallbacks with TensorZero](/gateway/guides/retries-fallbacks/)
### LangChain
* **Focus on Rapid Prototyping.**
LangChain is designed for rapid prototyping, with a focus on ease of use and quick iteration.
TensorZero is designed for production-grade deployments, so it requires more setup and configuration (e.g. a database to store your observability data) — but you can still get started in minutes.
[→ TensorZero Quickstart — From 0 to Observability & Fine-Tuning](/quickstart/)
* **Ecosystem of Integrations.**
LangChain has a large ecosystem of integrations with other libraries and tools, including model providers, vector databases, observability tools, and more.
TensorZero provides many integrations with model providers, but delegates other integrations to the user.
* **Managed Service.**
LangChain offers paid managed (hosted) services for features like observability (LangSmith).
TensorZero is fully open-source and self-hosted.
Is TensorZero missing any features that are really important to you? Let us know on [GitHub Discussions](https://github.com/tensorzero/tensorzero/discussions), [Slack](https://www.tensorzero.com/slack), or [Discord](https://www.tensorzero.com/discord).
# Comparison: TensorZero vs. Langfuse
Source: https://www.tensorzero.com/docs/comparison/langfuse
TensorZero is an open-source alternative to Langfuse featuring an LLM gateway, observability, optimization, evaluations, and experimentation.
TensorZero and Langfuse both provide open-source tools that streamline LLM engineering workflows.
TensorZero focuses on inference and optimization, while Langfuse specializes in powerful interfaces for observability and evals.
That said, **you can get the best of both worlds by using TensorZero alongside Langfuse**.
## Similarities
* **Open Source & Self-Hosted.**
Both TensorZero and Langfuse are open source and self-hosted.
Your data never leaves your infrastructure, and you don't risk downtime by relying on external APIs.
TensorZero is fully open-source, whereas Langfuse gates some of its features behind a paid license.
* **Built-in Observability.**
Both TensorZero and Langfuse offer built-in observability features, collecting inference data in your own database.
Langfuse offers a broader set of advanced observability features, including application-level tracing.
TensorZero focuses more on structured data collection for optimization, including downstream metrics and feedback.
* **Built-in Evaluations.**
Both TensorZero and Langfuse offer built-in evaluations features, enabling you to sanity check and benchmark the performance of your prompts, models, and more — using heuristics and LLM judges.
TensorZero LLM judges are also TensorZero functions, which means you can optimize them using TensorZero's optimization recipes.
Langfuse offers a broader set of built-in heuristics and UI features for evaluations.
[→ TensorZero Evaluations Overview](/evaluations/)
## Key Differences
### TensorZero
* **Unified Inference API.**
TensorZero offers a unified inference API that allows you to access LLMs from most major model providers with a single integration, with support for structured outputs, tool use, streaming, and more.
Langfuse doesn't provide a built-in LLM gateway.
[→ TensorZero Gateway Quickstart](/quickstart/)
* **Built-in Inference-Time Optimizations.**
TensorZero offers built-in inference-time optimizations (e.g. dynamic in-context learning), allowing you to optimize your inference performance.
Langfuse doesn't offer any inference-time optimizations.
[→ Inference-Time Optimizations with TensorZero](/gateway/guides/inference-time-optimizations/)
* **Optimization Recipes.**
TensorZero offers optimization recipes (e.g. supervised fine-tuning, RLHF, MIPRO) that leverage your own data to improve your LLM's performance.
Langfuse doesn't offer built-in features like this.
[→ Optimization Recipes with TensorZero](/recipes/)
* **Automatic Fallbacks for Higher Reliability.**
TensorZero offers automatic fallbacks to increase reliability.
Langfuse doesn't offer any such features.
[→ Retries & Fallbacks with TensorZero](/gateway/guides/retries-fallbacks/)
* **Automated Experimentation (A/B Testing).**
TensorZero offers built-in experimentation features, allowing you to run experiments on your prompts, models, and inference strategies.
Langfuse doesn't offer any experimentation features.
[→ Run adaptive A/B tests with TensorZero](/experimentation/run-adaptive-ab-tests/)
### Langfuse
* **Advanced Observability & Evaluations.**
While both TensorZero and Langfuse offer observability and evaluations features, Langfuse takes it further with advanced observability features.
Additionally, Langfuse offers a prompt playground; TensorZero doesn't offer one yet (coming soon!).
* **Access Control.**
Langfuse offers access control features like SSO and user management.
TensorZero supports TensorZero API keys for inference, but more advanced access control requires complementary tools like Nginx or OAuth2 Proxy.
[→ Set up auth for TensorZero](/operations/set-up-auth-for-tensorzero)
* **Managed Service.**
Langfuse offers a paid managed (hosted) service in addition to the open-source version.
TensorZero is fully open-source and self-hosted.
Is TensorZero missing any features that are really important to you? Let us know on [GitHub Discussions](https://github.com/tensorzero/tensorzero/discussions), [Slack](https://www.tensorzero.com/slack), or [Discord](https://www.tensorzero.com/discord).
## Combining TensorZero and Langfuse
You can combine TensorZero and Langfuse to get the best of both worlds.
A leading voice agent startup uses TensorZero for inference and optimization, alongside Langfuse for more advanced observability and evals.
# Comparison: TensorZero vs. LiteLLM
Source: https://www.tensorzero.com/docs/comparison/litellm
TensorZero is an open-source alternative to LiteLLM featuring an LLM gateway, observability, optimization, evaluations, and experimentation.
TensorZero and LiteLLM both offer a unified inference API for LLMs, but they have different features beyond that.
TensorZero offers a broader set of features (including observability, optimization, evaluations, and experimentation), whereas LiteLLM offers more traditional gateway features (e.g. budgeting, queuing) and third-party integrations.
That said, **you can get the best of both worlds by using LiteLLM as a model provider inside TensorZero**!
## Similarities
* **Unified Inference API.**
Both TensorZero and LiteLLM offer a unified inference API that allows you to access LLMs from most major model providers with a single integration, with support for structured outputs, batch inference, tool use, streaming, and more.
[→ TensorZero Gateway Quickstart](/quickstart/)
* **Automatic Fallbacks for Higher Reliability.**
Both TensorZero and LiteLLM offer automatic fallbacks to increase reliability.
[→ Retries & Fallbacks with TensorZero](/gateway/guides/retries-fallbacks/)
* **Open Source & Self-Hosted.**
Both TensorZero and LiteLLM are open source and self-hosted.
Your data never leaves your infrastructure, and you don't risk downtime by relying on external APIs.
TensorZero is fully open-source, whereas LiteLLM gates some of its features behind an enterprise license.
* **Inference Caching.**
Both TensorZero and LiteLLM allow you to cache requests to improve latency and reduce costs.
[→ Inference Caching with TensorZero](/gateway/guides/inference-caching/)
* **Multimodal Inference.**
Both TensorZero and LiteLLM support multimodal inference.
[→ Multimodal Inference with TensorZero](/gateway/guides/multimodal-inference/)
## Key Differences
### TensorZero
* **High Performance.**
The TensorZero Gateway was built from the ground up in Rust 🦀 with performance in mind (\<1ms P99 latency at 10,000 QPS).
LiteLLM is built in Python, resulting in 25-100x+ latency overhead and much lower throughput.
[→ Performance Benchmarks: TensorZero vs. LiteLLM](/gateway/benchmarks/)
* **Built-in Observability.**
TensorZero offers its own observability features, collecting inference and feedback data in your own database.
LiteLLM only offers integrations with third-party observability tools like Langfuse.
* **Built-in Evaluations.**
TensorZero offers built-in evaluation functionality, including heuristics and LLM judges.
LiteLLM doesn't offer any evaluations functionality.
[→ TensorZero Evaluations Overview](/evaluations/)
* **Automated Experimentation (A/B Testing).**
TensorZero offers built-in experimentation features, allowing you to run experiments on your prompts, models, and inference strategies.
LiteLLM doesn't offer any experimentation features.
[→ Run adaptive A/B tests with TensorZero](/experimentation/run-adaptive-ab-tests/)
* **Built-in Inference-Time Optimizations.**
TensorZero offers built-in inference-time optimizations (e.g. dynamic in-context learning), allowing you to optimize your inference performance.
LiteLLM doesn't offer any inference-time optimizations.
[→ Inference-Time Optimizations with TensorZero](/gateway/guides/inference-time-optimizations/)
* **Optimization Recipes.**
TensorZero offers optimization recipes (e.g. supervised fine-tuning, RLHF, MIPRO) that leverage your own data to improve your LLM's performance.
LiteLLM doesn't offer any features like this.
[→ Optimization Recipes with TensorZero](/recipes/)
* **Schemas, Templates, GitOps.**
TensorZero enables a schema-first approach to building LLM applications, allowing you to separate your application logic from LLM implementation details.
This approach allows you to more easily manage complex LLM applications, benefit from GitOps for prompt and configuration management, counterfactually improve data for optimization, and more.
LiteLLM only offers the standard unstructured chat completion interface.
[→ Prompt Templates & Schemas with TensorZero](/gateway/create-a-prompt-template)
* **Access Control.**
Both TensorZero and LiteLLM support virtual (custom) API keys to authenticate requests.
LiteLLM offers advanced authentication features in its enterprise plan, whereas TensorZero requires complementary open-source tools like Nginx or OAuth2 Proxy for such use cases.
[→ Set up auth for TensorZero](/operations/set-up-auth-for-tensorzero)
### LiteLLM
* **Dynamic Provider Routing.**
LiteLLM allows you to dynamically route requests to different model providers based on latency, cost, and rate limits.
TensorZero only offers static routing capabilities, i.e. a pre-defined sequence of model providers to attempt.
[→ Retries & Fallbacks with TensorZero](/gateway/guides/retries-fallbacks/)
* **Request Prioritization.**
LiteLLM allows you to prioritize some requests over others, which can be useful for high-priority tasks when you're constrained by rate limits.
TensorZero doesn't offer request prioritization, and instead requires you to manage the request queue externally (e.g. using Redis).
* **Built-in Guardrails Integration.**
LiteLLM offers built-in support for integrations with guardrails tools like AWS Bedrock.
For now, TensorZero doesn't offer built-in guardrails, and instead requires you to manage integrations yourself.
* **Managed Service.**
LiteLLM offers a paid managed (hosted) service in addition to the open-source version.
TensorZero is fully open-source and self-hosted.
Is TensorZero missing any features that are really important to you? Let us know on [GitHub Discussions](https://github.com/tensorzero/tensorzero/discussions), [Slack](https://www.tensorzero.com/slack), or [Discord](https://www.tensorzero.com/discord).
## Combining TensorZero and LiteLLM
You can get the best of both worlds by using LiteLLM as a model provider inside TensorZero.
LiteLLM exposes an OpenAI-compatible API, so you can configure it as an OpenAI-compatible model provider in TensorZero.
Learn more about using [OpenAI-compatible endpoints](/integrations/model-providers/openai-compatible/).
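As a rough sketch, assuming a LiteLLM proxy running locally on port 4000, you might register it as an OpenAI-compatible model provider along these lines (the model name, URL, `LITELLM_API_KEY` variable, and exact field names are assumptions to check against the linked documentation):
```toml title="tensorzero.toml" theme={null}
# Sketch: route a TensorZero model through a LiteLLM proxy assumed to run on localhost:4000.
[models.my_litellm_model]
routing = ["litellm"]

[models.my_litellm_model.providers.litellm]
type = "openai"                            # LiteLLM exposes an OpenAI-compatible API
model_name = "gpt-4o-mini"                 # whichever model name your LiteLLM proxy serves (assumption)
api_base = "http://localhost:4000/v1"      # LiteLLM proxy URL (assumption)
api_key_location = "env::LITELLM_API_KEY"  # hypothetical env var holding your LiteLLM key
```
You can then reference `my_litellm_model` from a variant like any other TensorZero model.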
# Comparison: TensorZero vs. OpenPipe
Source: https://www.tensorzero.com/docs/comparison/openpipe
TensorZero is an open-source alternative to OpenPipe featuring an LLM gateway, observability, optimization, evaluations, and experimentation.
TensorZero and OpenPipe both provide tools that streamline fine-tuning workflows for LLMs.
TensorZero is open-source and self-hosted, while OpenPipe is a paid managed service (inference costs \~2x more than specialized providers supported by TensorZero).
That said, **you can get the best of both worlds by using OpenPipe as a model provider inside TensorZero**.
## Similarities
* **LLM Optimization (Fine-Tuning).**
Both TensorZero and OpenPipe focus on LLM optimization (e.g. fine-tuning, DPO).
OpenPipe focuses on fine-tuning, while TensorZero provides a complete set of tools for optimizing LLM systems (including prompts, models, and inference strategies).
[→ Optimization Recipes with TensorZero](/recipes/)
* **Built-in Observability.**
Both TensorZero and OpenPipe offer built-in observability features.
TensorZero stores inference data in your own database for full privacy and control, while OpenPipe stores it in their own cloud.
* **Built-in Evaluations.**
Both TensorZero and OpenPipe offer built-in evaluations features, enabling you to sanity check and benchmark the performance of your prompts, models, and more — using heuristics and LLM judges.
TensorZero LLM judges are also TensorZero functions, which means you can optimize them using TensorZero's optimization recipes.
[→ TensorZero Evaluations Overview](/evaluations/)
## Key Differences
### TensorZero
* **Open Source & Self-Hosted.**
TensorZero is fully open source and self-hosted.
Your data never leaves your infrastructure, and you don't risk downtime by relying on external APIs.
OpenPipe is a closed-source managed service.
* **No Added Cost (& Cheaper Inference Providers).**
TensorZero is free to use: you bring your own LLM API keys and there is no additional cost.
OpenPipe charges \~2x on inference costs compared to specialized providers supported by TensorZero (e.g. Fireworks AI).
* **Unified Inference API.**
TensorZero offers a unified inference API that allows you to access LLMs from most major model providers with a single integration, with support for structured outputs, tool use, streaming, and more.
OpenPipe supports a much smaller set of LLMs.
[→ TensorZero Gateway Quickstart](/quickstart/)
* **Built-in Inference-Time Optimizations.**
TensorZero offers built-in inference-time optimizations (e.g. dynamic in-context learning), allowing you to optimize your inference performance.
OpenPipe doesn't offer any inference-time optimizations.
[→ Inference-Time Optimizations with TensorZero](/gateway/guides/inference-time-optimizations/)
* **Automatic Fallbacks for Higher Reliability.**
TensorZero is self-hosted and provides automatic fallbacks between model providers to increase reliability.
OpenPipe can fall back from its own models to other OpenAI-compatible APIs, but if OpenPipe itself goes down, you're out of luck.
[→ Retries & Fallbacks with TensorZero](/gateway/guides/retries-fallbacks/)
* **Automated Experimentation (A/B Testing).**
TensorZero offers built-in experimentation features, allowing you to run experiments on your prompts, models, and inference strategies.
OpenPipe doesn't offer any experimentation features.
[→ Run adaptive A/B tests with TensorZero](/experimentation/run-adaptive-ab-tests/)
* **Batch Inference.**
TensorZero supports batch inference with certain model providers, which significantly reduces inference costs.
OpenPipe doesn't support batch inference.
[→ Batch Inference with TensorZero](/gateway/guides/batch-inference/)
* **Inference Caching.**
Both TensorZero and OpenPipe allow you to cache requests to improve latency and reduce costs.
OpenPipe only caches requests to their own models, while TensorZero caches requests to all model providers.
[→ Inference Caching with TensorZero](/gateway/guides/inference-caching/)
* **Schemas, Templates, GitOps.**
TensorZero enables a schema-first approach to building LLM applications, allowing you to separate your application logic from LLM implementation details.
This approach allows you to more easily manage complex LLM applications, benefit from GitOps for prompt and configuration management, counterfactually improve data for optimization, and more.
OpenPipe only offers the standard unstructured chat completion interface.
[→ Prompt Templates & Schemas with TensorZero](/gateway/create-a-prompt-template)
### OpenPipe
* **Guardrails.**
OpenPipe offers guardrails (runtime AI judges) for your fine-tuned models.
TensorZero doesn't offer built-in guardrails, and instead requires you to manage them yourself.
Is TensorZero missing any features that are really important to you? Let us know on [GitHub Discussions](https://github.com/tensorzero/tensorzero/discussions), [Slack](https://www.tensorzero.com/slack), or [Discord](https://www.tensorzero.com/discord).
## Combining TensorZero and OpenPipe
You can get the best of both worlds by using OpenPipe as a model provider inside TensorZero.
OpenPipe provides an OpenAI-compatible API, so you can serve models you previously fine-tuned with OpenPipe through TensorZero.
Learn more about using [OpenAI-compatible endpoints](/integrations/model-providers/openai-compatible/).
# Comparison: TensorZero vs. OpenRouter
Source: https://www.tensorzero.com/docs/comparison/openrouter
TensorZero is an open-source alternative to OpenRouter featuring an LLM gateway, observability, optimization, evaluations, and experimentation.
TensorZero and OpenRouter both offer a unified inference API for LLMs, but they have different features beyond that.
TensorZero offers a more comprehensive set of features (including observability, optimization, evaluations, and experimentation), whereas OpenRouter offers more dynamic routing capabilities.
That said, **you can get the best of both worlds by using OpenRouter as a model provider inside TensorZero**!
## Similarities
* **Unified Inference API.**
Both TensorZero and OpenRouter offer a unified inference API that allows you to access LLMs from most major model providers with a single integration, with support for structured outputs, tool use, streaming, and more.
[→ TensorZero Gateway Quickstart](/quickstart/)
* **Automatic Fallbacks for Higher Reliability.**
Both TensorZero and OpenRouter offer automatic fallbacks to increase reliability.
[→ Retries & Fallbacks with TensorZero](/gateway/guides/retries-fallbacks/)
## Key Differences
### TensorZero
* **Open Source & Self-Hosted.**
TensorZero is fully open source and self-hosted.
Your data never leaves your infrastructure, and you don't risk downtime by relying on external APIs.
OpenRouter is a closed-source external API.
* **No Added Cost.**
TensorZero is free to use: you bring your own LLM API keys and there is no additional cost.
OpenRouter charges 5% of your inference spend when you bring your own API keys.
* **Built-in Observability.**
TensorZero offers built-in observability features, collecting inference and feedback data in your own database.
OpenRouter doesn't offer any observability features.
* **Built-in Evaluations.**
TensorZero offers built-in evaluation functionality, including heuristics and LLM judges.
OpenRouter doesn't offer any evaluation features.
[→ TensorZero Evaluations Overview](/evaluations/)
* **Automated Experimentation (A/B Testing).**
TensorZero offers built-in experimentation features, allowing you to run experiments on your prompts, models, and inference strategies.
OpenRouter doesn't offer any experimentation features.
[→ Run adaptive A/B tests with TensorZero](/experimentation/run-adaptive-ab-tests/)
* **Built-in Inference-Time Optimizations.**
TensorZero offers built-in inference-time optimizations (e.g. dynamic in-context learning), allowing you to optimize your inference performance.
OpenRouter doesn't offer any inference-time optimizations, except for dynamic model routing via NotDiamond.
[→ Inference-Time Optimizations with TensorZero](/gateway/guides/inference-time-optimizations/)
* **Optimization Recipes.**
TensorZero offers optimization recipes (e.g. supervised fine-tuning, RLHF, MIPRO) that leverage your own data to improve your LLM's performance.
OpenRouter doesn't offer any features like this.
[→ Optimization Recipes with TensorZero](/recipes/)
* **Batch Inference.**
TensorZero supports batch inference with certain model providers, which significantly reduces inference costs.
OpenRouter doesn't support batch inference.
[→ Batch Inference with TensorZero](/gateway/guides/batch-inference/)
* **Inference Caching.**
TensorZero offers inference caching, which can significantly reduce inference costs and latency.
OpenRouter doesn't offer inference caching.
[→ Inference Caching with TensorZero](/gateway/guides/inference-caching/)
* **Schemas, Templates, GitOps.**
TensorZero enables a schema-first approach to building LLM applications, allowing you to separate your application logic from LLM implementation details.
This approach allows you to more easily manage complex LLM applications, benefit from GitOps for prompt and configuration management, counterfactually improve data for optimization, and more.
OpenRouter only offers the standard unstructured chat completion interface.
[→ Prompt Templates & Schemas with TensorZero](/gateway/create-a-prompt-template)
### OpenRouter
* **Dynamic Provider Routing.**
OpenRouter allows you to dynamically route requests to different model providers based on latency, cost, and availability.
TensorZero only offers static routing capabilities, i.e. a pre-defined sequence of model providers to attempt.
[→ Retries & Fallbacks with TensorZero](/gateway/guides/retries-fallbacks/)
* **Dynamic Model Routing.**
OpenRouter integrates with NotDiamond to offer dynamic model routing based on input.
TensorZero supports other inference-time optimizations but doesn't support dynamic model routing at this time.
[→ Inference-Time Optimizations with TensorZero](/gateway/guides/inference-time-optimizations/)
* **Consolidated Billing.**
OpenRouter allows you to access every supported model using a single OpenRouter API key.
Under the hood, OpenRouter uses their own API keys with model providers.
This approach can increase your rate limits and streamline billing, but slightly increases your inference costs.
TensorZero requires you to use your own API keys, without any added cost.
Is TensorZero missing any features that are really important to you? Let us know on [GitHub Discussions](https://github.com/tensorzero/tensorzero/discussions), [Slack](https://www.tensorzero.com/slack), or [Discord](https://www.tensorzero.com/discord).
## Combining TensorZero and OpenRouter
You can get the best of both worlds by using OpenRouter as a model provider inside TensorZero.
Learn more about using [OpenRouter as a model provider](/integrations/model-providers/openrouter/).
# Comparison: TensorZero vs. Portkey
Source: https://www.tensorzero.com/docs/comparison/portkey
TensorZero is an open-source alternative to Portkey featuring an LLM gateway, observability, optimization, evaluations, and experimentation.
TensorZero and Portkey offer diverse features to streamline LLM engineering, including an LLM gateway, observability tools, and more.
TensorZero is fully open-source and self-hosted, while Portkey offers an open-source gateway but otherwise requires a paid commercial (hosted) service.
Additionally, TensorZero has more features around LLM optimization (e.g. advanced fine-tuning workflows and inference-time optimizations), whereas Portkey has a broader set of features around the UI (e.g. prompt playground).
## Similarities
* **Unified Inference API.**
Both TensorZero and Portkey offer a unified inference API that allows you to access LLMs from most major model providers with a single integration, with support for structured outputs, batch inference, tool use, streaming, and more.
[→ TensorZero Gateway Quickstart](/quickstart/)
* **Automatic Fallbacks, Retries, & Load Balancing for Higher Reliability.**
Both TensorZero and Portkey offer automatic fallbacks, retries, and load balancing features to increase reliability.
[→ Retries & Fallbacks with TensorZero](/gateway/guides/retries-fallbacks/)
* **Schemas, Templates.**
Both TensorZero and Portkey offer schema and template features to help you manage your LLM applications.
[→ Prompt Templates & Schemas with TensorZero](/gateway/create-a-prompt-template)
* **Multimodal Inference.**
Both TensorZero and Portkey support multimodal inference.
[→ Multimodal Inference with TensorZero](/gateway/guides/multimodal-inference/)
## Key Differences
### TensorZero
* **Open-Source Observability.**
TensorZero offers built-in open-source observability features, collecting inference and feedback data in your own database.
Portkey also offers observability features, but they are limited to their commercial (hosted) offering.
* **Built-in Evaluations.**
TensorZero offers built-in evaluation functionality, including heuristics and LLM judges.
Portkey doesn't offer any evaluation features.
[→ TensorZero Evaluations Overview](/evaluations/)
* **Open-Source Inference Caching.**
TensorZero offers open-source inference caching features, allowing you to cache requests to improve latency and reduce costs.
Portkey also offers inference caching features, but they are limited to their commercial (hosted) offering.
[→ Inference Caching with TensorZero](/gateway/guides/inference-caching/)
* **Open-Source Fine-Tuning Workflows.**
TensorZero offers open-source built-in fine-tuning workflows, allowing you to create custom models using your own data.
Portkey also offers fine-tuning features, but they are limited to their enterprise (\$\$\$) offering.
[→ Fine-Tuning Recipes with TensorZero](/recipes/)
* **Advanced Fine-Tuning Workflows.**
TensorZero offers advanced fine-tuning workflows, including the ability to curate datasets using feedback signals (e.g. production metrics) and support for RLHF.
Portkey doesn't offer similar features.
[→ Fine-Tuning Recipes with TensorZero](/recipes/)
* **Automated Experimentation (A/B Testing).**
TensorZero offers advanced A/B testing features, including automated experimentation, to help you identify the best models and prompts for your use cases.
Portkey only offers simple canary and A/B testing features.
[→ Run adaptive A/B tests with TensorZero](/experimentation/run-adaptive-ab-tests/)
* **Inference-Time Optimizations.**
TensorZero offers built-in inference-time optimizations (e.g. dynamic in-context learning), allowing you to optimize your inference performance.
Portkey doesn't offer any inference-time optimizations.
[→ Inference-Time Optimizations with TensorZero](/gateway/guides/inference-time-optimizations/)
* **Programmatic & GitOps-Friendly Orchestration.**
TensorZero can be fully orchestrated programmatically in a GitOps-friendly way.
Portkey can manage some of its features programmatically, but certain features depend on its external commercial hosted service.
* **Open-Source Access Control.**
Both TensorZero and Portkey offer access control features like API keys for inference.
Portkey only offers them in the commercial (hosted) offering, whereas TensorZero's solution is fully open-source.
[→ Set up auth for TensorZero](/operations/set-up-auth-for-tensorzero)
### Portkey
* **Prompt Playground.**
Portkey offers a prompt playground in its commercial (hosted) offering, allowing you to test your prompts and models in a graphical interface.
TensorZero doesn't offer a prompt playground today (coming soon!).
* **Guardrails.**
Portkey offers guardrails features, including integrations with third-party guardrails providers and the ability to use custom guardrails using webhooks.
For now, TensorZero doesn't offer built-in guardrails, and instead requires you to manage integrations yourself.
* **Managed Service.**
Portkey offers a paid managed (hosted) service in addition to the open-source version.
TensorZero is fully open-source and self-hosted.
# Deploy ClickHouse (optional)
Source: https://www.tensorzero.com/docs/deployment/clickhouse
Learn how to deploy ClickHouse for TensorZero's observability features.
The TensorZero Gateway can optionally collect inference and feedback data for observability, optimization, evaluation, and experimentation.
Under the hood, TensorZero stores this data in ClickHouse, an open-source columnar database that is optimized for analytical workloads.
If you're planning to use the gateway without observability, you don't need to
deploy ClickHouse.
## Deploy
### Development
For development purposes, you can run a single-node ClickHouse instance locally (e.g. using Homebrew or Docker) or a cheap Development-tier cluster on ClickHouse Cloud.
See the [ClickHouse documentation](https://clickhouse.com/docs/install) for more details on configuring your ClickHouse deployment.
### Production
#### Managed deployments
For production deployments, the easiest setup is to use a managed service like ClickHouse Cloud.
ClickHouse Cloud is also available through the [AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-jettukeanwrfc), [GCP Marketplace](https://console.cloud.google.com/marketplace/product/clickhouse-public/clickhouse-cloud), and [Azure Marketplace](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/clickhouse.clickhouse_cloud).
Other options for managed ClickHouse deployments include [Tinybird](https://www.tinybird.co/) (serverless) and [Altinity](https://www.altinity.com/) (hands-on support).
TensorZero tests against ClickHouse Cloud's `regular` (recommended) and `fast` release channels.
#### Self-hosted deployments
You can alternatively run your own self-managed ClickHouse instance or cluster.
**We strongly recommend using ClickHouse `lts` instead of `latest` in production.**
We test against both versions, but ClickHouse `latest` often has bugs and breaking changes.
TensorZero supports single-node and replicated deployments.
TensorZero does not currently support **sharded** self-hosted ClickHouse deployments.
See the [ClickHouse documentation](https://clickhouse.com/docs/install) for more details on configuring your ClickHouse deployment.
## Configure
### Connect to ClickHouse
To configure TensorZero to use ClickHouse, set the `TENSORZERO_CLICKHOUSE_URL` environment variable with your ClickHouse connection details.
```bash title=".env" theme={null}
TENSORZERO_CLICKHOUSE_URL="http[s]://[username]:[password]@[hostname]:[port]/[database]"
# Example: ClickHouse running locally
TENSORZERO_CLICKHOUSE_URL="http://chuser:chpassword@localhost:8123/tensorzero"
# Example: ClickHouse Cloud
TENSORZERO_CLICKHOUSE_URL="https://USERNAME:PASSWORD@XXXXX.clickhouse.cloud:8443/tensorzero"
# Example: TensorZero Gateway running in a container, ClickHouse running on host machine
TENSORZERO_CLICKHOUSE_URL="http://host.docker.internal:8123/tensorzero"
```
If you're using a self-hosted replicated ClickHouse deployment, you must also provide the ClickHouse cluster name in the `TENSORZERO_CLICKHOUSE_CLUSTER_NAME` environment variable.
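For example (the value below is a placeholder for the cluster name configured in your ClickHouse deployment):
```bash title=".env" theme={null}
# Only needed for self-hosted replicated ClickHouse deployments
TENSORZERO_CLICKHOUSE_CLUSTER_NAME="your_cluster_name"
```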
### Apply ClickHouse migrations
By default, the TensorZero Gateway applies ClickHouse migrations automatically when it starts up.
This behavior can be suppressed by setting `observability.disable_automatic_migrations = true` under the `[gateway]` section of `config/tensorzero.toml`.
See the [Configuration Reference](/gateway/configuration-reference#gateway) for more details.
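For example, in your configuration file:
```toml title="tensorzero.toml" theme={null}
[gateway]
observability.disable_automatic_migrations = true
```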
If automatic migrations are disabled, then you must apply them manually with
`docker run --rm -e TENSORZERO_CLICKHOUSE_URL=$TENSORZERO_CLICKHOUSE_URL tensorzero/gateway:{version} --run-clickhouse-migrations`.
The gateway will error on startup if automatic migrations are disabled and any required migrations are missing.
If you're using a self-hosted replicated ClickHouse deployment, you must apply database migrations manually;
they cannot be applied automatically.
# Optimize latency and throughput
Source: https://www.tensorzero.com/docs/deployment/optimize-latency-and-throughput
Learn how to optimize the performance of the TensorZero Gateway for lower latency and higher throughput.
The TensorZero Gateway is designed from the ground up with performance in mind.
Even with default settings, the gateway is fast and lightweight enough to be unnoticeable in most applications.
The best practices below are designed to help you optimize the performance of the TensorZero Gateway for production deployments requiring maximum performance.
The TensorZero Gateway can achieve \<1ms P99 latency overhead at 10,000+ QPS. See [Benchmarks](/gateway/benchmarks/) for details.
## Best practices
### Observability data collection strategy
By default, the gateway takes a conservative approach to observability data durability, ensuring that data is persisted in ClickHouse before sending a response to the client.
This strategy provides a consistent and reliable experience but can introduce latency overhead.
For scenarios where latency and throughput are critical, the gateway can be configured to sacrifice data durability guarantees for better performance.
If latency is critical for your application, you can enable `gateway.observability.async_writes` or `gateway.observability.batch_writes`.
With either of these settings, the gateway will return the response to the client immediately and asynchronously insert data into ClickHouse.
The former will immediately insert each row individually, while the latter will batch multiple rows together for more efficient writes.
As a rule of thumb, consider the following decision matrix:
| | **High throughput** | **Low throughput** |
| --------------------------- | ------------------- | ------------------ |
| **Latency is critical** | `batch_writes` | `async_writes` |
| **Latency is not critical** | `batch_writes` | Default strategy |
See the [Configuration Reference](/gateway/configuration-reference/) for more details.
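As a minimal sketch, assuming both settings are simple booleans under `[gateway.observability]` (the exact shape, e.g. batching parameters, is documented in the Configuration Reference):
```toml title="tensorzero.toml" theme={null}
[gateway.observability]
# Return responses immediately and insert each row into ClickHouse asynchronously:
async_writes = true
# Alternatively, batch multiple rows together for more efficient writes:
# batch_writes = true
```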
### Other recommendations
* Ensure your application, the TensorZero Gateway, and ClickHouse are deployed in the same region to minimize network latency.
* Initialize the client once and reuse it as much as possible, to avoid initialization overhead and to keep the connection alive.
# Deploy Postgres (optional)
Source: https://www.tensorzero.com/docs/deployment/postgres
Learn how to deploy Postgres for advanced TensorZero features.
**Most TensorZero deployments will not require Postgres.**
TensorZero only requires Postgres for certain advanced features.
Most notably, you need to deploy Postgres to [run adaptive A/B tests](/experimentation/run-adaptive-ab-tests) and [set up auth for TensorZero](/operations/set-up-auth-for-tensorzero).
## Deploy
You can self-host Postgres or use a managed service (e.g. AWS RDS, Supabase, PlanetScale).
Follow the deployment instructions for your chosen service.
Internally, we test TensorZero using self-hosted Postgres 14.
If you find any compatibility issues, please open a detailed [GitHub Discussion](https://github.com/tensorzero/tensorzero/discussions/new?category=bug-reports).
## Configure
### Connect to Postgres
To configure TensorZero to use Postgres, set the `TENSORZERO_POSTGRES_URL` environment variable with your Postgres connection details.
```bash title=".env" theme={null}
TENSORZERO_POSTGRES_URL="postgres://[username]:[password]@[hostname]:[port]/[database]"
# Example:
TENSORZERO_POSTGRES_URL="postgres://myuser:mypass@localhost:5432/tensorzero"
```
### Apply Postgres migrations
Unlike with ClickHouse, **TensorZero does not automatically apply Postgres migrations.**
You must apply migrations manually with `gateway --run-postgres-migrations`.
If you've configured the gateway with Docker Compose, you can run the migrations with:
```bash theme={null}
docker-compose run --rm gateway --run-postgres-migrations
```
See [Deploy the TensorZero Gateway](/deployment/tensorzero-gateway) for more details.
In most other cases, you can run the migrations with:
```bash theme={null}
docker run --rm --network host \
-e TENSORZERO_POSTGRES_URL \
tensorzero/gateway --run-postgres-migrations
```
# Set up TensorZero Autopilot
Source: https://www.tensorzero.com/docs/deployment/tensorzero-autopilot
Learn how to set up TensorZero Autopilot on your self-hosted TensorZero deployment.
TensorZero Autopilot is an automated AI engineer that analyzes LLM observability data, optimizes prompts and models, sets up evals, and runs A/B tests.
It's an optional complementary service that runs on top of your self-hosted TensorZero deployment.
TensorZero Autopilot is currently in a private beta. [Join the waitlist →](https://tensorzerodotcom.notion.site/2d87520bbad380c9ad0dd19566b3bc91)
## Deploy
Visit [autopilot.tensorzero.com](https://autopilot.tensorzero.com/) to generate an API key.
Set the environment variable `TENSORZERO_AUTOPILOT_API_KEY` for your TensorZero Gateway:
```bash theme={null}
export TENSORZERO_AUTOPILOT_API_KEY="sk-t0-..."
```
TensorZero Autopilot requires the TensorZero Gateway, TensorZero UI, ClickHouse, and Postgres.
Make sure the gateway has the `TENSORZERO_AUTOPILOT_API_KEY` environment variable.
Learn more about how to:
* [Deploy the TensorZero Gateway](/deployment/tensorzero-gateway)
* [Deploy the TensorZero UI](/deployment/tensorzero-ui)
* [Deploy ClickHouse](/deployment/clickhouse)
* [Deploy Postgres](/deployment/postgres)
Visit `/autopilot` in the self-hosted TensorZero UI to use Autopilot.
# Deploy the TensorZero Gateway
Source: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
Learn how to deploy and customize the TensorZero Gateway.
The TensorZero Gateway is the core component that handles inference requests and collects observability data.
It's easy to get started with the TensorZero Gateway.
You only need to deploy a standalone gateway if you plan to use the TensorZero UI or interact with the gateway from programming languages other than Python.
The TensorZero Python SDK includes a built-in embedded gateway, so you don't need to deploy a standalone gateway if you're only using Python.
See the [Clients](/gateway/clients/) page for more details on how to interact with the TensorZero Gateway.
## Deploy
The gateway requires one of the following command line arguments:
* `--default-config`: Use default configuration settings.
* `--config-file path/to/tensorzero.toml`: Use a custom configuration file. Glob patterns are also supported, e.g. `--config-file /path/to/**/*.toml`.
* `--run-clickhouse-migrations`: Run ClickHouse database migrations and exit.
* `--run-postgres-migrations`: Run PostgreSQL database migrations and exit.
There are many ways to deploy the TensorZero Gateway.
Here are a few examples:
You can easily run the TensorZero Gateway locally using Docker.
If you don't have custom configuration, you can use:
```bash title="Running with Docker (default configuration)" theme={null}
docker run \
--env-file .env \
-p 3000:3000 \
tensorzero/gateway \
--default-config
```
If you have custom configuration, you can use:
```bash title="Running with Docker (custom configuration)" theme={null}
docker run \
-v "./config:/app/config" \
--env-file .env \
-p 3000:3000 \
tensorzero/gateway \
--config-file config/tensorzero.toml
```
We provide an example production-grade [`docker-compose.yml`](https://github.com/tensorzero/tensorzero/blob/main/examples/production-deployment/docker-compose.yml) for reference.
We provide a reference Helm chart in our [GitHub repository](https://github.com/tensorzero/tensorzero/tree/main/examples/production-deployment-k8s-helm).
You can use it to run TensorZero in Kubernetes.
The chart is available on [ArtifactHub](https://artifacthub.io/packages/helm/tensorzero/tensorzero).
You can build the TensorZero Gateway from source and run it directly on your host machine using [Cargo](https://doc.rust-lang.org/cargo/).
```bash title="Building from source" theme={null}
cargo run --profile performance --bin gateway -- --config-file path/to/your/tensorzero.toml
```
See the [optimizing latency and throughput](/deployment/optimize-latency-and-throughput/) guide to learn how to configure the gateway for high-performance deployments.
## Configure
### Set up model provider credentials
The TensorZero Gateway reads model provider credentials from environment variables.
Unless you specify an alternative credential location in your configuration file, these environment variables are required for every provider used in a variant with positive weight, and the gateway will fail on startup if any are missing.
By default, the gateway uses the following credentials:
| Provider | Environment Variable(s) |
| ----------------------- | --------------------------------------------------------------------------------------------------------- |
| Anthropic | `ANTHROPIC_API_KEY` |
| AWS Bedrock | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` (see [details](/integrations/model-providers/aws-bedrock)) |
| AWS SageMaker | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` (see [details](/integrations/model-providers/aws-sagemaker)) |
| Azure OpenAI | `AZURE_API_KEY` |
| Fireworks | `FIREWORKS_API_KEY` |
| GCP Vertex AI Anthropic | `GCP_VERTEX_CREDENTIALS_PATH` (see [details](/integrations/model-providers/gcp-vertex-ai-anthropic)) |
| GCP Vertex AI Gemini | `GCP_VERTEX_CREDENTIALS_PATH` (see [details](/integrations/model-providers/gcp-vertex-ai-gemini)) |
| Google AI Studio Gemini | `GOOGLE_AI_STUDIO_GEMINI_API_KEY` |
| Groq | `GROQ_API_KEY` |
| Hyperbolic | `HYPERBOLIC_API_KEY` |
| Mistral | `MISTRAL_API_KEY` |
| OpenAI | `OPENAI_API_KEY` |
| OpenRouter | `OPENROUTER_API_KEY` |
| Together | `TOGETHER_API_KEY` |
| xAI | `XAI_API_KEY` |
See [`.env.example`](https://github.com/tensorzero/tensorzero/blob/main/examples/production-deployment/.env.example) for a complete example with every supported environment variable.
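For example, a deployment that only uses OpenAI and Anthropic models would need a `.env` file along these lines (placeholder values shown):
```bash title=".env" theme={null}
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="sk-ant-..."
```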
### Set up custom configuration
Optionally, you can use a configuration file to customize the behavior of the gateway.
See [Configuration Reference](/gateway/configuration-reference) for more details.
### Disable pseudonymous usage analytics
TensorZero collects *pseudonymous* usage analytics to help our team improve the product.
The collected data includes *aggregated* metrics about TensorZero itself, but does NOT include your application's data.
To be explicit: TensorZero does NOT share any inference input or output.
TensorZero also does NOT share the name of any function, variant, metric, or similar application-specific identifiers.
See `howdy.rs` in the GitHub repository for exactly what usage data is collected and shared with TensorZero.
To disable usage analytics, set the following configuration in the `tensorzero.toml` file:
```toml title="tensorzero.toml" theme={null}
[gateway]
disable_pseudonymous_usage_analytics = true
```
Alternatively, you can also set the environment variable `TENSORZERO_DISABLE_PSEUDONYMOUS_USAGE_ANALYTICS=1`.
### Set up observability with ClickHouse
Optionally, the TensorZero Gateway can collect inference and feedback data for observability, optimization, evaluations, and experimentation.
After [deploying ClickHouse](/deployment/clickhouse), you need to configure the `TENSORZERO_CLICKHOUSE_URL` environment variable with the connection details.
If you don't provide this environment variable, observability will be disabled.
We recommend setting up observability early to monitor your LLM application and collect data for future optimization, but this can be done incrementally as needed.
### Customize the logging format
Optionally, you can provide the following command line argument to customize the gateway's logging format:
* `--log-format`: Set the logging format to either `pretty` (default) or `json`.
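For example, extending the Docker command from earlier in this guide to emit JSON logs:
```bash title="Running with Docker (JSON logs)" theme={null}
docker run \
  --env-file .env \
  -p 3000:3000 \
  tensorzero/gateway \
  --default-config \
  --log-format json
```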
### Add a status or health check
The TensorZero Gateway exposes endpoints for status and health checks.
The `/status` endpoint checks that the gateway is running successfully.
```json title="GET /status" theme={null}
{ "status": "ok" }
```
The `/health` endpoint additionally checks that it can communicate with ClickHouse (if observability is enabled).
```json title="GET /health" theme={null}
{ "gateway": "ok", "clickhouse": "ok" }
```
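For example, assuming the gateway is published on port 3000 as in the Docker examples above, you can check both endpoints with `curl`:
```bash theme={null}
# Liveness: the gateway process is running
curl http://localhost:3000/status
# Readiness: the gateway can also reach ClickHouse (if observability is enabled)
curl http://localhost:3000/health
```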
# Deploy the TensorZero UI
Source: https://www.tensorzero.com/docs/deployment/tensorzero-ui
Learn how to deploy and customize the TensorZero UI.
The TensorZero UI is a self-hosted web application that streamlines the use of TensorZero with features like observability and optimization.
It's easy to get started with the TensorZero UI.
## Deploy
[Deploy the TensorZero Gateway](/deployment/tensorzero-gateway/) and configure `TENSORZERO_GATEWAY_URL`.
For example, if the gateway is running locally, you can set `TENSORZERO_GATEWAY_URL=http://localhost:3000`.
The TensorZero UI is available on Docker Hub as `tensorzero/ui`.
You can easily run the TensorZero UI using Docker Compose:
```yaml theme={null}
services:
ui:
image: tensorzero/ui
    # Add your environment variables to the .env file
env_file:
- ${ENV_FILE:-.env}
# Publish the UI to port 4000
ports:
- "4000:4000"
restart: unless-stopped
```
Make sure to create a `.env` file with the relevant environment variables.
For more details, see the example `docker-compose.yml` file in the [GitHub repository](https://github.com/tensorzero/tensorzero/blob/main/ui/docker-compose.yml).
Alternatively, you can launch the UI directly with the following command:
```bash theme={null}
docker run \
--volume ./config:/app/config:ro \
--env-file ./.env \
--publish 4000:4000 \
tensorzero/ui
```
Make sure to create a `.env` file with the relevant environment variables.
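As a minimal sketch, a `.env` file for the UI might reuse the gateway URL and ClickHouse connection string from the previous sections (see the example `docker-compose.yml` for the full list of variables):
```bash title=".env" theme={null}
TENSORZERO_GATEWAY_URL="http://localhost:3000"
TENSORZERO_CLICKHOUSE_URL="http://chuser:chpassword@localhost:8123/tensorzero"
```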
We provide a reference Helm chart in our [GitHub repository](https://github.com/tensorzero/tensorzero/tree/main/examples/production-deployment-k8s-helm).
You can use it to run TensorZero in Kubernetes.
The chart is available on [ArtifactHub](https://artifacthub.io/packages/helm/tensorzero/tensorzero).
Alternatively, you can build the UI from source.
See our [GitHub repository](https://github.com/tensorzero/tensorzero/blob/main/ui/) for more details.
## Configure
### Add a health check
The TensorZero UI exposes an endpoint for health checks.
This `/health` endpoint checks that the UI is running, the associated configuration is valid, and the ClickHouse connection is healthy.
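For example, assuming the UI is published on port 4000 as in the examples above:
```bash theme={null}
curl http://localhost:4000/health
```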
### Customize the deployment
The TensorZero UI supports the following optional environment variables:
* `TENSORZERO_UI_LOG_LEVEL`: Controls log verbosity. The allowed values are `debug`, `info` (default), `warn`, and `error`.
* `HOST`: For certain uncommon scenarios (e.g. IPv6), you can customize `HOST` inside the UI container. See the Vite documentation for more details.
# Deploy Valkey / Redis (optional)
Source: https://www.tensorzero.com/docs/deployment/valkey-redis
Learn how to deploy Valkey for high-performance rate limiting in TensorZero.
**Most TensorZero deployments will not require Valkey or Redis.**
TensorZero can use a Redis-compatible data store like [Valkey](https://valkey.io/) as a high-performance backend for its [rate limiting functionality](/operations/enforce-custom-rate-limits).
We recommend Valkey over Postgres if you're handling 100+ QPS or have extreme latency requirements.
TensorZero's rate limiting implementation can achieve sub-millisecond P99 latency at 10k+ QPS using Valkey.
## Deploy
You can self-host Valkey or use a managed Redis-compatible service (e.g. AWS ElastiCache, GCP Memorystore).
Add Valkey to your `docker-compose.yml`:
```yaml title="docker-compose.yml" theme={null}
services:
valkey:
image: valkey/valkey:8
ports:
- "6379:6379"
volumes:
- valkey-data:/data
volumes:
valkey-data:
```
Run Valkey with Docker:
```bash theme={null}
docker run -d --name valkey -p 6379:6379 valkey/valkey:8
```
If you find any compatibility issues, please open a detailed [GitHub Discussion](https://github.com/tensorzero/tensorzero/discussions/new?category=bug-reports).
## Configure
To configure TensorZero to use Valkey, set the `TENSORZERO_VALKEY_URL` environment variable with your Valkey connection details.
```bash title=".env" theme={null}
TENSORZERO_VALKEY_URL="redis://[hostname]:[port]"
# Example:
TENSORZERO_VALKEY_URL="redis://localhost:6379"
```
TensorZero automatically loads the required Lua functions into Valkey on startup.
No manual setup is required.
If both `TENSORZERO_VALKEY_URL` and `TENSORZERO_POSTGRES_URL` are set, the gateway uses Valkey for rate limiting.
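For example, a gateway configured with both backends might have the following in its `.env` file (illustrative values):
```bash title=".env" theme={null}
# With both set, the gateway uses Valkey for rate limiting
TENSORZERO_VALKEY_URL="redis://localhost:6379"
TENSORZERO_POSTGRES_URL="postgres://user:password@localhost:5432/tensorzero"
```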
## Best Practices
### Durability
A critical failure of Valkey (e.g. server crash) may result in loss of rate limiting data since the last backup.
This is generally tolerable if your rate limiting windows are short (e.g. minutes), but if you require precise limits or longer time windows, we recommend configuring [recurring RDB (point-in-time) snapshots](https://valkey.io/topics/persistence/) for improved durability.
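For example, one way to enable periodic snapshots when running Valkey with Docker is to pass the standard Redis-style `save` directive on the command line (a sketch; see the Valkey persistence documentation linked above for the full set of options):
```bash theme={null}
# Snapshot the dataset to /data every 60 seconds if at least 1 key changed
docker run -d --name valkey \
  -p 6379:6379 \
  -v valkey-data:/data \
  valkey/valkey:8 \
  valkey-server --save 60 1
```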
# TensorZero Evaluations Overview
Source: https://www.tensorzero.com/docs/evaluations/index
Learn how to use TensorZero Evaluations to build principled LLM-powered applications.
TensorZero offers two types of evaluations:
**Inference Evaluations** focus on evaluating the performance of a TensorZero variant (i.e. a choice of prompt, model, inference strategy, etc.) on a given dataset.
**Workflow Evaluations** focus on evaluating complex workflows that might include multiple TensorZero inference calls, arbitrary application logic, and more.
As a rough analogy, inference evaluations are like unit tests for individual inference calls, and workflow evaluations are like integration tests for complex workflows.
***
# CLI Reference
Source: https://www.tensorzero.com/docs/evaluations/inference-evaluations/cli-reference
Learn how to use the TensorZero Evaluations CLI.
TensorZero Evaluations is available both through a command-line interface (CLI) tool and through the TensorZero UI.
## Usage
We provide a `tensorzero/evaluations` Docker image for easy usage.
We strongly recommend using the TensorZero Evaluations CLI with Docker Compose to keep things simple.
```yaml title="docker-compose.yml" theme={null}
services:
evaluations:
profiles: [evaluations] # this service won't run by default with `docker compose up`
image: tensorzero/evaluations
volumes:
- ./config:/app/config:ro
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
# ... and any other relevant API credentials ...
- TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@clickhouse:8123/tensorzero
extra_hosts:
- "host.docker.internal:host-gateway"
depends_on:
clickhouse:
condition: service_healthy
```
```bash theme={null}
docker compose run --rm evaluations \
--evaluation-name haiku_eval \
--dataset-name haiku_dataset \
--variant-name gpt_4o \
--concurrency 5
```
You can build the TensorZero Evaluations CLI from source if necessary. See our [GitHub repository](https://github.com/tensorzero/tensorzero/tree/main/evaluations) for instructions.
### Inference Caching
TensorZero Evaluations uses [Inference Caching](/gateway/guides/inference-caching/) to improve inference speed and cost.
By default, it will read from and write to the inference cache.
You can customize this behavior with the `--inference-cache` flag (see below).
### Environment Variables
#### `TENSORZERO_CLICKHOUSE_URL`
* **Example:** `TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@localhost:8123/database_name`
* **Required:** yes
This environment variable specifies the URL of your ClickHouse database.
#### Model Provider Credentials
* **Example:** `OPENAI_API_KEY=sk-...`
* **Required:** no
If you're using an external TensorZero Gateway (see `--gateway-url` flag below), you don't need to provide these credentials to the evaluations tool.
If you're using a built-in gateway (no `--gateway-url` flag), you must provide the same credentials the gateway would use.
See [Integrations](/integrations/model-providers) for more information.
### CLI Flags
#### `--adaptive-stopping-precision EVALUATOR=PRECISION[,...]`
* **Example:** `--adaptive-stopping-precision exact_match=0.13,llm_judge=0.16`
* **Required:** no (default: none)
This flag enables adaptive stopping for specified evaluators by setting per-evaluator precision thresholds.
An evaluator stops when both sides of its 95% confidence interval are within the threshold of its mean value.
You can specify multiple evaluators by separating them with commas.
Each evaluator's precision threshold should be a positive number.
If adaptive stopping is enabled for all evaluators, then the evaluation will stop once all evaluators have met their targets or all datapoints have been evaluated.
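For example, combined with the Docker Compose setup above (the evaluator names and thresholds below are illustrative and must match evaluators defined in your configuration):
```bash theme={null}
docker compose run --rm evaluations \
  --evaluation-name haiku_eval \
  --dataset-name haiku_dataset \
  --variant-name gpt_4o \
  --concurrency 5 \
  --adaptive-stopping-precision exact_match=0.13,llm_judge=0.16
```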
#### `--config-file PATH`
* **Example:** `--config-file /path/to/tensorzero.toml`
* **Required:** no (default: `./config/tensorzero.toml`)
This flag specifies the path to the TensorZero configuration file.
You should use the same configuration file for your entire project.
#### `--concurrency N` (`-c`)
* **Example:** `--concurrency 5`
* **Required:** no (default: `1`)
This flag specifies the maximum number of concurrent TensorZero inference requests during evaluation.
#### `--datapoint-ids ID[,ID,...]`
* **Example:** `--datapoint-ids 01957bbb-44a8-7490-bfe7-32f8ed2fc797,01957bbb-44a8-7490-bfe7-32f8ed2fc798`
* **Required:** Either `--dataset-name` or `--datapoint-ids` must be provided (but not both)
This flag allows you to specify individual datapoint IDs to evaluate.
Multiple IDs should be separated by commas.
Use this flag when you want to evaluate a specific subset of datapoints rather than an entire dataset.
This flag is mutually exclusive with `--dataset-name` and `--max-datapoints`.
You must provide either `--dataset-name` or `--datapoint-ids`, but not both.
#### `--dataset-name NAME` (`-d`)
* **Example:** `--dataset-name my_dataset`
* **Required:** Either `--dataset-name` or `--datapoint-ids` must be provided (but not both)
This flag specifies the dataset to use for evaluation.
The dataset should be stored in your ClickHouse database.
This flag is mutually exclusive with `--datapoint-ids`. You must provide
either `--dataset-name` or `--datapoint-ids`, but not both.
#### `--evaluation-name NAME` (`-e`)
* **Example:** `--evaluation-name my_evaluation`
* **Required:** yes
This flag specifies the name of the evaluation to run, as defined in your TensorZero configuration file.
#### `--format FORMAT` (`-f`)
* **Options:** `pretty`, `jsonl`
* **Example:** `--format jsonl`
* **Required:** no (default: `pretty`)
This flag specifies the output format for the evaluation CLI tool.
You can use the `jsonl` format if you want to programmatically process the evaluation results.
#### `--gateway-url URL`
* **Example:** `--gateway-url http://localhost:3000`
* **Required:** no (default: none)
If you provide this flag, the evaluations tool will use an external TensorZero Gateway for inference requests.
If you don't provide this flag, the evaluations tool will use a built-in TensorZero gateway.
In this case, the evaluations tool will require the same credentials the gateway would use.
See [Integrations](/integrations/model-providers) for more information.
#### `--inference-cache MODE`
* **Options:** `on`, `read_only`, `write_only`, `off`
* **Example:** `--inference-cache read_only`
* **Required:** no (default: `on`)
This flag specifies the behavior of the inference cache.
See [Inference Caching](/gateway/guides/inference-caching/) for more information.
#### `--max-datapoints N`
* **Example:** `--max-datapoints 100`
* **Required:** no
This flag specifies the maximum number of datapoints to evaluate from the dataset.
This flag can only be used with `--dataset-name`. It cannot be used with
`--datapoint-ids`.
#### `--variant-name NAME` (`-v`)
* **Example:** `--variant-name gpt_4o`
* **Required:** yes
This flag specifies the variant to evaluate.
The variant name should be present in your TensorZero configuration file.
### Exit Status
The evaluations process exits with a status code of `0` if the evaluation was successful, and a status code of `1` if the evaluation failed.
If you configure a `cutoff` for any of your evaluators, the evaluation will fail if the average score for any evaluator is below its cutoff.
The exit status code is helpful for integrating TensorZero Evaluations into your CI/CD pipeline.
You can define sanity checks for your variants with `cutoff` to detect performance regressions early before shipping to production.
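For example, a CI step might run the evaluation and rely on the exit status to fail the build when any evaluator falls below its `cutoff` (a sketch; adapt the command to your setup):
```bash theme={null}
#!/usr/bin/env bash
set -euo pipefail  # abort the CI job on a nonzero exit status

docker compose run --rm evaluations \
  --evaluation-name haiku_eval \
  --dataset-name haiku_dataset \
  --variant-name gpt_4o \
  --concurrency 5

echo "All evaluators met their cutoffs."
```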
# Configuration Reference
Source: https://www.tensorzero.com/docs/evaluations/inference-evaluations/configuration-reference
Learn how to configure TensorZero Evaluations.
The configuration for TensorZero Evaluations should go in the same `tensorzero.toml` file as the rest of your TensorZero configuration.
## `[evaluations.evaluation_name]`
The `evaluations` sub-section of the config file defines the behavior of an evaluation in TensorZero.
You can define multiple evaluations by including multiple `[evaluations.evaluation_name]` sections.
If your `evaluation_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define an evaluation named `foo.bar` as `[evaluations."foo.bar"]`.
```toml mark="email-guardrails" theme={null}
# tensorzero.toml
[evaluations.email-guardrails]
# ...
```
### `type`
* **Type:** Literal `"inference"` (we may add other options here later on)
* **Required:** yes
### `function_name`
* **Type:** string
* **Required:** yes
This should be the name of a function defined in the `[functions]` section of the gateway config.
This value sets which function this evaluation should evaluate when run.
### `description`
* **Type:** string
* **Required:** no
An optional description for this evaluation.
This can be used to document the purpose of the evaluation for automated workflows.
### `[evaluations.evaluation_name.evaluators.evaluator_name]`
The `evaluators` sub-section defines the behavior of a particular evaluator that will be run as part of its parent evaluation.
You can define multiple evaluators by including multiple `[evaluations.evaluation_name.evaluators.evaluator_name]` sections.
If your `evaluator_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define `includes.jpg` as `[evaluations.evaluation_name.evaluators."includes.jpg"]`.
```toml mark="draft-email" theme={null}
# tensorzero.toml
[evaluations.email-guardrails]
# ...
[evaluations.email-guardrails.evaluators."includes.jpg"]
# ...
[evaluations.email-guardrails.evaluators.check-signature]
# ...
```
#### `type`
* **Type:** string
* **Required:** yes
Defines the type of the evaluator.
TensorZero currently supports the following evaluator types:
| Type | Description |
| :------------ | ------------------------------------------------------------------------------------------------------------------ |
| `llm_judge` | Uses a TensorZero function as a judge. |
| `exact_match` | Evaluates whether the generated output exactly matches the reference output (skips the datapoint if unavailable). |
```toml theme={null}
# tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
# ...
```
###### `cutoff`
* **Type:** float
* **Required:** no
Sets a user-defined threshold at which the test passes.
This can be useful for applications where the evaluations are run as an automated test.
If the average value of this evaluator is below the cutoff, the evaluations binary will return a nonzero status code.
###### `input_format`
* **Type:** string
* **Required:** no (default: `serialized`)
Defines the format of the input provided to the LLM judge.
* `serialized`: Passes the input messages, generated output, and reference output (if included) as a single serialized string.
* `messages`: Passes the input messages, generated output, and reference output (if included) as distinct messages in the conversation history.
We only support evaluations with image data when `input_format` is set to `messages`.
```toml theme={null}
# tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
input_format = "messages"
# ...
```
###### `output_type`
* **Type:** string
* **Required:** yes
Defines the expected data type of the evaluation result from the LLM judge.
* `float`: The judge is expected to return a floating-point number.
* `boolean`: The judge is expected to return a boolean value.
```toml theme={null}
# tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
output_type = "float"
# ...
```
###### `include.reference_output`
* **Type:** boolean
* **Required:** no (default: `false`)
If set to `true`, the reference output associated with the evaluation datapoint will be included in the input provided to the LLM judge.
In that case, the evaluation run will skip this evaluator for datapoints that have no reference output.
```toml theme={null}
# tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
include = { reference_output = true }
# ...
```
###### `optimize`
* **Type:** string
* **Required:** yes
Defines whether the metric produced by the LLM judge should be maximized or minimized.
* `max`: Higher values are better.
* `min`: Lower values are better.
```toml theme={null}
# tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"
# ...
```
###### `cutoff`
* **Type:** float
* **Required:** no
Sets a user-defined threshold at which the test passes.
This may be useful for applications where the evaluations are run as an automated test.
If the average value of this evaluator is below the cutoff (when `optimize` is `max`) or above the cutoff (when `optimize` is `min`), the evaluations binary will return a nonzero status code.
```toml theme={null}
# tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max" # Example: Maximize score
cutoff = 0.8 # Example: Consider passing if average score is >= 0.8
# ...
```
###### `description`
* **Type:** string
* **Required:** no
An optional description for this evaluator.
This can be used to document the purpose of the evaluator for automated workflows.
###### `[evaluations.evaluation_name.evaluators.evaluator_name.variants.variant_name]`
An LLM Judge evaluator defines a TensorZero function that is used to judge the output of another TensorZero function.
Therefore, all the variant types that are available for a normal TensorZero function are also available for LLMs as judges — including all of our [inference-time optimizations](/gateway/guides/inference-time-optimizations/).
You can include a standard [variant configuration](/gateway/configuration-reference/#functionsfunction_namevariantsvariant_name) in this block, with two modifications:
* You must mark a single variant as `active`.
* For `chat_completion` variants, we require `system_instructions` (a path to a text file) instead of a `system_template`, and no other templates are supported.
Here we list only the configuration for variants that differs from the configuration for a normal TensorZero function. Please refer to the [variant configuration reference](/gateway/configuration-reference/#functionsfunction_namevariantsvariant_name) for the remaining options.
```toml theme={null}
# tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"
[evaluations.email-guardrails.evaluators.check-signature.variants.claude_sonnet_4_5]
type = "chat_completion"
model = "anthropic::claude-sonnet-4-5"
temperature = 0.1
system_instructions = "./evaluations/email-guardrails/check-signature/system_instructions.txt"
# ... other chat completion configuration ...
[evaluations.email-guardrails.evaluators.check-signature.variants.mix_of_3]
active = true # if we run the `email-guardrails` evaluation, this is the variant we'll use for the check-signature evaluator
type = "experimental_mixture_of_n"
candidates = ["claude_sonnet_4_5", "claude_sonnet_4_5", "claude_sonnet_4_5"]
```
###### `active`
* **Type**: boolean
* **Required**: Defaults to `true` if there is a single variant configured. Otherwise, this field is required to be set to `true` for exactly one variant.
Sets which of the variants should be used for evaluation runs.
```toml theme={null}
# tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
[evaluations.email-guardrails.evaluators.check-signature.variants.mix_of_3]
active = true # if we run the `email-guardrails` evaluation, this is the variant we'll use for the check-signature evaluator
type = "experimental_mixture_of_n"
# ...
```
###### `system_instructions`
* **Type:** string (path)
* **Required**: yes
Defines the path to the system instructions file.
This path is relative to the configuration file.
This file should contain the system instructions for the LLM judge.
These instructions should tell the judge to output a float or boolean value.
We use JSON mode to enforce that the judge returns a JSON object of the form `{"thinking": "...", "score": ...}`, where the type of `score` matches the evaluator's `output_type`.
```text title="evaluations/email-guardrails/check-signature/claude_sonnet_4_5/system_instructions.txt" theme={null}
Evaluate whether the generated email ends with an appropriate signature (e.g. a sign-off followed by the sender's name). Verify only the presence of a signature without making other content assumptions.
```
```toml theme={null}
# tensorzero.toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
system_instructions = "./evaluations/email-guardrails/check-signature/claude_sonnet_4_5/system_instructions.txt"
# ...
```
# Tutorial: Inference Evaluations
Source: https://www.tensorzero.com/docs/evaluations/inference-evaluations/tutorial
Learn how to use TensorZero Inference Evaluations to build principled LLM-powered applications.
This guide shows how to define and run inference evaluations for your TensorZero functions.
See our [Quickstart](/quickstart/) to learn how to set up our LLM gateway, observability, and fine-tuning — in just 5 minutes.
**You can find the code behind this tutorial and instructions on how to run it on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/evaluations/tutorial).**
Reach out on [Slack](https://www.tensorzero.com/slack) or [Discord](https://www.tensorzero.com/discord) if you have any questions. We'd be happy to help!
## Status Quo
Imagine we have a TensorZero function for writing haikus about a given topic, and want to compare the behavior of GPT-4o and GPT-4o Mini on this task.
Initially, our configuration for this function might look like:
```toml theme={null}
[functions.write_haiku]
type = "chat"
user_schema = "functions/write_haiku/user_schema.json"
[functions.write_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
user_template = "functions/write_haiku/user_template.minijinja"
[functions.write_haiku.variants.gpt_4o]
type = "chat_completion"
model = "openai::gpt-4o"
user_template = "functions/write_haiku/user_template.minijinja"
```
```json title="functions/write_haiku/user_schema.json" theme={null}
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"topic": {
"type": "string"
}
},
"required": ["topic"],
"additionalProperties": false
}
```
```text title="functions/write_haiku/user_template.minijinja" theme={null}
Write a haiku about: {{ topic }}
```
How can we evaluate the behavior of our two variants in a principled way?
One option is to build a dataset of "test cases" that we can evaluate them against.
## Datasets
To use TensorZero Evaluations, you first need to build a dataset.
A dataset is a collection of datapoints.
Each datapoint has an input and optionally an output.
In the context of evaluations, the output in the dataset should be a reference output, i.e. the output you'd have liked to see.
You don't necessarily need to provide a reference output: some evaluators (e.g. LLM judges) can score generated outputs without a reference output (otherwise, that datapoint is skipped).
Let's create a dataset:
1. Generate many haikus by running inference on your `write_haiku` function. (On **[GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/evaluations/tutorial)**, we provide a script `main.py` that generates 100 haikus with `write_haiku`.)
2. Open the UI, navigate to "Datasets", and select "Build Dataset" (`http://localhost:4000/datasets/builder`).
3. Create a new dataset called `haiku_dataset`.
Select your `write_haiku` function, "None" as the metric, and "Inference" as the dataset output.
See the [Datasets & Datapoints API Reference](/gateway/api-reference/datasets-datapoints/) to learn how to create and manage datasets programmatically.
## Evaluations
Evaluations test the behavior of variants for a TensorZero function.
Let's define an evaluation in our configuration file:
```toml theme={null}
[evaluations.haiku_eval]
type = "inference"
function_name = "write_haiku"
```
## Evaluators
Each evaluation has one or more evaluators: a rule or behavior you'd like to test.
Today, TensorZero supports two types of evaluators: `exact_match` and `llm_judge`.
We're planning to release other types of evaluators soon (e.g. semantic similarity in an embedding space).
### `exact_match`
The `exact_match` evaluator compares the generated output with the datapoint's reference output.
If they are identical, it returns true; otherwise, it returns false.
```toml theme={null}
[evaluations.haiku_eval.evaluators.exact_match]
type = "exact_match"
```
### `llm_judge`
LLM judges are special-purpose TensorZero functions that can be used to evaluate a TensorZero function.
For example, our haikus should generally follow a specific format, but it's hard to define a heuristic to determine if they're correct.
Why not ask an LLM?
Let's do that:
```toml theme={null}
[evaluations.haiku_eval.evaluators.valid_haiku]
type = "llm_judge"
output_type = "boolean" # LLM judge should generate a boolean (or float)
optimize = "max" # higher is better
cutoff = 0.95 # if the variant scores <95% = bad
[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
```
```text title="evaluations/haiku_eval/valid_haiku/system_instructions.txt" theme={null}
Evaluate if the text follows the haiku structure of exactly three lines with a 5-7-5 syllable pattern, totaling 17 syllables. Verify only this specific syllable structure of a haiku without making content assumptions.
```
Here, we defined an evaluator `valid_haiku` of type `llm_judge`, with a variant that uses GPT-4o Mini.
Similar to regular TensorZero functions, we can define multiple variants for an LLM judge.
But unlike regular functions, only one variant can be active at a time during evaluation; you can denote that with the `active` property.
```toml theme={null}
[evaluations.haiku_eval.evaluators.valid_haiku]
type = "llm_judge"
output_type = "boolean"
optimize = "max"
cutoff = 0.95
[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
active = true
[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_judge]
type = "chat_completion"
model = "openai::gpt-4o"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
```
The LLM judge we showed above generates a boolean, but judges can also generate floats.
Let's define another evaluator that counts the number of metaphors in our haiku.
```toml {3} theme={null}
[evaluations.haiku_eval.evaluators.metaphor_count]
type = "llm_judge"
output_type = "float" # LLM judge should generate a boolean (or float)
optimize = "max"
cutoff = 1 # <1 metaphor per haiku = bad
```
We can also use different variant types for evaluators.
Let's use a chain-of-thought variant for our metaphor count evaluator, since it's a bit more complex.
```toml {2} theme={null}
[evaluations.haiku_eval.evaluators.metaphor_count.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/metaphor_count/system_instructions.txt"
json_mode = "strict"
```
```text title="evaluations/haiku_eval/metaphor_count/system_instructions.txt" theme={null}
How many metaphors does the generated haiku have?
```
The LLM judges we've defined so far only look at the datapoint's input and the generated output.
But we can also provide the datapoint's reference output to the judge:
```toml {3} theme={null}
[evaluations.haiku_eval.evaluators.compare_haikus]
type = "llm_judge"
include = { reference_output = true } # include the reference output in the LLM judge's context
output_type = "boolean"
optimize = "max"
[evaluations.haiku_eval.evaluators.compare_haikus.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/compare_haikus/system_instructions.txt"
json_mode = "strict"
```
```text title="evaluations/haiku_eval/compare_haikus/system_instructions.txt" theme={null}
Does the generated haiku include the same figures of speech as the reference haiku?
```
## Running an Evaluation
Let's run our evaluations!
You can run evaluations using the TensorZero Evaluations CLI tool or the TensorZero UI.
The TensorZero Evaluations CLI tool can be helpful for CI/CD.
It'll exit with code 0 if every evaluator's average score meets its `cutoff`, and code 1 otherwise.
By default, TensorZero Evaluations uses [Inference Caching](/gateway/guides/inference-caching/) to improve inference speed and cost.
### CLI
To run evaluations in the CLI, you can use the `tensorzero/evaluations` container:
```bash theme={null}
docker compose run --rm evaluations \
--evaluation-name haiku_eval \
--dataset-name haiku_dataset \
--variant-name gpt_4o \
--concurrency 5
```
Here's the relevant section of the `docker-compose.yml` for the evaluations tool.
You should provide credentials for any LLM judges.
Alternatively, the evaluations tool can use an external TensorZero Gateway with the `--gateway-url http://gateway:3000` flag.
```yaml theme={null}
services:
# ...
evaluations:
profiles: [evaluations]
image: tensorzero/evaluations
volumes:
- ./config:/app/config:ro
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
# ... and any other relevant API credentials ...
- TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@clickhouse:8123/tensorzero
extra_hosts:
- "host.docker.internal:host-gateway"
depends_on:
clickhouse:
condition: service_healthy
# ...
```
See [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/evaluations/tutorial) for the complete Docker Compose configuration.
Docker Compose does *not* start this service with `docker compose up` since we have `profiles: [evaluations]`.
You need to launch it explicitly with `docker compose run evaluations` when you want to run an evaluation.
### UI
To run evaluations in the UI, navigate to "Evaluations" (`http://localhost:4000/evaluations`) and select "New Run".
You can compare multiple evaluation runs in the TensorZero UI (including evaluation runs from the CLI).
# API Reference: Workflow Evaluations
Source: https://www.tensorzero.com/docs/evaluations/workflow-evaluations/api-reference
API reference for workflow evaluations in TensorZero.
Workflow Evaluations focus on evaluating complex workflows that might include multiple TensorZero inference calls, arbitrary application logic, and more.
You can initialize and run workflow evaluations using the TensorZero Gateway, either through the TensorZero client or the gateway's HTTP API.
Unlike inference evaluations, workflow evaluations are not defined in the TensorZero configuration file.
See the [Workflow Evaluations Tutorial](/evaluations/workflow-evaluations/tutorial/) for a step-by-step guide.
## Endpoints & Methods
### Starting a dynamic evaluation run
* **Gateway Endpoint:** `POST /dynamic_evaluation_run`
* **Client Method:** `dynamic_evaluation_run`
* **Parameters:**
* `variants`: an object (dictionary) mapping function names to variant names
* `project_name` (string, optional): the name of the project to associate the run with
* `display_name` (string, optional): the display (human-readable) name of the run
* `tags` (dictionary, optional): a dictionary of key-value pairs to tag the run's inferences with
* **Returns:**
* `run_id` (UUID): the ID of the run
### Starting an episode in a dynamic evaluation run
* **Gateway Endpoint:** `POST /dynamic_evaluation_run/{run_id}/episode`
* **Client Method:** `dynamic_evaluation_run_episode`
* **Parameters:**
* `run_id` (UUID): the ID of the run generated by the `dynamic_evaluation_run` method
* `task_name` (string, optional): the name of the task to associate the episode with
* `tags` (dictionary, optional): a dictionary of key-value pairs to tag the episode's inferences with
* **Returns:**
* `episode_id` (UUID): the ID of the episode
### Making inference and feedback calls during a dynamic evaluation run
After initializing a run and an episode, you can make inference and feedback API calls like you normally would.
If you provide the special `episode_id` parameter generated by the `dynamic_evaluation_run_episode` method, the TensorZero Gateway will associate the inference and feedback with the evaluation run, handle variant pinning, and more.
# Tutorial: Workflow Evaluations
Source: https://www.tensorzero.com/docs/evaluations/workflow-evaluations/tutorial
Learn how to use TensorZero Workflow Evaluations to build principled LLM-powered applications.
Workflow evaluations enable you to evaluate complex workflows that combine multiple inference calls with arbitrary application logic.
Here, we'll walk through a stylized RAG workflow to illustrate the process of setting up and running a dynamic evaluation, but the same process can be applied to any complex workflow.
Imagine we have the following LLM-powered workflow in response to a natural-language question from a user:
1. Inference: Call the `generate_database_query` TensorZero function to generate a database query from the user's question.
2. Custom Logic: Run the database query against a database and retrieve the results (`my_blackbox_search_function`).
3. Inference: Call the `generate_final_answer` TensorZero function to generate an answer from the retrieved results.
4. Custom Logic: Score the answer using a custom scoring function (`my_blackbox_scoring_function`).
5. Feedback: Send feedback using the `task_success` metric.
Evaluating `generate_database_query` and `generate_final_answer` in a vacuum (i.e. using inference evaluations) can also be helpful, but ultimately we want to evaluate the entire workflow end-to-end.
This is where workflow evaluations come in.
Complex LLM applications might need to make multiple LLM calls and execute arbitrary code before giving an overall result.
In agentic applications, the workflow might even be defined dynamically at runtime based on the user's input, the results of the LLM calls, or other factors.
Workflow evaluations in TensorZero provide complete flexibility and enable you to evaluate the entire workflow jointly.
You can think of them like integration tests for your LLM applications.
For a more complex, runnable example, see the [Workflow Evaluations for Agentic RAG Example on GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/dynamic_evaluations/simple-agentic-rag).
## Starting a dynamic evaluation run
Evaluating the workflow above involves tackling and evaluating a collection of tasks (e.g. user queries).
Each individual task corresponds to an *episode*, and the collection of these episodes is a *dynamic evaluation run*.
First, let's initialize the TensorZero client (just like you would for typical inference requests):
```python theme={null}
from tensorzero import TensorZeroGateway
# Initialize the client with `build_http` or `build_embedded`
with TensorZeroGateway.build_http(
gateway_url="http://localhost:3000",
) as t0:
# ...
```
Now you can start a dynamic evaluation run.
During a dynamic evaluation run, you specify which variants you want to pin during the run (i.e. the set of variants you want to evaluate).
This allows you to see the effects of different combinations of variants on the end-to-end system's performance.
You don't have to specify a variant for every function you use; if you don't specify a variant, the TensorZero Gateway will sample a variant for you as it normally would.
You can optionally also specify a `project_name` and `display_name` for the run.
If you specify a `project_name`, you'll be able to compare this run against other runs for that project using the TensorZero UI.
The `display_name` is a human-readable identifier for the run that you can use to identify the run in the TensorZero UI.
```python theme={null}
run_info = t0.dynamic_evaluation_run(
# Assume we have these variants defined in our `tensorzero.toml` configuration file
variants={
"generate_database_query": "o4_mini_prompt_baseline",
"generate_final_answer": "gpt_4o_updated_prompt",
},
project_name="simple_rag_project",
display_name="generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
)
```
The TensorZero client automatically tags your dynamic evaluation runs with information about your Git repository if available (e.g. branch name, commit hash).
This metadata is displayed in the TensorZero UI so that you have a record of the code that was used to run the dynamic evaluation.
We recommend that you commit your changes before running a dynamic evaluation so that the Git state is accurately captured.
With the asynchronous client, the flow is the same. First, initialize the client:
```python theme={null}
from tensorzero import AsyncTensorZeroGateway
# Initialize the client with `build_http` or `build_embedded`
async with await AsyncTensorZeroGateway.build_http(
gateway_url="http://localhost:3000",
) as t0:
# ...
```
Then start the dynamic evaluation run:
```python theme={null}
run_info = await t0.dynamic_evaluation_run(
# Assume we have these variants defined in our `tensorzero.toml` configuration file
variants={
"generate_database_query": "o4_mini_prompt_baseline",
"generate_final_answer": "gpt_4o_updated_prompt",
},
project_name="simple_rag_project",
display_name="generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt",
)
```
You can also start a dynamic evaluation run directly over HTTP:
```bash theme={null}
curl -X POST http://localhost:3000/dynamic_evaluation_run \
-H "Content-Type: application/json" \
-d '{
"variants": {
"generate_database_query": "o4_mini_prompt_baseline",
"generate_final_answer": "gpt_4o_updated_prompt"
},
"project_name": "simple_rag_project",
"display_name": "generate_database_query::o4_mini_prompt_baseline;generate_final_answer::gpt_4o_updated_prompt"
}'
```
## Starting an episode in a dynamic evaluation run
For each task we want to include in our dynamic evaluation run, we need to start an episode.
For example, in our agentic RAG project, each episode will correspond to a user query from our dataset; each user query requires multiple inference calls and application logic to run.
To initialize an episode, you need to provide the `run_id` of the dynamic evaluation run you want to include the episode in.
You can optionally also specify a `task_name` for the episode.
If you specify a `task_name`, you'll be able to compare this episode against episodes for that task from other runs using the TensorZero UI.
We encourage you to use the `task_name` to provide a meaningful identifier for the task that the episode is tackling.
```python theme={null}
episode_info = t0.dynamic_evaluation_run_episode(
run_id=run_info.run_id,
task_name="user_query_123",
)
```
Now we can use `episode_info.episode_id` to make inference and feedback calls.
With the asynchronous client:
```python theme={null}
episode_info = await t0.dynamic_evaluation_run_episode(
run_id=run_info.run_id,
task_name="user_query_123",
)
```
Over HTTP:
```bash theme={null}
curl -X POST http://localhost:3000/dynamic_evaluation_run/{run_id}/episode \
-H "Content-Type: application/json" \
-d '{
"task_name": "user_query_123"
}'
```
In each case, use the returned episode ID to make inference and feedback calls during the run.
## Making inference and feedback calls during a dynamic evaluation run
See our [Quickstart](/quickstart/) to learn how to set up our LLM gateway, observability, and fine-tuning — in just 5 minutes.
You can also use the OpenAI SDK for inference calls.
See the [Quickstart](/quickstart/) for more details.
(Similarly, you can also use workflow evaluations with any framework or agent that is OpenAI-compatible by passing along the episode ID and function name in the request to TensorZero.)
```python theme={null}
generate_database_query_response = t0.inference(
function_name="generate_database_query",
episode_id=episode_info.episode_id,
input={ ... },
)
search_result = my_blackbox_search_function(generate_database_query_response)
generate_final_answer_response = t0.inference(
function_name="generate_final_answer",
episode_id=episode_info.episode_id,
input={ ... },
)
task_success_score = my_blackbox_scoring_function(generate_final_answer_response)
t0.feedback(
metric_name="task_success",
episode_id=episode_info.episode_id,
value=task_success_score,
)
```
```python theme={null}
generate_database_query_response = await t0.inference(
function_name="generate_database_query",
episode_id=episode_info.episode_id,
input={ ... },
)
search_result = my_blackbox_search_function(generate_database_query_response)
generate_final_answer_response = await t0.inference(
function_name="generate_final_answer",
episode_id=episode_info.episode_id,
input={ ... },
)
task_success_score = my_blackbox_scoring_function(generate_final_answer_response)
await t0.feedback(
metric_name="task_success",
episode_id=episode_info.episode_id,
value=task_success_score,
)
```
```bash theme={null}
# First inference call
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "generate_database_query",
"episode_id": "00000000-0000-0000-0000-000000000000",
"input": { ... }
}'
# Run your custom search function with the result...
my_blackbox_search_function(...)
# Second inference call
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "generate_final_answer",
"episode_id": "00000000-0000-0000-0000-000000000000",
"input": { ... }
}'
# Run your custom scoring function with the result...
my_blackbox_scoring_function(...)
# Feedback call
curl -X POST http://localhost:3000/feedback \
-H "Content-Type: application/json" \
-d '{
"metric_name": "task_success",
"episode_id": "00000000-0000-0000-0000-000000000000",
"value": 0.85
}'
```
## Visualizing evaluation results in the TensorZero UI
Once you finish running all the relevant episodes for your dynamic evaluation run, you can visualize the results in the TensorZero UI.
In the UI, you can compare metrics across evaluation runs, inspect individual episodes and inferences, and more.
# Run adaptive A/B tests
Source: https://www.tensorzero.com/docs/experimentation/run-adaptive-ab-tests
Learn how to use experimentation to test and iterate on your LLM applications with confidence.
You can set up adaptive A/B tests with the TensorZero Gateway to automatically distribute inference requests to the best performing variants (prompts, models, etc.) of your system.
TensorZero supports any number of variants in an adaptive A/B test.
In simple terms, you define:
* A [TensorZero function](/gateway/configure-functions-and-variants) (a task or agent)
* A set of candidate [variants](/gateway/configure-functions-and-variants) (prompts, models, etc.) to experiment with
* A [metric](/gateway/guides/metrics-feedback) to optimize for
And TensorZero takes care of the rest.
TensorZero's experimentation algorithm is designed to efficiently find the best variant of the system with a specified level of confidence.
You can add more variants over time and TensorZero will adjust the experiment accordingly while maintaining its statistical soundness.
You don't need to choose the sample size or experiment duration up front.
TensorZero will automatically detect when there are enough samples to identify the best variant.
Once it has done so, it will use that variant for all subsequent inferences.
Learn more about adaptive A/B testing for LLMs in our blog post [Bandits in your LLM Gateway: Improve LLM Applications Faster with Adaptive Experimentation (A/B Testing)](https://www.tensorzero.com/blog/bandits-in-your-llm-gateway/).
## Configure
Let's set up an adaptive A/B test with TensorZero.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/experimentation/run-adaptive-ab-tests) of this guide on GitHub.
Let's configure a function ("task") with two variants (`gpt-5-mini` with two different prompts), a metric to optimize for, and the experimentation configuration.
```toml title="tensorzero.toml" theme={null}
# Define a function for the task we're tackling
[functions.extract_entities]
type = "json"
output_schema = "output_schema.json"
# Define variants to experiment with (here, we have two different prompts)
[functions.extract_entities.variants.gpt-5-mini-good-prompt]
type = "chat_completion"
model = "openai::gpt-5-mini"
templates.system.path = "good_system_template.minijinja"
json_mode = "strict"
[functions.extract_entities.variants.gpt-5-mini-bad-prompt]
type = "chat_completion"
model = "openai::gpt-5-mini"
templates.system.path = "bad_system_template.minijinja"
json_mode = "strict"
# Define the experiment configuration
[functions.extract_entities.experimentation]
type = "track_and_stop" # the experimentation algorithm
candidate_variants = ["gpt-5-mini-good-prompt", "gpt-5-mini-bad-prompt"]
metric = "exact_match"
update_period_s = 60 # low for the sake of the demo (recommended: 300)
# Define the metric we're optimizing for
[metrics.exact_match]
type = "boolean"
level = "inference"
optimize = "max"
```
You must set up Postgres to use TensorZero's automated experimentation features.
* [Deploy the TensorZero Gateway](/deployment/tensorzero-gateway)
* [Deploy the TensorZero UI](/deployment/tensorzero-ui)
* [Deploy ClickHouse](/deployment/clickhouse)
* [Deploy Postgres](/deployment/postgres)
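As a minimal local sketch (illustrative credentials; see the deployment guides above for production setups), you could run Postgres with Docker and point the gateway at it via `TENSORZERO_POSTGRES_URL`:
```bash theme={null}
# Run Postgres locally (illustrative credentials)
docker run -d --name postgres \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=tensorzero \
  -p 5432:5432 \
  postgres:17

# Point the TensorZero Gateway at it
export TENSORZERO_POSTGRES_URL="postgres://postgres:postgres@localhost:5432/tensorzero"
```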
Make an inference request just like you normally would and keep track of the inference ID or episode ID.
You can use the TensorZero Inference API or the OpenAI-compatible Inference API.
```python theme={null}
response = t0.inference(
function_name="extract_entities",
input={
"messages": [
{
"role": "user",
"content": datapoint.input,
}
]
},
)
```
Send feedback for your metric and assign it to the inference ID or episode ID.
```python theme={null}
t0.feedback(
metric_name="exact_match",
value=True,
inference_id=response.inference_id,
)
```
That's it.
TensorZero will automatically adjust the distribution of inference requests between the two candidate variants based on their performance.
You can track the experiment in the TensorZero UI.
Visit the function's detail page to see the variant weights and the estimated performance.
If you run the code example, TensorZero starts by splitting traffic between the two variants but quickly starts shifting more and more traffic towards the `gpt-5-mini-good-prompt` variant.
After a few hundred inferences, TensorZero becomes confident enough to declare it the winner and starts serving all the traffic to it.
You can add more variants at any time and TensorZero will adjust the experiment accordingly in a principled way.
## Advanced
### Configure fallback-only variants
In addition to `candidate_variants`, you can also specify `fallback_variants` in your configuration.
If a variant fails for any reason, TensorZero first resamples from `candidate_variants`.
Once they are exhausted, it attempts to use the first variant in `fallback_variants`; if that fails, it goes to the second fallback variant, etc.
Note that episodes that contain inferences that use different variants for the same function (e.g. as a result of a fallback) are not used by the adaptive A/B testing algorithm.
See the [Configuration Reference](/gateway/configuration-reference) for more details.
### Customize the experimentation algorithm
The `track_and_stop` algorithm has multiple parameters that can be customized.
For example, you can trade off the speed of the experiment with the statistical confidence of the results.
The default parameters are sensible for most use cases, but advanced users might want to customize them.
See the [Configuration Reference](/gateway/configuration-reference) for more details.
Two important parameters are `epsilon` and `delta`, which control a fundamental trade-off in experimentation: higher sensitivity and lower error rates require longer experiments.
For a discussion on `epsilon` and `delta`, see our blog post [Bandits in your LLM Gateway: Improve LLM Applications Faster with Adaptive Experimentation (A/B Testing)](https://www.tensorzero.com/blog/bandits-in-your-llm-gateway/).
# Run static A/B tests
Source: https://www.tensorzero.com/docs/experimentation/run-static-ab-tests
Learn how to use experimentation to test and iterate on your LLM applications with confidence.
You can configure the TensorZero Gateway to distribute inference requests between different variants (prompts, models, etc.) of a function (a "task" or "agent").
Variants enable you to experiment with different models, prompts, parameters, inference strategies, and more.
We recommend [running adaptive A/B tests](/experimentation/run-adaptive-ab-tests) if you have a metric you can optimize for.
## Configure multiple variants
If you specify multiple variants for a function, by default the gateway will sample between them with equal probability (uniform sampling).
For example, if you call the `draft_email` function below, the gateway will sample between the two variants at each inference with equal probability.
```toml theme={null}
[functions.draft_email]
type = "chat"
[functions.draft_email.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"
[functions.draft_email.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
```
During an episode, multiple inference requests to the same function will receive the same variant (unless fallbacks are necessary).
This consistent variant assignment acts as a randomized controlled experiment, providing the statistical foundation needed to make causal inferences about which configurations perform best.
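For example, using the HTTP API with the `draft_email` function above, a second request that reuses the `episode_id` returned by the first will be served by the same variant (the UUID below is a placeholder):
```bash theme={null}
# First inference: the gateway samples a variant and returns an `episode_id`
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "draft_email",
    "input": { "messages": [{ "role": "user", "content": "Draft a follow-up email." }] }
  }'

# Second inference: reuse the returned `episode_id` to stay on the same variant
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "draft_email",
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "input": { "messages": [{ "role": "user", "content": "Make it more formal." }] }
  }'
```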
## Configure candidate variants explicitly
You can explicitly specify which variants to sample uniformly from using `candidate_variants`.
```toml theme={null}
[functions.draft_email]
type = "chat"
[functions.draft_email.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"
[functions.draft_email.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
[functions.draft_email.variants.grok_4]
type = "chat_completion"
model = "xai::grok-4-0709"
[functions.draft_email.experimentation]
type = "uniform"
candidate_variants = ["gpt_5_mini", "claude_haiku_4_5"]
```
In this example, the gateway samples uniformly between `gpt_5_mini` and `claude_haiku_4_5` (50% each).
## Configure sampling weights for variants
You can configure weights for variants to control the probability of each variant being sampled.
This is particularly useful for canary tests where you want to gradually roll out a new variant to a small percentage of users.
```toml theme={null}
[functions.draft_email]
type = "chat"
[functions.draft_email.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"
[functions.draft_email.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
[functions.draft_email.experimentation]
type = "static_weights"
candidate_variants = {"gpt_5_mini" = 0.9, "claude_haiku_4_5" = 0.1}
```
In this example, 90% of episodes will be sampled from the `gpt_5_mini` variant and 10% will be sampled from the `claude_haiku_4_5` variant.
If the weights don't add up to 1, TensorZero will automatically normalize them and sample the variants accordingly.
For example, if a variant has weight 5 and another has weight 1, the first variant will be sampled 5/6 of the time (≈ 83.3%) and the second variant will be sampled 1/6 of the time (≈ 16.7%).
## Configure fallback-only variants
You can configure variants that are only used as fallbacks with `fallback_variants`.
You can use this field with both `uniform` and `static_weights` sampling.
```toml theme={null}
[functions.draft_email]
type = "chat"
[functions.draft_email.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"
[functions.draft_email.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
[functions.draft_email.variants.grok_4]
type = "chat_completion"
model = "xai::grok-4-0709"
[functions.draft_email.experimentation]
type = "static_weights"
candidate_variants = {"gpt_5_mini" = 0.9, "claude_haiku_4_5" = 0.1}
fallback_variants = ["grok_4"]
```
The gateway will first sample among the `candidate_variants`.
If all candidates fail, the gateway attempts each variant in `fallback_variants` in order.
See [Retries & Fallbacks](/gateway/guides/retries-fallbacks) for more information.
# Frequently Asked Questions
Source: https://www.tensorzero.com/docs/faq
Learn more about TensorZero: how it works, why we built it, and more.
**Next steps?**
The [Quickstart](/quickstart/) shows it's easy to set up an LLM application with TensorZero.
**Questions?**
Ask us on [Slack](https://www.tensorzero.com/slack) or [Discord](https://www.tensorzero.com/discord).
**Using TensorZero at work?**
Email us at [hello@tensorzero.com](mailto:hello@tensorzero.com) to set up a Slack or Teams channel with your team (free).
## Technical
TensorZero's proxy pattern makes it agnostic to the application's tech stack, isolated from the business logic, more composable with other tools, and easy to deploy and manage.
Many engineers are (correctly) wary of marginal latency from such a proxy, so we built the gateway from the ground up with performance in mind.
In [Benchmarks](/gateway/benchmarks/), it achieves sub-millisecond P99 latency overhead under extreme load.
This makes the gateway fast and lightweight enough to be unnoticeable even in the most demanding LLM applications, especially if deployed as a sidecar container.
The TensorZero Gateway was built from the ground up with performance in mind.
It was written in Rust 🦀 and optimizes many common bottlenecks by efficiently managing connections to model providers, pre-compiling schemas and templates, logging data asynchronously, and more.
It achieves \<1ms P99 latency overhead under extreme load.
In [Benchmarks](/gateway/benchmarks/), LiteLLM @ 100 QPS adds 25-100x+ more latency than the TensorZero Gateway @ 10,000 QPS.
ClickHouse is open source, [extremely fast](https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf), and versatile.
It supports diverse storage backends, query patterns, and data types, including vector search (which will be important for upcoming TensorZero features).
From the start, we designed TensorZero to be easy to deploy but able to grow to massive scale.
ClickHouse is the best tool for the job.
## Project
We're a small technical team based in NYC. [Work with us →](https://www.tensorzero.com/jobs/)
#### Founders
[Viraj Mehta](https://virajm.com) (CTO) recently completed his PhD from CMU, with an emphasis on reinforcement learning for LLMs and nuclear fusion, and previously worked in machine learning at KKR and a fintech startup; he holds a BS in math and an MS in computer science from Stanford.
[Gabriel Bianconi](https://www.gabrielbianconi.com) (CEO) was the chief product officer at Ondo Finance (\$20B+ valuation in 2024) and previously spent years consulting on machine learning for companies ranging from early-stage tech startups to some of the largest financial firms; he holds BS and MS degrees in computer science from Stanford.
TensorZero is open source under the permissive [Apache 2.0 License](https://github.com/tensorzero/tensorzero/blob/main/LICENSE).
We don't currently monetize TensorZero.
We're lucky to have investors who are aligned with our long-term vision, so we're able to focus on building and snooze this question for a while.
We're inspired by companies like Databricks and ClickHouse.
One day, we'll launch a managed service that further streamlines LLM engineering, especially in enterprise settings, but open source will always be at the core of our business.
# API Reference: Batch Inference
Source: https://www.tensorzero.com/docs/gateway/api-reference/batch-inference
API reference for the Batch Inference endpoints.
The `/batch_inference` endpoints allow users to take advantage of batched inference offered by LLM providers.
These inferences are often substantially cheaper than the synchronous APIs.
The handling and eventual data model for inferences made through this endpoint are equivalent to those made through the main `/inference` endpoint with a few exceptions:
* The batch samples a single variant from the function being called.
* There are no fallbacks or retries for batched inferences.
* Only variants of type `chat_completion` are supported.
* Caching is not supported.
* The `dryrun` setting is not supported.
* Streaming is not supported.
Under the hood, the gateway validates all of the requests, samples a single variant from the function being called, handles templating when applicable, and routes the inference to the appropriate model provider.
In the batch endpoint there are no fallbacks as the requests are processed asynchronously.
The typical workflow is to first use the `POST /batch_inference` endpoint to submit a batch of requests.
Later, you can poll the `GET /batch_inference/{batch_id}` or `GET /batch_inference/:batch_id/inference/:inference_id` endpoint to check the status of the batch and retrieve results.
Each poll will return either a pending or failed status or the results of the batch.
Even after a batch has completed and been processed, you can continue to poll the endpoint as a way of retrieving the results.
The first time a batch has completed and been processed, the results are stored in the ChatInference, JsonInference, and ModelInference tables as with the `/inference` endpoint.
When polled again after the batch has finished, the gateway rehydrates the stored results into the expected response format.
See the [Batch Inference Guide](/gateway/guides/batch-inference/) for a simple example of using the batch inference endpoints.
## `POST /batch_inference`
### Request
#### `additional_tools`
* **Type:** list of lists of tools (see below)
* **Required:** no (default: no additional tools)
A list of lists of tools defined at inference time that the model is allowed to call.
This field allows for dynamic tool use, i.e. defining tools at runtime.
Each element in the outer list corresponds to a single inference in the batch.
Each inner list contains the tools that should be available to the corresponding inference.
You should prefer to define tools in the configuration file if possible.
Only use this field if dynamic tool use is necessary for your use case.
Each tool is an object with the following fields: `description`, `name`, `parameters`, and `strict`.
The fields are identical to those in the configuration file, except that the `parameters` field should contain the JSON schema itself rather than a path to it.
See [Configuration Reference](/gateway/configuration-reference/#toolstool_name) for more details.
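For illustration, here's a minimal sketch of dynamic tool use in a batch of two inferences. The `generate_weather_report` function and `get_temperature` tool are hypothetical, and the empty inner list assumes that an inference with no additional tools simply gets an empty entry:

```sh theme={null}
curl -X POST http://localhost:3000/batch_inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_weather_report",
    "inputs": [
      { "messages": [{ "role": "user", "content": "What is the weather in Tokyo?" }] },
      { "messages": [{ "role": "user", "content": "Write a short weather poem." }] }
    ],
    "additional_tools": [
      [
        {
          "name": "get_temperature",
          "description": "Get the current temperature for a location",
          "parameters": {
            "type": "object",
            "properties": { "location": { "type": "string" } },
            "required": ["location"]
          },
          "strict": false
        }
      ],
      []
    ]
  }'
```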
#### `allowed_tools`
* **Type:** list of lists of strings
* **Required:** no
A list of lists of tool names that the model is allowed to call.
The tools must be defined in the configuration file or provided dynamically via `additional_tools`.
Each element in the outer list corresponds to a single inference in the batch.
Each inner list contains the names of the tools that are allowed for the corresponding inference.
The names should be the configuration keys (e.g. `foo` from `[tools.foo]`), not the display names shown to the LLM (e.g. `bar` from `tools.foo.name = "bar"`).
Some providers (notably OpenAI) natively support restricting allowed tools.
For these providers, we send all tools (both configured and dynamic) to the provider, and separately specify which ones are allowed to be called.
For providers that do not natively support this feature, we filter the tool list ourselves and only send the allowed tools to the provider.
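As a brief sketch (assuming a hypothetical `generate_weather_report` function and tools named `get_temperature` and `get_humidity` defined in your configuration), you could restrict the two inferences in a batch to different tool subsets:

```sh theme={null}
curl -X POST http://localhost:3000/batch_inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_weather_report",
    "inputs": [
      { "messages": [{ "role": "user", "content": "What is the weather in Tokyo?" }] },
      { "messages": [{ "role": "user", "content": "How humid is it in Lima?" }] }
    ],
    "allowed_tools": [
      ["get_temperature"],
      ["get_temperature", "get_humidity"]
    ]
  }'
```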
#### `credentials`
* **Type:** object (a map from dynamic credential names to API keys)
* **Required:** no (default: no credentials)
Each model provider in your TensorZero configuration can be configured to accept credentials at inference time by using the `dynamic` location (e.g. `dynamic::my_dynamic_api_key_name`).
See the [configuration reference](/gateway/configuration-reference/#modelsmodel_nameprovidersprovider_name) for more details.
The gateway expects the credentials to be provided in the `credentials` field of the request body as specified below.
The gateway will return a 400 error if the credentials are not provided and the model provider has been configured with dynamic credentials.
```toml theme={null}
[models.my_model_name.providers.my_provider_name]
# ...
# Note: the name of the credential field (e.g. `api_key_location`) depends on the provider type
api_key_location = "dynamic::my_dynamic_api_key_name"
# ...
```
```json theme={null}
{
// ...
"credentials": {
// ...
"my_dynamic_api_key_name": "sk-..."
// ...
}
// ...
}
```
#### `episode_ids`
* **Type:** list of UUIDs
* **Required:** no
The IDs of existing episodes to associate the inferences with.
Each element in the list corresponds to a single inference in the batch.
You can provide `null` for episode IDs for elements that should start a fresh episode.
Only use episode IDs that were returned by the TensorZero gateway.
#### `function_name`
* **Type:** string
* **Required:** yes
The name of the function to call. This function will be the same for all inferences in the batch.
The function must be defined in the configuration file.
#### `inputs`
* **Type:** list of `input` objects (see below)
* **Required:** yes
The input to the function.
Each element in the list corresponds to a single inference in the batch.
##### `input[].messages`
* **Type:** list of messages (see below)
* **Required:** no (default: `[]`)
A list of messages to provide to the model.
Each message is an object with the following fields:
* `role`: The role of the message (`assistant` or `user`).
* `content`: The content of the message (see below).
The `content` field can have one of the following types:
* string: the text for a text message (only allowed if there is no schema for that role)
* list of content blocks: the content blocks for the message (see below)
A content block is an object with the field `type` and additional fields depending on the type.
If the content block has type `text`, it must have one of the following additional fields:
* `text`: The text for the content block.
* `arguments`: A JSON object containing the function arguments for TensorZero functions with templates and schemas (see [Create a prompt template](/gateway/create-a-prompt-template) for details).
If the content block has type `tool_call`, it must have the following additional fields:
* `arguments`: The arguments for the tool call.
* `id`: The ID for the content block.
* `name`: The name of the tool for the content block.
If the content block has type `tool_result`, it must have the following additional fields:
* `id`: The ID for the content block.
* `name`: The name of the tool for the content block.
* `result`: The result of the tool call.
If the content block has type `file`, it must have exactly one of the following additional fields:
* File URLs
* `file_type`: must be `url`
* `url`
* `mime_type` (optional): override the MIME type of the file
* `filename` (optional): a filename to associate with the file
* Base64-encoded Files
* `file_type`: must be `base64`
* `data`: `base64`-encoded data for an embedded file
* `mime_type` (optional): the MIME type (e.g. `image/png`, `image/jpeg`, `application/pdf`). If not provided, TensorZero will attempt to infer the MIME type from the file's magic bytes.
* `filename` (optional): a filename to associate with the file
See the [Multimodal Inference](/gateway/guides/multimodal-inference/) guide for more details on how to use images in inference.
If the content block has type `raw_text`, it must have the following additional fields:
* `value`: The text for the content block.
This content block will ignore any relevant templates and schemas for this function.
If the content block has type `thought`, it must have the following additional fields:
* `text`: The text for the content block.
If the content block has type `unknown`, it must have the following additional fields:
* `data`: The original content block from the provider, without any validation or transformation by TensorZero.
* `model_name` (string, optional): A model name in your configuration (e.g. `my_gpt_5`) or a short-hand model name (e.g. `openai::gpt-5`). If set, the content block will only be provided to this specific model.
* `provider_name` (string, optional): A provider name for the model you specified (e.g. `my_openai`). If set, the content block will only be provided to this specific provider for the model.
If neither `model_name` nor `provider_name` is set, the content block is passed to all model providers.
For example, the following hypothetical unknown content block will send the `daydreaming` content block to inference requests targeting the `your_provider_name` provider for `your_model_name`.
```json theme={null}
{
"type": "unknown",
"data": {
"type": "daydreaming",
"dream": "..."
},
"model_name": "your_model_name",
"provider_name": "your_provider_name"
}
```
This is the most complex field in the entire API. See the example below for more details.
```json theme={null}
{
// ...
"input": {
"messages": [
// If you don't have a user (or assistant) schema...
{
"role": "user", // (or "assistant")
"content": "What is the weather in Tokyo?"
},
// If you have a user (or assistant) schema...
{
"role": "user", // (or "assistant")
"content": [
{
"type": "text",
"arguments": {
"location": "Tokyo"
// ...
}
}
]
},
// If the model previously called a tool...
{
"role": "assistant",
"content": [
{
"type": "tool_call",
"id": "0",
"name": "get_temperature",
"arguments": "{\"location\": \"Tokyo\"}"
}
]
},
// ...and you're providing the result of that tool call...
{
"role": "user",
"content": [
{
"type": "tool_result",
"id": "0",
"name": "get_temperature",
"result": "70"
}
]
},
// You can also specify a text message using a content block...
{
"role": "user",
"content": [
{
"type": "text",
"text": "What about NYC?" // (or object if there is a schema)
}
]
},
// You can also provide multiple content blocks in a single message...
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "Sure, I can help you with that." // (or object if there is a schema)
},
{
"type": "tool_call",
"id": "0",
"name": "get_temperature",
"arguments": "{\"location\": \"New York\"}"
}
]
}
// ...
]
// ...
}
// ...
}
```
##### `input[].system`
* **Type:** string or object
* **Required:** no
The input for the system message.
If the function does not have a system schema, this field should be a string.
If the function has a system schema, this field should be an object that matches the schema.
#### `output_schemas`
* **Type:** list of optional objects (valid JSON Schema)
* **Required:** no
A list of JSON schemas that will be used to validate the output of the function for each inference in the batch.
Each element in the list corresponds to a single inference in the batch.
Elements can be `null` to use the `output_schema` defined in the function configuration.
This schema is used for validating the output of the function, and sent to providers which support structured outputs.
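For example, here's a sketch of a batch of two inferences for a hypothetical JSON function `extract_contact_info`, where the first inference overrides the output schema and the second keeps the schema from the function configuration:

```sh theme={null}
curl -X POST http://localhost:3000/batch_inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "extract_contact_info",
    "inputs": [
      { "messages": [{ "role": "user", "content": "Reach me at alice@example.com" }] },
      { "messages": [{ "role": "user", "content": "Call Bob at 555-0123" }] }
    ],
    "output_schemas": [
      {
        "type": "object",
        "properties": { "email": { "type": "string" } },
        "required": ["email"],
        "additionalProperties": false
      },
      null
    ]
  }'
```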
#### `parallel_tool_calls`
* **Type:** list of optional booleans
* **Required:** no
A list of booleans that indicate whether each inference in the batch should be allowed to request multiple tool calls in a single conversation turn.
Each element in the list corresponds to a single inference in the batch.
You can provide `null` for elements that should use the configuration value for the function being called.
If you don't provide this field at all, we default to the configuration value for the function being called.
Most model providers do not support parallel tool calls. In those cases, the gateway ignores this field.
At the moment, only Fireworks AI and OpenAI support parallel tool calls.
#### `params`
* **Type:** object (see below)
* **Required:** no (default: `{}`)
Override inference-time parameters for a particular variant type.
This field allows for dynamic inference parameters, i.e. defining parameters at runtime.
This field's format is `{ variant_type: { param: [value1, ...], ... }, ... }`.
You should prefer to set these parameters in the configuration file if possible.
Only use this field if you need to set these parameters dynamically at runtime.
If specified, each parameter should be a list with the same length as the batch size; individual elements may be `null`.
Note that the parameters will apply to every variant of the specified type.
Currently, we support the following:
* `chat_completion`
* `frequency_penalty`
* `json_mode`
* `max_tokens`
* `presence_penalty`
* `reasoning_effort`
* `seed`
* `service_tier`
* `stop_sequences`
* `temperature`
* `thinking_budget_tokens`
* `top_p`
* `verbosity`
See [Configuration Reference](/gateway/configuration-reference/#functionsfunction_namevariantsvariant_name) for more details on the parameters, and Examples below for usage.
For example, if you wanted to dynamically override the `temperature` parameter for a `chat_completion` variant for the first inference in a batch of 3, you'd include the following in the request body:
```json theme={null}
{
// ...
"params": {
"chat_completion": {
"temperature": [0.7, null, null]
}
}
// ...
}
```
#### `tags`
* **Type:** list of optional JSON objects with string keys and values
* **Required:** no
User-provided tags to associate with the inference.
Each element in the list corresponds to a single inference in the batch.
For example, `[{"user_id": "123"}, null]` or `[{"author": "Alice"}, {"author": "Bob"}]`.
#### `tool_choice`
* **Type:** list of optional strings
* **Required:** no
If set, overrides the tool choice strategy for the request.
Each element in the list corresponds to a single inference in the batch.
The supported tool choice strategies are:
* `none`: The function should not use any tools.
* `auto`: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
* `required`: The model should use a tool. If multiple tools are available, the model decides which tool to use.
* `{ specific = "tool_name" }`: The model should use a specific tool. The tool must be defined in the `tools` section of the configuration file or provided in `additional_tools`.
#### `variant_name`
* **Type:** string
* **Required:** no
If set, pins the batch inference request to a particular variant (not recommended).
You should generally not set this field, and instead let the TensorZero gateway assign a variant.
This field is primarily used for testing or debugging purposes.
### Response
For a POST request to `/batch_inference`, the response is a JSON object containing metadata that allows you to refer to the batch and poll it later on.
The response is an object with the following fields:
#### `batch_id`
* **Type:** UUID
The ID of the batch.
#### `inference_ids`
* **Type:** list of UUIDs
The IDs of the inferences in the batch.
#### `episode_ids`
* **Type:** list of UUIDs
The IDs of the episodes associated with the inferences in the batch.
### Example
Imagine you have a simple TensorZero function that generates haikus using GPT-4o Mini.
```toml theme={null}
[functions.generate_haiku]
type = "chat"
[functions.generate_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini-2024-07-18"
```
You can submit a batch inference job to generate multiple haikus with a single request.
Each entry in `inputs` is equal to the `input` field in a regular inference request.
```sh theme={null}
curl -X POST http://localhost:3000/batch_inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "generate_haiku",
"variant_name": "gpt_4o_mini",
"inputs": [
{
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
},
{
"messages": [
{
"role": "user",
"content": "Write a haiku about general aviation."
}
]
},
{
"messages": [
{
"role": "user",
"content": "Write a haiku about anime."
}
]
}
]
}'
```
The response contains a `batch_id` as well as `inference_ids` and `episode_ids` for each inference in the batch.
```json theme={null}
{
"batch_id": "019470f0-db4c-7811-9e14-6fe6593a2652",
"inference_ids": [
"019470f0-d34a-77a3-9e59-bcc66db2b82f",
"019470f0-d34a-77a3-9e59-bcdd2f8e06aa",
"019470f0-d34a-77a3-9e59-bcecfb7172a0"
],
"episode_ids": [
"019470f0-d34a-77a3-9e59-bc933973d087",
"019470f0-d34a-77a3-9e59-bca6e9b748b2",
"019470f0-d34a-77a3-9e59-bcb20177bf3a"
]
}
```
## `GET /batch_inference/:batch_id`
Both this and the following GET endpoint can be used to poll the status of a batch.
If you poll with only the batch ID, the entire batch will be returned if possible.
The response format depends on the function type as well as the batch status when polled.
### Pending
`{"status": "pending"}`
### Failed
`{"status": "failed"}`
### Completed
#### `status`
* **Type:** literal string `"completed"`
#### `batch_id`
* **Type:** UUID
#### `inferences`
* **Type:** list of objects that exactly match the response body in the inference endpoint documented [here](/gateway/api-reference/inference/#response).
### Example
Extending the example from above: you can use the `batch_id` to poll the status of this job:
```sh theme={null}
curl -X GET http://localhost:3000/batch_inference/019470f0-db4c-7811-9e14-6fe6593a2652
```
While the job is pending, the response will only contain the `status` field.
```json theme={null}
{
"status": "pending"
}
```
Once the job is completed, the response will contain the `status` field and the `inferences` field.
Each inference object is the same as the response from a regular inference request.
```json theme={null}
{
"status": "completed",
"batch_id": "019470f0-db4c-7811-9e14-6fe6593a2652",
"inferences": [
{
"inference_id": "019470f0-d34a-77a3-9e59-bcc66db2b82f",
"episode_id": "019470f0-d34a-77a3-9e59-bc933973d087",
"variant_name": "gpt_4o_mini",
"content": [
{
"type": "text",
"text": "Whispers of circuits, \nLearning paths through endless code, \nDreams in binary."
}
],
"usage": {
"input_tokens": 15,
"output_tokens": 19
}
},
{
"inference_id": "019470f0-d34a-77a3-9e59-bcdd2f8e06aa",
"episode_id": "019470f0-d34a-77a3-9e59-bca6e9b748b2",
"variant_name": "gpt_4o_mini",
"content": [
{
"type": "text",
"text": "Wings of freedom soar, \nClouds embrace the lonely flight, \nSky whispers adventure."
}
],
"usage": {
"input_tokens": 15,
"output_tokens": 20
}
},
{
"inference_id": "019470f0-d34a-77a3-9e59-bcecfb7172a0",
"episode_id": "019470f0-d34a-77a3-9e59-bcb20177bf3a",
"variant_name": "gpt_4o_mini",
"content": [
{
"type": "text",
"text": "Vivid worlds unfold, \nHeroes rise with dreams in hand, \nInk and dreams collide."
}
],
"usage": {
"input_tokens": 14,
"output_tokens": 20
}
}
]
}
```
## `GET /batch_inference/:batch_id/inference/:inference_id`
This endpoint can be used to poll the status of a single inference in a batch.
Since the polling involves pulling data on all the inferences in the batch, we also store the status of all of those inferences in ClickHouse.
The response format depends on the function type as well as the batch status when polled.
### Pending
`{"status": "pending"}`
### Failed
`{"status": "failed"}`
### Completed
#### `status`
* **Type:** literal string `"completed"`
#### `batch_id`
* **Type:** UUID
#### `inferences`
* **Type:** list containing a single object that exactly matches the response body in the inference endpoint documented [here](/gateway/api-reference/inference/#response).
### Example
Similar to above, we can also poll a particular inference:
```sh theme={null}
curl -X GET http://localhost:3000/batch_inference/019470f0-db4c-7811-9e14-6fe6593a2652/inference/019470f0-d34a-77a3-9e59-bcc66db2b82f
```
While the job is pending, the response will only contain the `status` field.
```json theme={null}
{
"status": "pending"
}
```
Once the job is completed, the response will contain the `status` field and the `inferences` field.
Unlike above, this request will return a list containing only the requested inference.
```json theme={null}
{
"status": "completed",
"batch_id": "019470f0-db4c-7811-9e14-6fe6593a2652",
"inferences": [
{
"inference_id": "019470f0-d34a-77a3-9e59-bcc66db2b82f",
"episode_id": "019470f0-d34a-77a3-9e59-bc933973d087",
"variant_name": "gpt_4o_mini",
"content": [
{
"type": "text",
"text": "Whispers of circuits, \nLearning paths through endless code, \nDreams in binary."
}
],
"usage": {
"input_tokens": 15,
"output_tokens": 19
}
}
]
}
```
# API Reference: Datasets & Datapoints
Source: https://www.tensorzero.com/docs/gateway/api-reference/datasets-datapoints
API reference for endpoints that manage datasets and datapoints.
In TensorZero, datasets are collections of data that can be used for workflows like evaluations and optimization recipes.
You can create and manage datasets using the TensorZero UI or programmatically using the TensorZero Gateway.
A dataset is a named collection of datapoints.
Each datapoint belongs to a function, with fields that depend on the function's type.
Broadly speaking, each datapoint largely mirrors the structure of an inference, with an input, an optional output, and other associated metadata (e.g. tags).
You can find a complete runnable example of how to use the datasets and datapoints API in our [GitHub repository](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/datasets-datapoints).
## Endpoints & Methods
### List datapoints in a dataset
This endpoint returns a list of datapoints in the dataset.
Each datapoint is an object that includes all the relevant fields (e.g. input, output, tags).
* **Gateway Endpoint:** `GET /datasets/{dataset_name}/datapoints`
* **Client Method:** `list_datapoints`
* **Parameters:**
* `dataset_name` (string)
* `function` (string, optional)
* `limit` (int, optional, defaults to 100)
* `offset` (int, optional, defaults to 0)
If `function` is set, this method only returns datapoints in the dataset for the specified function.
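For example, here's a sketch of listing the first ten `generate_haiku` datapoints in a hypothetical `haiku_examples` dataset, assuming the optional parameters are passed as query parameters:

```sh theme={null}
curl -X GET "http://localhost:3000/datasets/haiku_examples/datapoints?function=generate_haiku&limit=10&offset=0"
```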
### Get a datapoint
This endpoint returns the datapoint with the given ID, including all the relevant fields (e.g. input, output, tags).
* **Gateway Endpoint:** `GET /datasets/{dataset_name}/datapoints/{datapoint_id}`
* **Client Method:** `get_datapoint`
* **Parameters:**
* `dataset_name` (string)
* `datapoint_id` (string)
### Add datapoints to a dataset (or create a dataset)
This endpoint adds a list of datapoints to a dataset.
If the dataset does not exist, it will be created with the given name.
* **Gateway Endpoint:** `POST /datasets/{dataset_name}/datapoints`
* **Client Method:** `create_datapoints`
* **Parameters:**
* `dataset_name` (string)
* `datapoints` (list of objects, see below)
For `chat` functions, each datapoint object must have the following fields:
* `function_name` (string)
* `input` (object, identical to an inference's `input`)
* `output` (a list of objects, optional, each object must be a content block like in an inference's output)
* `allowed_tools` (list of strings, optional, identical to an inference's `allowed_tools`)
* `tool_choice` (string, optional, identical to an inference's `tool_choice`)
* `parallel_tool_calls` (boolean, optional, defaults to `false`)
* `tags` (map of string to string, optional)
* `name` (string, optional)
For `json` functions, each datapoint object must have the following fields:
* `function_name` (string)
* `input` (object, identical to an inference's `input`)
* `output` (object, optional, an object that matches the `output_schema` of the function)
* `output_schema` (object, optional, a dynamic JSON schema that overrides the output schema of the function)
* `tags` (map of string to string, optional)
* `name` (string, optional)
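As a minimal sketch, assuming a hypothetical chat function `generate_haiku` and that the request body wraps the list in a `datapoints` field, adding two datapoints (one with a reference output, one without) might look like:

```sh theme={null}
curl -X POST http://localhost:3000/datasets/haiku_examples/datapoints \
  -H "Content-Type: application/json" \
  -d '{
    "datapoints": [
      {
        "function_name": "generate_haiku",
        "input": {
          "messages": [{ "role": "user", "content": "Write a haiku about TensorZero." }]
        },
        "output": [{ "type": "text", "text": "Data flows like streams, feedback lights the hidden path, models learn and grow." }],
        "tags": { "source": "docs-example" }
      },
      {
        "function_name": "generate_haiku",
        "input": {
          "messages": [{ "role": "user", "content": "Write a haiku about general aviation." }]
        }
      }
    ]
  }'
```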
### Update datapoints in a dataset
This endpoint updates one or more datapoints in a dataset by creating new versions.
The original datapoint is marked as stale (i.e. a soft deletion), and a new datapoint is created with the updated values and a new ID.
The response returns the newly created IDs.
* **Gateway Endpoint:** `PATCH /v1/datasets/{dataset_name}/datapoints`
* **Client Method:** `update_datapoints`
Each object must have the fields `id` (string, UUIDv7) and `type` (`"chat"` or `"json"`).
The following fields are optional.
If provided, they will update the corresponding fields in the datapoint.
If omitted, the fields will remain unchanged.
If set to `null`, the fields will be cleared (as long as they are nullable).
For `chat` functions, you can update the following fields:
* `input` (object) - replaces the datapoint's input
* `output` (list of content blocks) - replaces the datapoint's output
* `tool_params` (object or null) - replaces the tool configuration (can be set to `null` to clear)
* `tags` (map of string to string) - replaces all tags
* `metadata` (object) - updates metadata fields:
* `name` (string or null) - replaces the name (can be set to `null` to clear)
For `json` functions, you can update the following fields:
* `input` (object) - replaces the datapoint's input
* `output` (object or null) - replaces the output (validated against the output schema; can be set to `null` to clear)
* `output_schema` (object) - replaces the output schema
* `tags` (map of string to string) - replaces all tags
* `metadata` (object) - updates metadata fields:
* `name` (string or null) - replaces the name (can be set to `null` to clear)
If you're only updating datapoint metadata (e.g. `name`), the `update_datapoints_metadata` method below is an alternative that does not affect the datapoint ID.
The endpoint returns an object with `ids`, a list of IDs (strings, UUIDv7) of the updated datapoints.
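Continuing the sketch above (with a placeholder datapoint ID, and again assuming the request body wraps the list in a `datapoints` field), replacing the output and name of a chat datapoint might look like:

```sh theme={null}
curl -X PATCH http://localhost:3000/v1/datasets/haiku_examples/datapoints \
  -H "Content-Type: application/json" \
  -d '{
    "datapoints": [
      {
        "id": "01958101-0b9a-7d10-a0ac-d68ad2f11cd9",
        "type": "chat",
        "output": [{ "type": "text", "text": "Gateways hum softly, every inference recorded, models growing wise." }],
        "metadata": { "name": "tensorzero-haiku-v2" }
      }
    ]
  }'
```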
### Update datapoint metadata
This endpoint updates metadata fields for one or more datapoints in a dataset.
Unlike updating the full datapoint, this operation updates the datapoint in-place without creating a new version.
* **Gateway Endpoint:** `PATCH /v1/datasets/{dataset_name}/datapoints/metadata`
* **Client Method:** `update_datapoints_metadata`
* **Parameters:**
* `dataset_name` (string)
* `datapoints` (list of objects, see below)
The `datapoints` field must contain a list of objects.
Each object must have the field `id` (string, UUIDv7).
The following field is optional:
* `metadata` (object) - updates metadata fields:
* `name` (string or null) - replaces the name (can be set to `null` to clear)
If the `metadata` field is omitted or `null`, no changes will be made to the datapoint.
The endpoint returns an object with `ids`, a list of IDs (strings, UUIDv7) of the updated datapoints.
These IDs are the same as the input IDs since the datapoints are updated in-place.
### Delete a datapoint
This endpoint performs a **soft deletion**: the datapoint is marked as stale and will be disregarded by the system in the future (e.g. when listing datapoints or running evaluations), but the data remains in the database.
* **Gateway Endpoint:** `DELETE /datasets/{dataset_name}/datapoints/{datapoint_id}`
* **Client Method:** `delete_datapoint`
* **Parameters:**
* `dataset_name` (string)
* `datapoint_id` (string)
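For example, here's a sketch of soft-deleting a single datapoint (using a placeholder dataset name and datapoint ID):

```sh theme={null}
curl -X DELETE http://localhost:3000/datasets/haiku_examples/datapoints/01958101-0b9a-7d10-a0ac-d68ad2f11cd9
```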
# API Reference: Feedback
Source: https://www.tensorzero.com/docs/gateway/api-reference/feedback
API reference for the `/feedback` endpoint.
## `POST /feedback`
The `/feedback` endpoint assigns feedback to a particular inference or episode.
Each feedback is associated with a metric that is defined in the configuration file.
### Request
#### `dryrun`
* **Type:** boolean
* **Required:** no
If `true`, the feedback request will be executed but won't be stored to the database (i.e. no-op).
This field is primarily for debugging and testing, and you should ignore it in production.
#### `episode_id`
* **Type:** UUID
* **Required:** when the metric level is `episode`
The episode ID to provide feedback for.
You should use this field when the metric level is `episode`.
Only use episode IDs that were returned by the TensorZero gateway.
#### `inference_id`
* **Type:** UUID
* **Required:** when the metric level is `inference`
The inference ID to provide feedback for.
You should use this field when the metric level is `inference`.
Only use inference IDs that were returned by the TensorZero gateway.
#### `metric_name`
* **Type:** string
* **Required:** yes
The name of the metric to provide feedback for.
For example, if your metric is defined as `[metrics.draft_accepted]` in your configuration file, then you would set `metric_name: "draft_accepted"`.
The metric names `comment` and `demonstration` are reserved for special types of feedback.
A `comment` is free-form text (string) that can be assigned to either an inference or an episode.
The `demonstration` metric accepts values that would be a valid output.
See [Metrics & Feedback](/gateway/guides/metrics-feedback/) for more details.
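For example, here's a sketch of assigning a free-form `comment` to an inference (using a placeholder inference ID):

```sh theme={null}
curl -X POST http://localhost:3000/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "inference_id": "00000000-0000-0000-0000-000000000000",
    "metric_name": "comment",
    "value": "The draft was too verbose for our use case."
  }'
```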
#### `tags`
* **Type:** flat JSON object with string keys and values
* **Required:** no
User-provided tags to associate with the feedback.
For example, `{"user_id": "123"}` or `{"author": "Alice"}`.
#### `value`
* **Type:** varies
* **Required:** yes
The value of the feedback.
The type of the value depends on the metric type (e.g. boolean for a metric with `type = "boolean"`).
### Response
#### `feedback_id`
* **Type:** UUID
The ID assigned to the feedback.
### Examples
#### Inference-Level Boolean Metric
##### Configuration
```toml mark="boolean" mark="draft_accepted" mark="inference" theme={null}
# tensorzero.toml
# ...
[metrics.draft_accepted]
type = "boolean"
level = "inference"
# ...
```
##### Request
```python frame="code" title="POST /feedback" mark="True" mark="draft_accepted" mark="inference_id" theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
result = await client.feedback(
inference_id="00000000-0000-0000-0000-000000000000",
metric_name="draft_accepted",
value=True,
)
```
```bash frame="code" title="POST /feedback" mark="true" mark="draft_accepted" mark="inference_id" theme={null}
curl -X POST http://localhost:3000/feedback \
-H "Content-Type: application/json" \
-d '{
"inference_id": "00000000-0000-0000-0000-000000000000",
"metric_name": "draft_accepted",
"value": true,
}'
```
##### Response
```json frame="code" title="POST /feedback" theme={null}
{ "feedback_id": "11111111-1111-1111-1111-111111111111" }
```
#### Episode-Level Float Metric
##### Configuration
```toml mark="float" mark="user_rating" mark="episode" mark="float" theme={null}
# tensorzero.toml
# ...
[metrics.user_rating]
type = "float"
level = "episode"
# ...
```
##### Request
```python frame="code" title="POST /feedback" mark="10" mark="user_rating" mark="episode_id" theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
result = await client.feedback(
episode_id="00000000-0000-0000-0000-000000000000",
metric_name="user_rating",
value=10,
)
```
```bash frame="code" title="POST /feedback" mark="10" mark="user_rating" mark="episode_id" theme={null}
curl -X POST http://localhost:3000/feedback \
-H "Content-Type: application/json" \
-d '{
"episode_id": "00000000-0000-0000-0000-000000000000",
"metric_name": "user_rating",
"value": 10
}'
```
##### Response
```json frame="code" title="POST /feedback" theme={null}
{ "feedback_id": "11111111-1111-1111-1111-111111111111" }
```
# API Reference: Inference
Source: https://www.tensorzero.com/docs/gateway/api-reference/inference
API reference for the `/inference` endpoint.
## `POST /inference`
The inference endpoint is the core of the TensorZero Gateway API.
Under the hood, the gateway validates the request, samples a variant from the function, handles templating when applicable, and routes the inference to the appropriate model provider.
If a problem occurs, it attempts to gracefully fallback to a different model provider or variant.
After a successful inference, it returns the data to the client and asynchronously stores structured information in the database.
See the [API Reference for `POST /openai/v1/chat/completions`](/gateway/api-reference/inference-openai-compatible/) for an inference endpoint compatible with the OpenAI API.
### Request
#### `additional_tools`
* **Type:** a list of tools (see below)
* **Required:** no (default: `[]`)
A list of tools defined at inference time that the model is allowed to call.
This field allows for dynamic tool use, i.e. defining tools at runtime.
You should prefer to define tools in the configuration file if possible.
Only use this field if dynamic tool use is necessary for your use case.
##### Function Tools
Function tools are the typical tools used with LLMs.
Function tools use JSON Schema to define their parameters.
Each function tool is an object with the following fields:
* `name` (string, required): The name of the tool
* `description` (string, required): A description of what the tool does
* `parameters` (object, required): A JSON Schema defining the tool's parameters
* `strict` (boolean, optional): Whether to enforce strict schema validation (defaults to `false`)
See [Configuration Reference](/gateway/configuration-reference/#toolstool_name) for more details.
##### OpenAI Custom Tools
OpenAI custom tools are only supported by OpenAI models (both Chat Completions and Responses APIs).
Using custom tools with other providers will result in an error.
OpenAI custom tools support alternative output formats beyond JSON Schema, such as freeform text or grammar-constrained output.
Each custom tool is an object with the following fields:
* `type` (string, required): Must be `"openai_custom"`
* `name` (string, required): The name of the tool
* `description` (string, optional): A description of what the tool does
* `format` (object, optional): The output format for the tool (see below)
The `format` field can be one of:
* `{"type": "text"}`: Freeform text output
* `{"type": "grammar", "grammar": {"syntax": "lark", "definition": "..."}}`: Output constrained by a [Lark grammar](https://lark-parser.readthedocs.io/)
* `{"type": "grammar", "grammar": {"syntax": "regex", "definition": "..."}}`: Output constrained by a regular expression
```json theme={null}
{
"model_name": "openai::gpt-5-mini",
"input": {
"messages": [
{
"role": "user",
"content": "Generate Python code to print 'Hello, World!'"
}
]
},
"additional_tools": [
{
"type": "openai_custom",
"name": "code_generator",
"description": "Generates Python code snippets",
"format": { "type": "text" }
}
]
}
```
```json theme={null}
{
"model_name": "openai::gpt-5-mini",
"input": {
"messages": [
{ "role": "user", "content": "Format the phone number 4155550123" }
]
},
"additional_tools": [
{
"type": "openai_custom",
"name": "phone_formatter",
"description": "Formats phone numbers in XXX-XXX-XXXX format",
"format": {
"type": "grammar",
"grammar": {
"syntax": "regex",
"definition": "^\\d{3}-\\d{3}-\\d{4}$"
}
}
}
]
}
```
#### `allowed_tools`
* **Type:** list of strings
* **Required:** no
A list of tool names that the model is allowed to call.
The tools must be defined in the configuration file or provided dynamically via `additional_tools`.
The names should be the configuration keys (e.g. `foo` from `[tools.foo]`), not the display names shown to the LLM (e.g. `bar` from `tools.foo.name = "bar"`).
Some providers (notably OpenAI) natively support restricting allowed tools.
For these providers, we send all tools (both configured and dynamic) to the provider, and separately specify which ones are allowed to be called.
For providers that do not natively support this feature, we filter the tool list ourselves and only send the allowed tools to the provider.
#### `cache_options`
* **Type:** object
* **Required:** no (default: `{"enabled": "write_only"}`)
Options for controlling inference caching behavior.
The object has the fields below.
See [Inference Caching](/gateway/guides/inference-caching/) for more details.
##### `cache_options.enabled`
* **Type:** string
* **Required:** no (default: `"write_only"`)
The cache mode to use.
Must be one of:
* `"write_only"` (default): Only write to cache but don't serve cached responses
* `"read_only"`: Only read from cache but don't write new entries
* `"on"`: Both read from and write to cache
* `"off"`: Disable caching completely
Note: When using `dryrun=true`, the gateway never writes to the cache.
##### `cache_options.max_age_s`
* **Type:** integer
* **Required:** no (default: `null`)
Maximum age in seconds for cache entries.
If set, cached responses older than this value will not be used.
For example, if you set `max_age_s=3600`, the gateway will only use cache entries that were created in the last hour.
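For example, here's a sketch of an inference request that both reads from and writes to the cache, but only accepts cache entries created in the last hour:

```sh theme={null}
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "openai::gpt-4o-mini",
    "input": {
      "messages": [{ "role": "user", "content": "What is the capital of Japan?" }]
    },
    "cache_options": { "enabled": "on", "max_age_s": 3600 }
  }'
```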
#### `credentials`
* **Type:** object (a map from dynamic credential names to API keys)
* **Required:** no (default: no credentials)
Each model provider in your TensorZero configuration can be configured to accept credentials at inference time by using the `dynamic` location (e.g. `dynamic::my_dynamic_api_key_name`).
See the [configuration reference](/gateway/configuration-reference/#modelsmodel_nameprovidersprovider_name) for more details.
The gateway expects the credentials to be provided in the `credentials` field of the request body as specified below.
The gateway will return a 400 error if the credentials are not provided and the model provider has been configured with dynamic credentials.
```toml theme={null}
[models.my_model_name.providers.my_provider_name]
# ...
# Note: the name of the credential field (e.g. `api_key_location`) depends on the provider type
api_key_location = "dynamic::my_dynamic_api_key_name"
# ...
```
```json theme={null}
{
// ...
"credentials": {
// ...
"my_dynamic_api_key_name": "sk-..."
// ...
}
// ...
}
```
#### `dryrun`
* **Type:** boolean
* **Required:** no
If `true`, the inference request will be executed but won't be stored to the database.
The gateway will still call the downstream model providers.
This field is primarily for debugging and testing, and you should generally not use it in production.
#### `episode_id`
* **Type:** UUID
* **Required:** no
The ID of an existing episode to associate the inference with.
If null, the gateway will generate a new episode ID and return it in the response.
See [Episodes](/gateway/guides/episodes) for more information.
#### `extra_body`
* **Type:** array of objects (see below)
* **Required:** no
The `extra_body` field allows you to modify the request body that TensorZero sends to a model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
Each object in the array must have two or three fields:
* `pointer`: A [JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901) string specifying where to modify the request body
* Use `-` as the final path element to append to an array (e.g., `/messages/-` appends to `messages`)
* One of the following:
* `value`: The value to insert at that location; it can be of any type including nested types
* `delete = true`: Deletes the field at the specified location, if present.
* Optional: If one of the following is specified, the modification will only be applied to the specified variant, model, or model provider. If none of them is specified, the modification applies to all model inferences.
* `variant_name`
* `model_name`
* `model_name` and `provider_name`
You can also set `extra_body` in the configuration file.
The values provided at inference-time take priority over the values in the configuration file.
If TensorZero would normally send this request body to the provider...
```json theme={null}
{
"project": "tensorzero",
"safety_checks": {
"no_internet": false,
"no_agi": true
}
}
```
...then the following `extra_body` in the inference request...
```json theme={null}
{
// ...
"extra_body": [
{
"variant_name": "my_variant", // or "model_name": "my_model", "provider_name": "my_provider"
"pointer": "/agi",
"value": true
},
{
// No `variant_name` or `model_name`/`provider_name` specified, so it applies to all variants and providers
"pointer": "/safety_checks/no_agi",
"value": {
"bypass": "on"
}
}
]
}
```
...overrides the request body to:
```json theme={null}
{
"agi": true,
"project": "tensorzero",
"safety_checks": {
"no_internet": false,
"no_agi": {
"bypass": "on"
}
}
}
```
#### `extra_headers`
* **Type:** array of objects (see below)
* **Required:** no
The `extra_headers` field allows you to modify the request headers that TensorZero sends to a model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
Each object in the array must have two or three fields:
* `name`: The name of the header to modify
* `value`: The value to set the header to
* Optional: If one of the following is specified, the modification will only be applied to the specified variant, model, or model provider. If none of them is specified, the modification applies to all model inferences.
* `variant_name`
* `model_name`
* `model_name` and `provider_name`
You can also set `extra_headers` in the configuration file.
The values provided at inference-time take priority over the values in the configuration file.
If TensorZero would normally send the following request headers to the provider...
```text theme={null}
Safety-Checks: on
```
...then the following `extra_headers`...
```json theme={null}
{
"extra_headers": [
{
"variant_name": "my_variant", // or "model_name": "my_model", "provider_name": "my_provider"
"name": "Safety-Checks",
"value": "off"
},
{
// No `variant_name` or `model_name`/`provider_name` specified, so it applies to all variants and providers
"name": "Intelligence-Level",
"value": "AGI"
}
]
}
```
...overrides the request headers so that `Safety-Checks` is set to `off` only for `my_variant`, while `Intelligence-Level: AGI` is applied globally to all variants and providers:
```text theme={null}
Safety-Checks: off
Intelligence-Level: AGI
```
#### `function_name`
* **Type:** string
* **Required:** either `function_name` or `model_name` must be provided
The name of the function to call.
The function must be defined in the configuration file.
Alternatively, you can use the `model_name` field to call a model directly, without the need to define a function.
See below for more details.
#### `include_raw_response`
* **Type:** boolean
* **Required:** no
If `true`, the raw responses from all model inferences will be included in the response in the `raw_response` field as an array.
See `raw_response` in the [response](#response) section for more details.
#### `include_raw_usage`
* **Type:** boolean
* **Required:** no
If `true`, the response's `usage` object will include a `raw_usage` field containing an array of raw provider-specific usage data from each model inference.
This is useful for accessing provider-specific usage fields that TensorZero normalizes away, such as OpenAI's `reasoning_tokens` or Anthropic's `cache_read_input_tokens`.
See `raw_usage` in the [response](#response) section for more details.
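For example, here's a sketch of a request that asks for both the raw provider responses and the raw usage data:

```sh theme={null}
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "openai::gpt-4o-mini",
    "input": {
      "messages": [{ "role": "user", "content": "Write a haiku about observability." }]
    },
    "include_raw_response": true,
    "include_raw_usage": true
  }'
```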
#### `input`
* **Type:** varies
* **Required:** yes
The input to the function.
The type of the input depends on the function type.
##### `input.messages`
* **Type:** list of messages (see below)
* **Required:** no (default: `[]`)
A list of messages to provide to the model.
Each message is an object with the following fields:
* `role`: The role of the message (`assistant` or `user`).
* `content`: The content of the message (see below).
The `content` field can have one of the following types:
* string: the text for a text message (only allowed if there is no schema for that role)
* list of content blocks: the content blocks for the message (see below)
A content block is an object with the field `type` and additional fields depending on the type.
If the content block has type `text`, it must have one of the following additional fields:
* `text`: The text for the content block.
* `arguments`: A JSON object containing the function arguments for TensorZero functions with templates and schemas (see [Create a prompt template](/gateway/create-a-prompt-template) for details).
If the content block has type `tool_call`, it must have the following additional fields:
* `arguments`: The arguments for the tool call.
* `id`: The ID for the content block.
* `name`: The name of the tool for the content block.
If the content block has type `tool_result`, it must have the following additional fields:
* `id`: The ID for the content block.
* `name`: The name of the tool for the content block.
* `result`: The result of the tool call.
If the content block has type `file`, it must have exactly one of the following additional fields:
* File URLs
* `file_type`: must be `url`
* `url`
* `mime_type` (optional): override the MIME type of the file
* `detail` (optional): controls the fidelity of image processing. Only applies to image files; ignored for other file types. Can be `low`, `high`, or `auto`. Affects token consumption and image quality. Only supported by some model providers; ignored otherwise.
* `filename` (optional): a filename to associate with the file
* Base64-encoded Files
* `file_type`: must be `base64`
* `data`: `base64`-encoded data for an embedded file
* `mime_type` (optional): the MIME type (e.g. `image/png`, `image/jpeg`, `application/pdf`). If not provided, TensorZero will attempt to infer the MIME type from the file's magic bytes.
* `detail` (optional): controls the fidelity of image processing. Only applies to image files; ignored for other file types. Can be `low`, `high`, or `auto`. Affects token consumption and image quality. Only supported by some model providers; ignored otherwise.
* `filename` (optional): a filename to associate with the file
See the [Multimodal Inference](/gateway/guides/multimodal-inference/) guide for more details on how to use images in inference.
If the content block has type `raw_text`, it must have the following additional fields:
* `value`: The text for the content block.
This content block will ignore any relevant templates and schemas for this function.
If the content block has type `thought`, it must have the following additional fields:
* `text`: The text for the content block.
If the content block has type `unknown`, it must have the following additional fields:
* `data`: The original content block from the provider, without any validation or transformation by TensorZero.
* `model_name` (string, optional): A model name in your configuration (e.g. `my_gpt_5`) or a short-hand model name (e.g. `openai::gpt-5`). If set, the content block will only be provided to this specific model.
* `provider_name` (string, optional): A provider name for the model you specified (e.g. `my_openai`). If set, the content block will only be provided to this specific provider for the model.
If neither `model_name` nor `provider_name` is set, the content block is passed to all model providers.
For example, the following hypothetical unknown content block will send the `daydreaming` content block to inference requests targeting the `your_provider_name` provider for `your_model_name`.
```json theme={null}
{
"type": "unknown",
"data": {
"type": "daydreaming",
"dream": "..."
},
"model_name": "your_model_name",
"provider_name": "your_provider_name"
}
```
This is the most complex field in the entire API. See the example below for more details.
```json theme={null}
{
// ...
"input": {
"messages": [
// If you don't have a user (or assistant) schema...
{
"role": "user", // (or "assistant")
"content": "What is the weather in Tokyo?"
},
// If you have a user (or assistant) schema...
{
"role": "user", // (or "assistant")
"content": [
{
"type": "text",
"arguments": {
"location": "Tokyo"
}
}
]
},
// If the model previously called a tool...
{
"role": "assistant",
"content": [
{
"type": "tool_call",
"id": "0",
"name": "get_temperature",
"arguments": "{\"location\": \"Tokyo\"}"
}
]
},
// ...and you're providing the result of that tool call...
{
"role": "user",
"content": [
{
"type": "tool_result",
"id": "0",
"name": "get_temperature",
"result": "70"
}
]
},
// You can also specify a text message using a content block...
{
"role": "user",
"content": [
{
"type": "text",
"text": "What about NYC?" // (or object if there is a schema)
}
]
},
// You can also provide multiple content blocks in a single message...
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "Sure, I can help you with that." // (or object if there is a schema)
},
{
"type": "tool_call",
"id": "0",
"name": "get_temperature",
"arguments": "{\"location\": \"New York\"}"
}
]
}
// ...
]
// ...
}
// ...
}
```
##### `input.system`
* **Type:** string or object
* **Required:** no
The input for the system message.
If the function does not have a system schema, this field should be a string.
If the function has a system schema, this field should be an object that matches the schema.
#### `model_name`
* **Type:** string
* **Required:** either `model_name` or `function_name` must be provided
The name of the model to call.
Under the hood, the gateway will use a built-in passthrough chat function called `tensorzero::default`.
| To call... | Use this format... |
| --- | --- |
| A function defined as `[functions.my_function]` in your `tensorzero.toml` configuration file | `function_name="my_function"` (not `model_name`) |
| A model defined as `[models.my_model]` in your `tensorzero.toml` configuration file | `model_name="my_model"` |
| A model offered by a model provider, without defining it in your `tensorzero.toml` configuration file (if supported, see below) | `model_name="{provider_type}::{model_name}"` |
The following model providers support short-hand model names: `anthropic`, `deepseek`, `fireworks`, `gcp_vertex_anthropic`, `gcp_vertex_gemini`, `google_ai_studio_gemini`, `groq`, `hyperbolic`, `mistral`, `openai`, `openrouter`, `together`, and `xai`.
For example, if you have the following configuration:
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o]
routing = ["openai", "azure"]
[models.gpt-4o.providers.openai]
# ...
[models.gpt-4o.providers.azure]
# ...
[functions.extract-data]
# ...
```
Then:
* `function_name="extract-data"` calls the `extract-data` function defined above.
* `model_name="gpt-4o"` calls the `gpt-4o` model in your configuration, which supports fallback from `openai` to `azure`. See [Retries & Fallbacks](/gateway/guides/retries-fallbacks/) for details.
* `model_name="openai::gpt-4o"` calls the OpenAI API directly for the `gpt-4o` model, ignoring the `gpt-4o` model defined above.
Be careful about the different prefixes: `model_name="gpt-4o"` will use the `[models.gpt-4o]` model defined in the `tensorzero.toml` file, whereas `model_name="openai::gpt-4o"` will call the OpenAI API directly for the `gpt-4o` model.
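For example, given the configuration above, these sketches call the configured `gpt-4o` model and the OpenAI API directly:

```sh theme={null}
# Calls the `[models.gpt-4o]` model from tensorzero.toml (with fallback from `openai` to `azure`)
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "gpt-4o",
    "input": { "messages": [{ "role": "user", "content": "Hello!" }] }
  }'

# Calls the OpenAI API directly via the short-hand name, ignoring the `[models.gpt-4o]` entry
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "openai::gpt-4o",
    "input": { "messages": [{ "role": "user", "content": "Hello!" }] }
  }'
```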
#### `output_schema`
* **Type:** object (valid JSON Schema)
* **Required:** no
If set, this schema will override the `output_schema` defined in the function configuration for a JSON function.
This dynamic output schema is used for validating the output of the function, and sent to providers which support structured outputs.
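For example, here's a sketch of overriding the output schema of a hypothetical JSON function `extract_contact_info` at inference time:

```sh theme={null}
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "extract_contact_info",
    "input": {
      "messages": [{ "role": "user", "content": "You can reach Alice at alice@example.com." }]
    },
    "output_schema": {
      "type": "object",
      "properties": { "email": { "type": "string" } },
      "required": ["email"],
      "additionalProperties": false
    }
  }'
```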
#### `otlp_traces_extra_headers`
* **Type:** object (a map from string to string)
* **Required:** no (default: `{}`)
Dynamic headers to include in OTLP trace exports for this specific inference request.
This is useful for adding per-request metadata to OTLP trace exports (e.g. user IDs, request sources).
The headers are automatically prefixed with `tensorzero-otlp-traces-extra-header-` before being sent to the OTLP endpoint.
These headers are merged with any static headers configured in `export.otlp.traces.extra_headers`.
When the same header key is present in both static and dynamic headers, the dynamic header value takes precedence.
See [Export OpenTelemetry traces](/operations/export-opentelemetry-traces#send-custom-http-headers) for more details and examples.
#### `parallel_tool_calls`
* **Type:** boolean
* **Required:** no
If `true`, the function will be allowed to request multiple tool calls in a single conversation turn.
If not set, we default to the configuration value for the function being called.
Most model providers do not support parallel tool calls. In those cases, the gateway ignores this field.
At the moment, only Fireworks AI and OpenAI support parallel tool calls.
#### `params`
* **Type:** object (see below)
* **Required:** no (default: `{}`)
Override inference-time parameters for a particular variant type.
This field allows for dynamic inference parameters, i.e. defining parameters at runtime.
This field's format is `{ variant_type: { param: value, ... }, ... }`.
You should prefer to set these parameters in the configuration file if possible.
Only use this field if you need to set these parameters dynamically at runtime.
Note that the parameters will apply to every variant of the specified type.
Currently, we support the following:
* `chat_completion`
* `frequency_penalty`
* `json_mode`
* `max_tokens`
* `presence_penalty`
* `reasoning_effort`
* `seed`
* `service_tier`
* `stop_sequences`
* `temperature`
* `thinking_budget_tokens`
* `top_p`
* `verbosity`
See [Configuration Reference](/gateway/configuration-reference/#functionsfunction_namevariantsvariant_name) for more details on the parameters, and Examples below for usage.
For example, if you wanted to dynamically override the `temperature` parameter for `chat_completion` variants, you'd include the following in the request body:
```json theme={null}
{
// ...
"params": {
"chat_completion": {
"temperature": 0.7
}
}
// ...
}
```
See ["Chat Function with Dynamic Inference Parameters"](#chat-function-with-dynamic-inference-parameters) for a complete example.
#### `provider_tools`
* **Type:** array of objects
* **Required:** no (default: `[]`)
A list of provider-specific built-in tools defined at inference time that can be used by the model.
These are tools that run server-side on the provider's infrastructure, such as OpenAI's web search tool.
Each object in the array has the following fields:
* `scope` (object, optional): Limits which model/provider combination can use this tool. If omitted, the tool is available to all compatible providers.
* `model_name` (string): The model name as defined in your configuration
* `provider_name` (string, optional): The provider name for that model. If omitted, the tool is available to all providers for the specified model.
* `tool` (object, required): The provider-specific tool configuration as defined by the provider's API
This field allows for dynamic provider tool use at runtime.
You should prefer to define provider tools in the configuration file if possible (see [Configuration Reference](/gateway/configuration-reference/#provider_tools)).
Only use this field if dynamic provider tool configuration is necessary for your use case.
```json theme={null}
{
"function_name": "my_function",
"input": {
"messages": [
{
"role": "user",
"content": "What were the latest developments in AI this week?"
}
]
},
"provider_tools": [
{
"tool": {
"type": "web_search"
}
}
]
}
```
This makes the web search tool available to all compatible providers configured for the function.
```json theme={null}
{
"function_name": "my_function",
"input": {
"messages": [
{
"role": "user",
"content": "What were the latest developments in AI this week?"
}
]
},
"provider_tools": [
{
"scope": {
"model_name": "gpt-5-mini",
"provider_name": "openai"
},
"tool": {
"type": "web_search"
}
}
]
}
```
This makes the web search tool available only to the OpenAI provider for the `gpt-5-mini` model.
#### `stream`
* **Type:** boolean
* **Required:** no
If `true`, the gateway will stream the response from the model provider.
#### `tags`
* **Type:** flat JSON object with string keys and values
* **Required:** no
User-provided tags to associate with the inference.
For example, `{"user_id": "123"}` or `{"author": "Alice"}`.
#### `tool_choice`
* **Type:** string
* **Required:** no
If set, overrides the tool choice strategy for the request.
The supported tool choice strategies are:
* `none`: The function should not use any tools.
* `auto`: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
* `required`: The model should use a tool. If multiple tools are available, the model decides which tool to use.
* `{ specific = "tool_name" }`: The model should use a specific tool. The tool must be defined in the `tools` section of the configuration file or provided in `additional_tools`.
#### `variant_name`
* **Type:** string
* **Required:** no
If set, pins the inference request to a particular variant (not recommended).
You should generally not set this field, and instead let the TensorZero gateway assign a variant.
This field is primarily used for testing or debugging purposes.
### Response
The response format depends on the function type (as defined in the configuration file) and whether the response is streamed or not.
#### Chat Function
When the function type is `chat`, the response is structured as follows.
In regular (non-streaming) mode, the response is a JSON object with the following fields:
##### `content`
* **Type:** a list of content blocks (see below)
The content blocks generated by the model.
A content block can have `type` equal to `text` or `tool_call`.
Reasoning models (e.g. DeepSeek R1) might also include `thought` content blocks.
If `type` is `text`, the content block has the following fields:
* `text`: The text for the content block.
If `type` is `tool_call`, the content block has the following fields:
* `arguments` (object): The validated arguments for the tool call (`null` if invalid).
* `id` (string): The ID of the content block.
* `name` (string): The validated name of the tool (`null` if invalid).
* `raw_arguments` (string): The arguments for the tool call generated by the model (which might be invalid).
* `raw_name` (string): The name of the tool generated by the model (which might be invalid).
If `type` is `thought`, the content block has the following fields:
* `text` (string): The text of the thought.
If the model provider responds with a content block of an unknown type, it will be included in the response as a content block of type `unknown` with the following additional fields:
* `data`: The original content block from the provider, without any validation or transformation by TensorZero.
* `model_name` (string, optional): The model name that returned the content block.
* `provider_name` (string, optional): The provider name that returned the content block.
For example, if the model provider `your_provider_name` for `your_model_name` returns a content block of type `daydreaming`, it will be included in the response like this:
```json theme={null}
{
"type": "unknown",
"data": {
"type": "daydreaming",
"dream": "..."
},
"model_name": "your_model_name",
"provider_name": "your_provider_name"
}
```
##### `episode_id`
* **Type:** UUID
The ID of the episode associated with the inference.
##### `inference_id`
* **Type:** UUID
The ID assigned to the inference.
##### `raw_response`
* **Type:** array (optional, only when `include_raw_response` is `true`)
An array of raw provider-specific response data from all model inferences. Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data`: The raw response string from the provider.
For complex variants like `experimental_best_of_n_sampling`, this includes raw responses from all candidate inferences as well as the evaluator/fuser inference.
##### `variant_name`
* **Type:** string
The name of the variant used for the inference.
##### `usage`
* **Type:** object (optional)
The usage metrics for the inference.
The object has the following fields:
* `input_tokens`: The number of input tokens used for the inference.
* `output_tokens`: The number of output tokens used for the inference.
##### `raw_usage`
* **Type:** array (optional, only when `include_raw_usage` is `true`)
An array of raw provider-specific usage data. Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data` (optional): The raw usage object from the provider. The field is optional because some providers don't return usage.
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
##### `content`
* **Type:** a list of content block chunks (see below)
The content deltas for the inference.
A content block chunk can have `type` equal to `text` or `tool_call`.
Reasoning models (e.g. DeepSeek R1) might also include `thought` content block chunks.
If `type` is `text`, the chunk has the following fields:
* `id`: The ID of the content block.
* `text`: The text delta for the content block.
If `type` is `tool_call`, the chunk has the following fields (all strings):
* `id`: The ID of the content block.
* `raw_name`: The string delta of the name of the tool.
* `raw_arguments`: The string delta of the arguments for the tool call.
If `type` is `thought`, the chunk has the following fields:
* `id`: The ID of the content block.
* `text`: The text delta for the thought.
##### `episode_id`
* **Type:** UUID
The ID of the episode associated with the inference.
##### `inference_id`
* **Type:** UUID
The ID assigned to the inference.
##### `variant_name`
* **Type:** string
The name of the variant used for the inference.
##### `usage`
* **Type:** object (optional)
The usage metrics for the inference.
The object has the following fields:
* `input_tokens`: The number of input tokens used for the inference.
* `output_tokens`: The number of output tokens used for the inference.
##### `raw_usage`
* **Type:** array (optional, only when `include_raw_usage` is `true`)
An array of raw provider-specific usage data. Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data` (optional): The raw usage object from the provider. The field is optional because some providers don't return usage.
##### `raw_response`
* **Type:** array (optional, only when `include_raw_response` is `true`)
An array of raw provider-specific response data from model inferences that occurred before the current streaming inference (e.g., candidate inferences in `experimental_best_of_n_sampling`). This appears in early chunks of the stream.
Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data`: The raw response string from the provider.
##### `raw_chunk`
* **Type:** string (optional, only when `include_raw_response` is `true`)
The raw chunk from the current streaming model inference. This is included in content-bearing chunks (typically all chunks except the first metadata chunk and final usage-only chunk).
#### JSON Function
When the function type is `json`, the response is structured as follows.
In regular (non-streaming) mode, the response is a JSON object with the following fields:
##### `inference_id`
* **Type:** UUID
The ID assigned to the inference.
##### `episode_id`
* **Type:** UUID
The ID of the episode associated with the inference.
##### `raw_response`
* **Type:** array (optional, only when `include_raw_response` is `true`)
An array of raw provider-specific response data from all model inferences. Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data`: The raw response string from the provider.
For complex variants like `experimental_best_of_n_sampling`, this includes raw responses from all candidate inferences as well as the evaluator/fuser inference.
##### `output`
* **Type:** object (see below)
The output object contains the following fields:
* `raw`: The raw response from the model provider (which might be invalid JSON).
* `parsed`: The parsed response from the model provider (`null` if invalid JSON).
##### `variant_name`
* **Type:** string
The name of the variant used for the inference.
##### `usage`
* **Type:** object (optional)
The usage metrics for the inference.
The object has the following fields:
* `input_tokens`: The number of input tokens used for the inference.
* `output_tokens`: The number of output tokens used for the inference.
##### `raw_usage`
* **Type:** array (optional, only when `include_raw_usage` is `true`)
An array of raw provider-specific usage data. Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data` (optional): The raw usage object from the provider. The field is optional because some providers don't return usage.
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
##### `episode_id`
* **Type:** UUID
The ID of the episode associated with the inference.
##### `inference_id`
* **Type:** UUID
The ID assigned to the inference.
##### `raw`
* **Type:** string
The raw response delta from the model provider.
The TensorZero Gateway does not provide a `parsed` field for streaming JSON inferences.
If your application depends on a well-formed JSON response, we recommend using regular (non-streaming) inference.
##### `variant_name`
* **Type:** string
The name of the variant used for the inference.
##### `usage`
* **Type:** object (optional)
The usage metrics for the inference.
The object has the following fields:
* `input_tokens`: The number of input tokens used for the inference.
* `output_tokens`: The number of output tokens used for the inference.
##### `raw_usage`
* **Type:** array (optional, only when `include_raw_usage` is `true`)
An array of raw provider-specific usage data. Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data` (optional): The raw usage object from the provider. The field is optional because some providers don't return usage.
##### `raw_response`
* **Type:** array (optional, only when `include_raw_response` is `true`)
An array of raw provider-specific response data from model inferences that occurred before the current streaming inference (e.g., candidate inferences in `experimental_best_of_n_sampling`). This appears in early chunks of the stream.
Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data`: The raw response string from the provider.
##### `raw_chunk`
* **Type:** string (optional, only when `include_raw_response` is `true`)
The raw chunk from the current streaming model inference. This is included in content-bearing chunks (typically all chunks except the first metadata chunk and final usage-only chunk).
### Examples
#### Chat Function
##### Configuration
```toml mark="draft_email" theme={null}
# tensorzero.toml
# ...
[functions.draft_email]
type = "chat"
# ...
```
##### Request
```python frame="code" title="POST /inference" mark="draft_email" theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
result = await client.inference(
function_name="draft_email",
input={
"system": "You are an AI assistant...",
"messages": [
{
"role": "user",
"content": "I need to write an email to Gabriel explaining..."
}
]
}
# optional: stream=True,
)
```
```bash frame="code" title="POST /inference" mark="draft_email" theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "draft_email",
"input": {
"system": "You are an AI assistant...",
"messages": [
{
"role": "user",
"content": "I need to write an email to Gabriel explaining..."
}
]
}
// optional: "stream": true
}'
```
##### Response
```json frame="code" title="POST /inference" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "text",
"text": "Hi Gabriel,\n\nI noticed...",
}
]
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
```json frame="code" title="POST /inference" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "text",
"id": "0",
"text": "Hi Gabriel," // a text content delta
}
],
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
#### Chat Function with Schemas
##### Configuration
```toml mark="draft_email" theme={null}
# tensorzero.toml
# ...
[functions.draft_email]
type = "chat"
system_schema = "system_schema.json"
user_schema = "user_schema.json"
# ...
```
```json /"(tone)":/ theme={null}
// system_schema.json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"tone": {
"type": "string"
}
},
"required": ["tone"],
"additionalProperties": false
}
```
```json /"(recipient)":/ /"(email_purpose)":/ theme={null}
// user_schema.json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"recipient": {
"type": "string"
},
"email_purpose": {
"type": "string"
}
},
"required": ["recipient", "email_purpose"],
"additionalProperties": false
}
```
##### Request
```python frame="code" title="POST /inference" mark="draft_email" mark="tone" mark="recipient" mark="email_purpose" theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
result = await client.inference(
function_name="draft_email",
input={
"system": {"tone": "casual"},
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"arguments": {
"recipient": "Gabriel",
"email_purpose": "Request a meeting to..."
}
}
]
}
]
}
# optional: stream=True,
)
```
```bash frame="code" title="POST /inference" mark="draft_email" mark="tone" mark="recipient" mark="email_purpose" theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "draft_email",
"input": {
"system": {"tone": "casual"},
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"arguments": {
"recipient": "Gabriel",
"email_purpose": "Request a meeting to..."
}
}
]
}
]
}
// optional: "stream": true
}'
```
##### Response
```json frame="code" title="POST /inference" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "text",
"text": "Hi Gabriel,\n\nI noticed...",
}
]
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
```json frame="code" title="POST /inference" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "text",
"id": "0",
"text": "Hi Gabriel," // a text content delta
}
],
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
#### Chat Function with Tool Use
##### Configuration
```toml "weather_bot" /"(get_temperature)"/ /(get_temperature)]/ theme={null}
# tensorzero.toml
# ...
[functions.weather_bot]
type = "chat"
tools = ["get_temperature"]
# ...
[tools.get_temperature]
description = "Get the current temperature in a given location"
parameters = "get_temperature.json"
# ...
```
```json theme={null}
// get_temperature.json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to get the temperature for (e.g. \"New York\")"
},
"units": {
"type": "string",
"description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
"enum": ["fahrenheit", "celsius"]
}
},
"required": ["location"],
"additionalProperties": false
}
```
##### Request
```python frame="code" title="POST /inference" mark="weather_bot" theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
result = await client.inference(
function_name="weather_bot",
input={
"messages": [
{
"role": "user",
"content": "What is the weather like in Tokyo?"
}
]
}
# optional: stream=True,
)
```
```bash frame="code" title="POST /inference" mark="weather_bot" theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "weather_bot",
"input": {
"messages": [
{
"role": "user",
"content": "What is the weather like in Tokyo?"
}
]
}
// optional: "stream": true
}'
```
##### Response
```json frame="code" title="POST /inference" mark="get_temperature" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "tool_call",
"arguments": {
"location": "Tokyo",
"units": "celsius"
},
"id": "123456789",
"name": "get_temperature",
"raw_arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}",
"raw_name": "get_temperature"
}
],
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
```json frame="code" title="POST /inference" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "tool_call",
"id": "123456789",
"name": "get_temperature",
"arguments": "{\"location\":" // a tool arguments delta
}
],
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
#### Chat Function with Multi-Turn Tool Use
##### Configuration
```toml "weather_bot" /"(get_temperature)"/ /(get_temperature)]/ theme={null}
# tensorzero.toml
# ...
[functions.weather_bot]
type = "chat"
tools = ["get_temperature"]
# ...
[tools.get_temperature]
description = "Get the current temperature in a given location"
parameters = "get_temperature.json"
# ...
```
```json theme={null}
// get_temperature.json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to get the temperature for (e.g. \"New York\")"
},
"units": {
"type": "string",
"description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
"enum": ["fahrenheit", "celsius"]
}
},
"required": ["location"],
"additionalProperties": false
}
```
##### Request
```python frame="code" title="POST /inference" mark="weather_bot" mark="123456789" theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
result = await client.inference(
function_name="weather_bot",
input={
"messages": [
{
"role": "user",
"content": "What is the weather like in Tokyo?"
},
{
"role": "assistant",
"content": [
{
"type": "tool_call",
"arguments": {
"location": "Tokyo",
"units": "celsius"
},
"id": "123456789",
"name": "get_temperature",
}
]
},
{
"role": "user",
"content": [
{
"type": "tool_result",
"id": "123456789",
"name": "get_temperature",
"result": "25" # the tool result must be a string
}
]
}
]
}
# optional: stream=True,
)
```
```bash frame="code" title="POST /inference" mark="weather_bot" mark="123456789" theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "weather_bot",
"input": {
"messages": [
{
"role": "user",
"content": "What is the weather like in Tokyo?"
},
{
"role": "assistant",
"content": [
{
"type": "tool_call",
"arguments": {
"location": "Tokyo",
"units": "celsius"
},
"id": "123456789",
"name": "get_temperature",
}
]
},
{
"role": "user",
"content": [
{
"type": "tool_result",
"id": "123456789",
"name": "get_temperature",
"result": "25" // the tool result must be a string
}
]
}
]
}
// optional: "stream": true
}'
```
##### Response
```json frame="code" title="POST /inference" mark="get_temperature" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
    {
      "type": "text",
      "text": "The weather in Tokyo is 25 degrees Celsius."
    }
],
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
```json frame="code" title="POST /inference" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "text",
"id": "0",
"text": "The weather in" // a text content delta
}
],
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
#### Chat Function with Dynamic Tool Use
##### Configuration
```toml "weather_bot" theme={null}
# tensorzero.toml
# ...
[functions.weather_bot]
type = "chat"
# Note: no `tools = ["get_temperature"]` field in configuration
# ...
```
##### Request
```python frame="code" title="POST /inference" mark="weather_bot" theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
result = await client.inference(
function_name="weather_bot",
input={
"messages": [
{
"role": "user",
"content": "What is the weather like in Tokyo?"
}
]
},
additional_tools=[
{
"name": "get_temperature",
"description": "Get the current temperature in a given location",
"parameters": {
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to get the temperature for (e.g. \"New York\")"
},
"units": {
"type": "string",
"description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
"enum": ["fahrenheit", "celsius"]
}
},
"required": ["location"],
"additionalProperties": false
}
}
],
# optional: stream=True,
)
```
```bash frame="code" title="POST /inference" mark="weather_bot" theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "weather_bot",
  "input": {
"messages": [
{
"role": "user",
"content": "What is the weather like in Tokyo?"
}
]
},
  "additional_tools": [
{
"name": "get_temperature",
"description": "Get the current temperature in a given location",
"parameters": {
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to get the temperature for (e.g. \"New York\")"
},
"units": {
"type": "string",
"description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
"enum": ["fahrenheit", "celsius"]
}
},
"required": ["location"],
"additionalProperties": false
}
}
]
// optional: "stream": true
}'
```
##### Response
```json frame="code" title="POST /inference" mark="get_temperature" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "tool_call",
"arguments": {
"location": "Tokyo",
"units": "celsius"
},
"id": "123456789",
"name": "get_temperature",
"raw_arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}",
"raw_name": "get_temperature"
}
],
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
```json frame="code" title="POST /inference" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "tool_call",
"id": "123456789",
"name": "get_temperature",
"arguments": "{\"location\":" // a tool arguments delta
}
],
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
#### Chat Function with Dynamic Inference Parameters
##### Configuration
```toml mark="draft_email" mark="temperature" mark="chat_completion" theme={null}
# tensorzero.toml
# ...
[functions.draft_email]
type = "chat"
# ...
[functions.draft_email.variants.prompt_v1]
type = "chat_completion"
temperature = 0.5 # the API request will override this value
# ...
```
##### Request
```python frame="code" title="POST /inference" mark="draft_email" mark="temperature" mark="chat_completion" theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
result = await client.inference(
function_name="draft_email",
input={
"system": "You are an AI assistant...",
"messages": [
{
"role": "user",
"content": "I need to write an email to Gabriel explaining..."
}
]
},
# Override parameters for every variant with type "chat_completion"
params={
"chat_completion": {
"temperature": 0.7,
}
},
# optional: stream=True,
)
```
```bash frame="code" title="POST /inference" mark="draft_email" mark="temperature" mark="chat_completion" theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "draft_email",
"input": {
"system": "You are an AI assistant...",
"messages": [
{
"role": "user",
"content": "I need to write an email to Gabriel explaining..."
}
]
},
  "params": {
    // Override parameters for every variant with type "chat_completion"
    "chat_completion": {
      "temperature": 0.7
    }
  }
// optional: "stream": true
}'
```
##### Response
```json frame="code" title="POST /inference" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "text",
"text": "Hi Gabriel,\n\nI noticed...",
}
]
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
```json frame="code" title="POST /inference" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
    {
      "type": "text",
      "id": "0",
      "text": "Hi Gabriel," // a text content delta
    }
],
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
#### JSON Function
##### Configuration
```toml mark="extract_email" theme={null}
# tensorzero.toml
# ...
[functions.extract_email]
type = "json"
output_schema = "output_schema.json"
# ...
```
```json frame="code" mark="email" theme={null}
// output_schema.json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"email": {
"type": "string"
}
},
"required": ["email"]
}
```
##### Request
```python frame="code" title="POST /inference" mark="extract_email" theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
result = await client.inference(
function_name="extract_email",
input={
"system": "You are an AI assistant...",
"messages": [
{
"role": "user",
"content": "...blah blah blah hello@tensorzero.com blah blah blah..."
}
]
}
# optional: stream=True,
)
```
```bash frame="code" title="POST /inference" mark="extract_email" theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "extract_email",
"input": {
"system": "You are an AI assistant...",
"messages": [
{
"role": "user",
"content": "...blah blah blah hello@tensorzero.com blah blah blah..."
}
]
}
// optional: "stream": true
}'
```
##### Response
```json frame="code" title="POST /inference" mark="email" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"output": {
"raw": "{\"email\": \"hello@tensorzero.com\"}",
"parsed": {
"email": "hello@tensorzero.com"
}
  },
  "usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
```json frame="code" title="POST /inference" mark="email" theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"raw": "{\"email\":", // a JSON content delta
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
# API Reference: Inference (OpenAI-Compatible)
Source: https://www.tensorzero.com/docs/gateway/api-reference/inference-openai-compatible
API reference for the `/openai/v1/chat/completions` endpoint.
## `POST /openai/v1/chat/completions`
The `/openai/v1/chat/completions` endpoint allows TensorZero users to make TensorZero inferences with the OpenAI client.
The gateway translates the OpenAI request parameters into the arguments expected by the `inference` endpoint and calls the same underlying implementation.
This endpoint supports most of the features supported by the `inference` endpoint, but there are some limitations.
Most notably, this endpoint doesn't support dynamic credentials, so they must be specified with a different method.
See the [API Reference for `POST /inference`](/gateway/api-reference/inference/) for more details on inference with the native TensorZero API.
### Request
The OpenAI-compatible inference endpoints translate the OpenAI request parameters into the arguments expected by the `inference` endpoint.
TensorZero-specific parameters are prefixed with `tensorzero::` (e.g. `tensorzero::episode_id`).
These fields should be provided as extra body parameters in the request body.
The gateway will use the credentials specified in the `tensorzero.toml` file.
In most cases, these credentials will be environment variables available to the TensorZero gateway — *not* your OpenAI client.
API keys sent from the OpenAI client will be ignored.
#### `tensorzero::cache_options`
* **Type:** object
* **Required:** no
Controls caching behavior for inference requests.
This object accepts two fields:
* `enabled` (string): The cache mode. Can be one of:
* `"write_only"` (default): Only write to cache but don't serve cached responses
* `"read_only"`: Only read from cache but don't write new entries
* `"on"`: Both read from and write to cache
* `"off"`: Disable caching completely
* `max_age_s` (integer or null): Maximum age in seconds for cache entries to be considered valid when reading from cache. Does not set a TTL for cache expiration. Default is `null` (no age limit).
When using the OpenAI client libraries, pass this parameter via `extra_body`.
See the [Inference Caching](/gateway/guides/inference-caching) guide for more details.
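For example, here is a minimal sketch with the OpenAI Python SDK, following the client setup used elsewhere on this page (`my_function` is a placeholder function name), that enables cache reads and writes with a one-hour age limit:
```python theme={null}
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/openai/v1",
    api_key="your_api_key",
)

response = client.chat.completions.create(
    model="tensorzero::function_name::my_function",  # placeholder function name
    messages=[
        {"role": "user", "content": "Tell me a fun fact."}
    ],
    extra_body={
        "tensorzero::cache_options": {
            "enabled": "on",    # read from and write to the cache
            "max_age_s": 3600   # only serve cache entries newer than one hour
        }
    },
)
```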
#### `tensorzero::credentials`
* **Type:** object (a map from dynamic credential names to API keys)
* **Required:** no (default: no credentials)
Each model provider in your TensorZero configuration can be configured to accept credentials at inference time by using the `dynamic` location (e.g. `dynamic::my_dynamic_api_key_name`).
See the [configuration reference](/gateway/configuration-reference/#modelsmodel_nameprovidersprovider_name) for more details.
The gateway expects the credentials to be provided in the `credentials` field of the request body as specified below.
The gateway will return a 400 error if the credentials are not provided and the model provider has been configured with dynamic credentials.
```toml theme={null}
[models.my_model_name.providers.my_provider_name]
# ...
# Note: the name of the credential field (e.g. `api_key_location`) depends on the provider type
api_key_location = "dynamic::my_dynamic_api_key_name"
# ...
```
```json theme={null}
{
// ...
"tensorzero::credentials": {
// ...
"my_dynamic_api_key_name": "sk-..."
// ...
}
// ...
}
```
#### `tensorzero::deny_unknown_fields`
* **Type:** boolean
* **Required:** no (default: `false`)
If `true`, the gateway will return an error if the request contains any unknown or unrecognized fields.
By default, unknown fields are ignored with a warning logged.
This field does not affect the `tensorzero::extra_body` field, only unknown fields at the root of the request body.
This field should be provided as an extra body parameter in the request body.
```python theme={null}
response = oai.chat.completions.create(
model="tensorzero::model_name::openai::gpt-5-mini",
messages=[
{
"role": "user",
"content": "Tell me a fun fact.",
}
],
extra_body={
"tensorzero::deny_unknown_fields": True,
},
ultrathink=True, # made-up parameter → `deny_unknown_fields` would reject this request
)
```
#### `tensorzero::dryrun`
* **Type:** boolean
* **Required:** no
If `true`, the inference request will be executed but won't be stored to the database.
The gateway will still call the downstream model providers.
This field is primarily for debugging and testing, and you should generally not use it in production.
This field should be provided as an extra body parameter in the request body.
#### `tensorzero::episode_id`
* **Type:** UUID
* **Required:** no
The ID of an existing episode to associate the inference with.
If null, the gateway will generate a new episode ID and return it in the response.
See [Episodes](/gateway/guides/episodes) for more information.
This field should be provided as an extra body parameter in the request body.
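For example, here is a minimal sketch of threading two inferences into the same episode (`my_function` is a placeholder function name). Since `episode_id` is a TensorZero extension field on the response, this sketch reads it from the deserialized payload:
```python theme={null}
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/openai/v1",
    api_key="your_api_key",
)

first = client.chat.completions.create(
    model="tensorzero::function_name::my_function",  # placeholder function name
    messages=[{"role": "user", "content": "Start drafting a plan for..."}],
)

# `episode_id` is a TensorZero extension field returned in the response body
episode_id = first.model_dump()["episode_id"]

second = client.chat.completions.create(
    model="tensorzero::function_name::my_function",
    messages=[{"role": "user", "content": "Continue the plan with..."}],
    extra_body={"tensorzero::episode_id": episode_id},
)
```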
#### `tensorzero::extra_body`
* **Type:** array of objects (see below)
* **Required:** no
The `tensorzero::extra_body` field allows you to modify the request body that TensorZero sends to a model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
The OpenAI SDKs generally also support such functionality.
If you use the OpenAI SDK's `extra_body` field, it will override the request from the client to the gateway.
If you use `tensorzero::extra_body`, it will override the request from the gateway to the model provider.
Each object in the array must have two or three fields:
* `pointer`: A [JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901) string specifying where to modify the request body
* Use `-` as the final path element to append to an array (e.g., `/messages/-` appends to `messages`)
* One of the following:
* `value`: The value to insert at that location; it can be of any type including nested types
* `delete = true`: Deletes the field at the specified location, if present.
* Optional: If one of the following is specified, the modification will only be applied to the specified variant, model, or model provider. If none is specified, the modification applies to all model inferences.
* `variant_name`
* `model_name`
* `model_name` and `provider_name`
You can also set `extra_body` in the configuration file.
The values provided at inference-time take priority over the values in the configuration file.
If TensorZero would normally send this request body to the provider...
```json theme={null}
{
"project": "tensorzero",
"safety_checks": {
"no_internet": false,
"no_agi": true
}
}
```
...then the following `extra_body` in the inference request...
```json theme={null}
{
// ...
"tensorzero::extra_body": [
{
"variant_name": "my_variant", // or "model_name": "my_model", "provider_name": "my_provider"
"pointer": "/agi",
"value": true
},
{
// No `variant_name` or `model_name`/`provider_name` specified, so it applies to all variants and providers
"pointer": "/safety_checks/no_agi",
"value": {
"bypass": "on"
}
}
]
}
```
...overrides the request body to:
```json theme={null}
{
"agi": true,
"project": "tensorzero",
"safety_checks": {
"no_internet": false,
"no_agi": {
"bypass": "on"
}
}
}
```
#### `tensorzero::extra_headers`
* **Type:** array of objects (see below)
* **Required:** no
The `tensorzero::extra_headers` field allows you to modify the request headers that TensorZero sends to a model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
The OpenAI SDKs generally also support such functionality.
If you use the OpenAI SDK's `extra_headers` field, it will override the request from the client to the gateway.
If you use `tensorzero::extra_headers`, it will override the request from the gateway to the model provider.
Each object in the array must have two or three fields:
* `name`: The name of the header to modify
* `value`: The value to set the header to
* Optional: If one of the following is specified, the modification will only be applied to the specified variant, model, or model provider. If none is specified, the modification applies to all model inferences.
* `variant_name`
* `model_name`
* `model_name` and `provider_name`
You can also set `extra_headers` in the configuration file.
The values provided at inference-time take priority over the values in the configuration file.
If TensorZero would normally send the following request headers to the provider...
```text theme={null}
Safety-Checks: on
```
...then the following `extra_headers`...
```json theme={null}
{
"extra_headers": [
{
"variant_name": "my_variant", // or "model_name": "my_model", "provider_name": "my_provider"
"name": "Safety-Checks",
"value": "off"
},
{
// No `variant_name` or `model_name`/`provider_name` specified, so it applies to all variants and providers
"name": "Intelligence-Level",
"value": "AGI"
}
]
}
```
...overrides the request headers so that `Safety-Checks` is set to `off` only for `my_variant`, while `Intelligence-Level: AGI` is applied globally to all variants and providers:
```text theme={null}
Safety-Checks: off
Intelligence-Level: AGI
```
#### `tensorzero::include_raw_response`
* **Type:** boolean
* **Required:** no
If `true`, the raw responses from all model inferences will be included in the response in the `tensorzero_raw_response` field as an array.
See `tensorzero_raw_response` in the [response](#response) section for more details.
#### `tensorzero::include_raw_usage`
* **Type:** boolean
* **Required:** no
If `true`, the response's `usage` object will include a `tensorzero_raw_usage` field containing an array of raw provider-specific usage data from each model inference.
This is useful for accessing provider-specific usage fields that TensorZero normalizes away, such as OpenAI's `reasoning_tokens` or Anthropic's `cache_read_input_tokens`.
For streaming requests, this requires `stream_options.include_usage` to be `true` (or omitted, in which case it will be automatically enabled).
See `tensorzero_raw_usage` in the [response](#response) section for more details.
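For example, here is a minimal sketch that requests both raw fields with the OpenAI Python SDK (`my_function` is a placeholder function name). Since the raw fields are TensorZero extensions, this sketch reads them from the deserialized payload:
```python theme={null}
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/openai/v1",
    api_key="your_api_key",
)

response = client.chat.completions.create(
    model="tensorzero::function_name::my_function",  # placeholder function name
    messages=[{"role": "user", "content": "Tell me a fun fact."}],
    extra_body={
        "tensorzero::include_raw_response": True,
        "tensorzero::include_raw_usage": True,
    },
)

data = response.model_dump()
print(data.get("tensorzero_raw_response"))  # raw provider responses, one entry per model inference
print(data.get("usage"))                    # includes `tensorzero_raw_usage` when the flag is set
```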
#### `tensorzero::params`
* **Type:** object
* **Required:** no
Allows you to override inference parameters dynamically at request time.
This field accepts an object with a `chat_completion` field containing any of the following parameters:
* `frequency_penalty` (float): Penalizes tokens based on their frequency
* `json_mode` (object): Controls JSON output formatting
* `max_tokens` (integer): Maximum number of tokens to generate
* `presence_penalty` (float): Penalizes tokens based on their presence
* `reasoning_effort` (string): Effort level for reasoning models
* `seed` (integer): Random seed for deterministic outputs
* `service_tier` (string): Service tier for the request
* `stop_sequences` (list of strings): Sequences that stop generation
* `temperature` (float): Controls randomness in the output
* `thinking_budget_tokens` (integer): Token budget for thinking/reasoning
* `top_p` (float): Nucleus sampling parameter
* `verbosity` (string): Output verbosity level
When using the OpenAI-compatible endpoint, values specified in `tensorzero::params` take precedence over parameters provided directly in the request body (e.g., top-level `temperature`, `max_tokens`) or inferred from other fields (e.g., `json_mode` inferred from `response_format`).
```python theme={null}
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3000/openai/v1",
api_key="your_api_key",
)
response = client.chat.completions.create(
model="tensorzero::function_name::my_function",
messages=[
{"role": "user", "content": "Explain quantum computing"}
],
extra_body={
"tensorzero::params": {
"chat_completion": {
"temperature": 0.7,
"max_tokens": 500,
"reasoning_effort": "high"
}
}
}
)
```
#### `tensorzero::provider_tools`
* **Type:** array of objects
* **Required:** no (default: `[]`)
A list of provider-specific built-in tools that can be used by the model during inference.
These are tools that run server-side on the provider's infrastructure, such as OpenAI's web search tool.
Each object in the array has the following fields:
* `scope` (object, optional): Limits which model/provider combination can use this tool. If omitted, the tool is available to all compatible providers.
* `model_name` (string): The model name as defined in your configuration
* `provider_name` (string, optional): The provider name for that model. If omitted, the tool is available to all providers for the specified model.
* `tool` (object, required): The provider-specific tool configuration as defined by the provider's API
When using OpenAI client libraries, pass this parameter via `extra_body`.
This field allows for dynamic provider tool configuration at runtime.
You should prefer to define provider tools in the configuration file if possible (see [Configuration Reference](/gateway/configuration-reference/#provider_tools)).
Only use this field if dynamic provider tool configuration is necessary for your use case.
```python theme={null}
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3000/openai/v1",
api_key="your_api_key",
)
response = client.chat.completions.create(
model="tensorzero::function_name::my_function",
messages=[
{"role": "user", "content": "What were the latest developments in AI this week?"}
],
extra_body={
"tensorzero::provider_tools": [
{
"tool": {
"type": "web_search"
}
}
]
}
)
```
This makes the web search tool available to all compatible providers configured for the function.
```python theme={null}
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3000/openai/v1",
api_key="your_api_key",
)
response = client.chat.completions.create(
model="tensorzero::function_name::my_function",
messages=[
{"role": "user", "content": "What were the latest developments in AI this week?"}
],
extra_body={
"tensorzero::provider_tools": [
{
"scope": {
"model_name": "gpt-5-mini",
"provider_name": "openai"
},
"tool": {
"type": "web_search"
}
}
]
}
)
```
This makes the web search tool available only to the OpenAI provider for the `gpt-5-mini` model.
#### `tensorzero::tags`
* **Type:** flat JSON object with string keys and values
* **Required:** no
User-provided tags to associate with the inference.
For example, `{"user_id": "123"}` or `{"author": "Alice"}`.
#### `frequency_penalty`
* **Type:** float
* **Required:** no (default: `null`)
Penalizes new tokens based on their frequency in the text so far if positive, encourages them if negative.
Overrides the `frequency_penalty` setting for any chat completion variants being used.
#### `max_completion_tokens`
* **Type:** integer
* **Required:** no (default: `null`)
Limits the number of tokens that can be generated by the model in a chat completion variant.
If both this and `max_tokens` are set, the smaller value is used.
#### `max_tokens`
* **Type:** integer
* **Required:** no (default: `null`)
Limits the number of tokens that can be generated by the model in a chat completion variant.
If both this and `max_completion_tokens` are set, the smaller value is used.
#### `messages`
* **Type:** list
* **Required:** yes
A list of messages to provide to the model.
Each message is an object with the following fields:
* `role` (required): The role of the message sender in an OpenAI message (`assistant`, `system`, `tool`, or `user`).
* `content` (required for `user` and `system` messages and optional for `assistant` and `tool` messages): The content of the message.
The content must be either a string or an array of content blocks (see below).
* `tool_calls` (optional for `assistant` messages, otherwise disallowed): A list of tool calls. Each tool call is an object with the following fields:
* `id`: A unique identifier for the tool call
* `type`: The type of tool being called (currently only `"function"` is supported)
* `function`: An object containing:
* `name`: The name of the function to call
* `arguments`: A JSON string containing the function arguments
* `tool_call_id` (required for `tool` messages, otherwise disallowed): The ID of the tool call to associate with the message. This should be one that was originally returned by the gateway in a tool call `id` field.
A content block is an object that can have type `text`, `image_url`, or TensorZero-specific types.
If the content block has type `text`, it must have either of the following additional fields:
* `text`: The text for the content block.
* `tensorzero::arguments`: A JSON object containing the function arguments for TensorZero functions with templates and schemas (see [Create a prompt template](/gateway/create-a-prompt-template) for details).
If a content block has type `image_url`, it must have the following additional fields:
* `"image_url"`: A JSON object with the following fields:
* `url`: The URL for a remote image (e.g. `"https://example.com/image.png"`) or base64-encoded data for an embedded image (e.g. `"data:image/png;base64,..."`).
* `detail` (optional): Controls the fidelity of image processing. Only applies to image files; ignored for other file types. Can be `low`, `high`, or `auto`. Affects token consumption and image quality.
If a content block has type `input_audio`, it must have the following additional field:
* `input_audio`: An object containing:
* `data`: Base64-encoded audio data (without a `data:` prefix or MIME type header).
* `format`: The audio format as a string (e.g., `"mp3"`, `"wav"`). Note: The MIME type is detected from the actual audio bytes, and a warning is logged if the detected type differs from this field.
The TensorZero-specific content block types are:
* `tensorzero::raw_text`: Bypasses templates and schemas, sending text directly to the model. Useful for testing prompts or dynamic injection without configuration changes. Must have a `value` field containing the text.
* `tensorzero::template`: Explicitly specify a template to use. Must have `name` and `arguments` fields.
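For example, here is a minimal sketch of a user message that fills a templated TensorZero function via `tensorzero::arguments`, reusing the `draft_email` function and its `recipient`/`email_purpose` schema from the native API examples above (an assumed configuration):
```python theme={null}
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/openai/v1",
    api_key="your_api_key",
)

response = client.chat.completions.create(
    model="tensorzero::function_name::draft_email",  # assumes the draft_email function with schemas shown above
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # Fills the function's user template instead of sending raw text
                    "type": "text",
                    "tensorzero::arguments": {
                        "recipient": "Gabriel",
                        "email_purpose": "Request a meeting to...",
                    },
                }
            ],
        }
    ],
)
```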
#### `model`
* **Type:** string
* **Required:** yes
The name of the TensorZero function or model being called, with the appropriate prefix.
| To call... | Use this format... |
| --- | --- |
| A function defined as `[functions.my_function]` in your `tensorzero.toml` configuration file | `tensorzero::function_name::my_function` |
| A model defined as `[models.my_model]` in your `tensorzero.toml` configuration file | `tensorzero::model_name::my_model` |
| A model offered by a model provider, without defining it in your `tensorzero.toml` configuration file (if supported, see below) | `tensorzero::model_name::{provider_type}::{model_name}` |
The following model providers support short-hand model names: `anthropic`, `deepseek`, `fireworks`, `gcp_vertex_anthropic`, `gcp_vertex_gemini`, `google_ai_studio_gemini`, `groq`, `hyperbolic`, `mistral`, `openai`, `openrouter`, `together`, and `xai`.
For example, if you have the following configuration:
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o]
routing = ["openai", "azure"]
[models.gpt-4o.providers.openai]
# ...
[models.gpt-4o.providers.azure]
# ...
[functions.extract-data]
# ...
```
Then:
* `tensorzero::function_name::extract-data` calls the `extract-data` function defined above.
* `tensorzero::model_name::gpt-4o` calls the `gpt-4o` model in your configuration, which supports fallback from `openai` to `azure`. See [Retries & Fallbacks](/gateway/guides/retries-fallbacks/) for details.
* `tensorzero::model_name::openai::gpt-4o` calls the OpenAI API directly for the `gpt-4o` model, ignoring the `gpt-4o` model defined above.
Be careful about the different prefixes: `tensorzero::model_name::gpt-4o` will use the `[models.gpt-4o]` model defined in the `tensorzero.toml` file, whereas `tensorzero::model_name::openai::gpt-4o` will call the OpenAI API directly for the `gpt-4o` model.
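As a sketch, here is how those three `model` strings look from the OpenAI Python SDK (client setup as in the other examples on this page):
```python theme={null}
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/openai/v1",
    api_key="your_api_key",
)

messages = [{"role": "user", "content": "Extract the data from..."}]

# Calls the `extract-data` function (its variants and templates apply)
client.chat.completions.create(model="tensorzero::function_name::extract-data", messages=messages)

# Calls the `gpt-4o` model defined in tensorzero.toml (with fallback from openai to azure)
client.chat.completions.create(model="tensorzero::model_name::gpt-4o", messages=messages)

# Calls the OpenAI API directly for gpt-4o, ignoring the `[models.gpt-4o]` entry
client.chat.completions.create(model="tensorzero::model_name::openai::gpt-4o", messages=messages)
```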
#### `parallel_tool_calls`
* **Type:** boolean
* **Required:** no (default: `null`)
Overrides the `parallel_tool_calls` setting for the function being called.
#### `presence_penalty`
* **Type:** float
* **Required:** no (default: `null`)
Penalizes new tokens based on whether they appear in the text so far if positive, encourages them if negative.
Overrides the `presence_penalty` setting for any chat completion variants being used.
#### `response_format`
* **Type:** either a string or an object
* **Required:** no (default: `null`)
Supported values are `"text"`, `"json_object"`, and `{"type": "json_schema", "schema": ...}`, where the `schema` field contains a valid JSON schema.
This field is only respected for the `json_schema` variant, in which case the `schema` field can be used to dynamically set the output schema for a `json` function.
#### `seed`
* **Type:** integer
* **Required:** no (default: `null`)
Overrides the `seed` setting for any chat completion variants being used.
#### `stop_sequences`
* **Type:** list of strings
* **Required:** no (default: `null`)
Overrides the `stop_sequences` setting for any chat completion variants being used.
#### `stream`
* **Type:** boolean
* **Required:** no (default: `false`)
If true, the gateway will stream the response to the client in an OpenAI-compatible format.
#### `stream_options`
* **Type:** object with field `"include_usage"`
* **Required:** no (default: `null`)
If `"include_usage"` is `true`, the gateway will include usage information in the response.
If the following `stream_options` is provided...
```json theme={null}
{
...
"stream_options": {
"include_usage": true
}
...
}
```
...then the gateway will include usage information in the response.
```json theme={null}
{
...
"usage": {
"prompt_tokens": 123,
"completion_tokens": 456,
"total_tokens": 579
}
  ...
}
```
#### `temperature`
* **Type:** float
* **Required:** no (default: `null`)
Overrides the `temperature` setting for any chat completion variants being used.
#### `tools`
* **Type:** list of `tool` objects (see below)
* **Required:** no (default: `null`)
Allows the user to dynamically specify tools at inference time in addition to those that are specified in the configuration.
##### Function Tools
Function tools are the typical tools used with LLMs.
Each function tool object has the following structure:
* **`type`**: Must be `"function"`
* **`function`**: An object containing:
* **`name`**: The name of the function (string, required)
* **`description`**: A description of what the function does (string, optional)
* **`parameters`**: A JSON Schema object describing the function's parameters (required)
* **`strict`**: Whether to enforce strict schema validation (boolean, defaults to false)
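For example, here is a minimal sketch of passing a function tool at inference time, reusing the `get_temperature` tool from the native API examples above and assuming a `weather_bot` function with no statically configured tools (as in the dynamic tool use example):
```python theme={null}
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/openai/v1",
    api_key="your_api_key",
)

response = client.chat.completions.create(
    model="tensorzero::function_name::weather_bot",  # assumed function without configured tools
    messages=[
        {"role": "user", "content": "What is the weather like in Tokyo?"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_temperature",
                "description": "Get the current temperature in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"},
                        "units": {"type": "string", "enum": ["fahrenheit", "celsius"]},
                    },
                    "required": ["location"],
                    "additionalProperties": False,
                },
            },
        }
    ],
)
```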
##### OpenAI Custom Tools
OpenAI custom tools are only supported by OpenAI models (both Chat Completions and Responses APIs).
Using custom tools with other providers will result in an error.
OpenAI custom tools support alternative output formats beyond JSON Schema, such as freeform text or grammar-constrained output.
Each custom tool object has the following structure:
* **`type`**: Must be `"custom"`
* **`custom`**: An object containing:
* **`name`**: The name of the tool (string, required)
* **`description`**: A description of what the tool does (string, optional)
* **`format`**: The output format for the tool (object, optional):
* `{"type": "text"}`: Freeform text output
* `{"type": "grammar", "grammar": {"syntax": "lark", "definition": "..."}}`: Output constrained by a [Lark grammar](https://lark-parser.readthedocs.io/)
* `{"type": "grammar", "grammar": {"syntax": "regex", "definition": "..."}}`: Output constrained by a regular expression
```python theme={null}
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3000/openai/v1",
api_key="your_api_key",
)
response = client.chat.completions.create(
model="tensorzero::model_name::openai::gpt-5-mini",
messages=[
{"role": "user", "content": "Generate Python code to print 'Hello, World!'"}
],
tools=[
{
"type": "custom",
"custom": {
"name": "code_generator",
"description": "Generates Python code snippets",
"format": {"type": "text"}
}
}
],
)
```
#### `tool_choice`
* **Type:** string or object
* **Required:** no (default: `"none"` if no tools are present, `"auto"` if tools are present)
Controls which (if any) tool is called by the model by overriding the value in configuration. Supported values:
* `"none"`: The model will not call any tool and instead generates a message
* `"auto"`: The model can pick between generating a message or calling one or more tools
* `"required"`: The model must call one or more tools
* `{"type": "function", "function": {"name": "my_function"}}`: Forces the model to call the specified tool
* `{"type": "allowed_tools", "allowed_tools": {"tools": [...], "mode": "auto"|"required"}}`: Restricts which tools can be called
#### `top_p`
* **Type:** float
* **Required:** no (default: `null`)
Overrides the `top_p` setting for any chat completion variants being used.
#### `tensorzero::variant_name`
* **Type:** string
* **Required:** no
If set, pins the inference request to a particular variant (not recommended).
You should generally not set this field, and instead let the TensorZero gateway assign a variant.
This field is primarily used for testing or debugging purposes.
This field should be provided as an extra body parameter in the request body.
### Response
In regular (non-streaming) mode, the response is a JSON object with the following fields:
#### `choices`
* **Type:** list of `choice` objects, where each choice contains:
* **`index`**: A zero-based index indicating the choice's position in the list (integer)
* **`finish_reason`**: Always `"stop"`.
* **`message`**: An object containing:
* **`content`**: The message content (string, optional)
* **`tool_calls`**: List of tool calls made by the model (optional). The format is the same as in the request.
* **`role`**: The role of the message sender (always `"assistant"`).
The OpenAI-compatible inference endpoint can't handle unknown content blocks in the response.
If the model provider returns an unknown content block, the gateway will drop the content block from the response and log a warning.
If you need to access unknown content blocks, use the native TensorZero API.
See the [Inference API Reference](/gateway/api-reference/inference/) for details.
#### `created`
* **Type:** integer
The Unix timestamp (in seconds) of when the inference was created.
#### `episode_id`
* **Type:** UUID
The ID of the episode that the inference was created for.
#### `id`
* **Type:** UUID
The inference ID.
#### `model`
* **Type:** string
The name of the variant that was actually used for the inference.
#### `object`
* **Type:** string
The type of the inference object (always `"chat.completion"`).
#### `system_fingerprint`
* **Type:** string
Always ""
#### `usage`
* **Type:** object
Contains token usage information for the request and response, with the following fields:
* **`prompt_tokens`**: Number of tokens in the prompt (integer)
* **`completion_tokens`**: Number of tokens in the completion (integer)
* **`total_tokens`**: Total number of tokens used (integer)
#### `tensorzero_raw_response`
* **Type:** array (optional, only when `tensorzero::include_raw_response` is `true`)
An array of raw provider-specific response data from all model inferences. Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `data`: The raw response string from the provider.
For complex variants like `experimental_best_of_n_sampling`, this includes raw responses from all candidate inferences as well as the evaluator/fuser inference.
#### `tensorzero_raw_usage`
* **Type:** array (optional, only when `tensorzero::include_raw_usage` is `true`)
An array of raw provider-specific usage data. Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data` (optional): The raw usage object from the provider. The field is optional because some providers don't return usage.
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
#### `choices`
* **Type:** list
A list of choices from the model, where each choice contains:
* `index`: The index of the choice (integer)
* `finish_reason`: always ""
* `delta`: An object containing either:
* `content`: The next piece of generated text (string), or
* `tool_calls`: A list of tool calls, each containing the next piece of the tool call being generated
#### `created`
* **Type:** integer
The Unix timestamp (in seconds) of when the inference was created.
#### `episode_id`
* **Type:** UUID
The ID of the episode that the inference was created for.
#### `id`
* **Type:** UUID
The inference ID.
#### `model`
* **Type:** string
The name of the variant that was actually used for the inference.
#### `object`
* **Type:** string
The type of the inference object (always `"chat.completion"`).
#### `system_fingerprint`
* **Type:** string
Always ""
#### `usage`
* **Type:** object
* **Required:** no
Contains token usage information for the request and response, with the following fields:
* **`prompt_tokens`**: Number of tokens in the prompt (integer)
* **`completion_tokens`**: Number of tokens in the completion (integer)
* **`total_tokens`**: Total number of tokens used (integer)
#### `tensorzero_raw_response`
* **Type:** array (optional, only when `tensorzero::include_raw_response` is `true`)
An array of raw provider-specific response data from previous model inferences (e.g., best-of-n candidates). Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `data`: The raw response string from the provider.
This field is typically emitted in an early chunk of the stream and contains raw responses from model inferences that occurred before the current streaming inference (e.g., candidate inferences in `experimental_best_of_n_sampling`).
#### `tensorzero_raw_chunk`
* **Type:** string (optional, only when `tensorzero::include_raw_response` is `true`)
The raw chunk from the model provider as a JSON string for the current streaming inference.
#### `tensorzero_raw_usage`
* **Type:** array (optional, only when `tensorzero::include_raw_usage` is `true`)
An array of raw provider-specific usage data. Each entry contains:
* `model_inference_id`: UUID of the model inference.
* `provider_type`: The provider type (e.g., `"openai"`, `"anthropic"`).
* `api_type`: The API type (`"chat_completions"`, `"responses"`, or `"embeddings"`).
* `data` (optional): The raw usage object from the provider. The field is optional because some providers don't return usage.
### Examples
#### Chat Function with Structured System Prompt
##### Configuration
```toml mark="draft_email" theme={null}
// tensorzero.toml
# ...
[functions.draft_email]
type = "chat"
system_schema = "functions/draft_email/system_schema.json"
# ...
```
```json theme={null}
// functions/draft_email/system_schema.json
{
"type": "object",
"properties": {
"assistant_name": { "type": "string" }
}
}
```
##### Request
```python frame="code" title="POST /inference" mark="draft_email" theme={null}
from openai import AsyncOpenAI
async with AsyncOpenAI(
base_url="http://localhost:3000/openai/v1"
) as client:
result = await client.chat.completions.create(
# there already was an episode_id from an earlier inference
extra_body={"tensorzero::episode_id": str(episode_id)},
messages=[
{
"role": "system",
"content": [{"assistant_name": "Alfred Pennyworth"}]
# NOTE: the JSON is in an array here so that a structured system message can be sent
},
{
"role": "user",
"content": "I need to write an email to Gabriel explaining..."
}
],
model="tensorzero::function_name::draft_email",
temperature=0.4,
# Optional: stream=True
)
```
```bash frame="code" title="POST /inference" mark="draft_email" theme={null}
curl -X POST http://localhost:3000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "tensorzero::episode_id": "your_episode_id_here",
    "messages": [
      {
        "role": "system",
        "content": [{"assistant_name": "Alfred Pennyworth"}]
      },
      {
        "role": "user",
        "content": "I need to write an email to Gabriel explaining..."
      }
    ],
    "model": "tensorzero::function_name::draft_email",
    "temperature": 0.4
    // Optional: "stream": true
  }'
```
##### Response
```json frame="code" title="POST /inference" theme={null}
{
"id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"model": "email_draft_variant",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"content": "Hi Gabriel,\n\nI noticed...",
"role": "assistant"
}
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 100,
"total_tokens": 200
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
For example, a chunk might look like this:
```json frame="code" title="POST /inference" theme={null}
{
"id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"model": "email_draft_variant",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"delta": {
"content": "Hi Gabriel,\n\nI noticed..."
}
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 100,
"total_tokens": 200
}
}
```
#### Chat Function with Dynamic Tool Use
##### Configuration
```toml "weather_bot" theme={null}
// tensorzero.toml
# ...
[functions.weather_bot]
type = "chat"
# Note: no `tools = ["get_temperature"]` field in configuration
# ...
```
##### Request
```python frame="code" title="POST /inference" mark="weather_bot" theme={null}
from openai import AsyncOpenAI
async with AsyncOpenAI(
    base_url="http://localhost:3000/openai/v1"
) as client:
    result = await client.chat.completions.create(
        model="tensorzero::function_name::weather_bot",
        messages=[
            {
                "role": "user",
                "content": "What is the weather like in Tokyo?"
            }
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "get_temperature",
                    "description": "Get the current temperature in a given location",
                    "parameters": {
                        "$schema": "http://json-schema.org/draft-07/schema#",
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The location to get the temperature for (e.g. \"New York\")"
                            },
                            "units": {
                                "type": "string",
                                "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
                                "enum": ["fahrenheit", "celsius"]
                            }
                        },
                        "required": ["location"],
                        "additionalProperties": False
                    }
                }
            }
        ],
        # optional: stream=True,
    )
```
```bash frame="code" title="POST /inference" mark="weather_bot" theme={null}
curl -X POST http://localhost:3000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tensorzero::function_name::weather_bot",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in Tokyo?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_temperature",
          "description": "Get the current temperature in a given location",
          "parameters": {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The location to get the temperature for (e.g. \"New York\")"
              },
              "units": {
                "type": "string",
                "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
                "enum": ["fahrenheit", "celsius"]
              }
            },
            "required": ["location"],
            "additionalProperties": false
          }
        }
      }
    ]
    // optional: "stream": true
  }'
```
##### Response
```json frame="code" title="POST /inference" mark="get_temperature" theme={null}
{
"id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"model": "weather_bot_variant",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"content": null,
"tool_calls": [
{
"id": "123456789",
"type": "function",
"function": {
"name": "get_temperature",
"arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}"
}
}
],
"role": "assistant"
}
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 100,
"total_tokens": 200
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
For example, a chunk might look like this:
```json frame="code" title="POST /inference" theme={null}
{
"id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"model": "weather_bot_variant",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"content": null,
"tool_calls": [
{
"id": "123456789",
"type": "function",
"function": {
"name": "get_temperature",
"arguments": "{\"location\":" // a tool arguments delta
}
}
]
}
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 100,
"total_tokens": 200
}
}
```
#### JSON Function with Dynamic Output Schema
##### Configuration
```toml mark="extract_email" theme={null}
// tensorzero.toml
# ...
[functions.extract_email]
type = "json"
output_schema = "output_schema.json"
# ...
```
```json frame="code" mark="email" theme={null}
// output_schema.json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"email": {
"type": "string"
}
},
"required": ["email"]
}
```
##### Request
```python frame="code" title="POST /inference" mark="extract_email" theme={null}
from openai import AsyncOpenAI
dynamic_output_schema = {
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"email": { "type": "string" },
"domain": { "type": "string" }
},
"required": ["email", "domain"]
}
async with AsyncOpenAI(
base_url="http://localhost:3000/openai/v1"
) as client:
    result = await client.chat.completions.create(
        model="tensorzero::function_name::extract_email",
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant..."
            },
            {
                "role": "user",
                "content": "...blah blah blah hello@tensorzero.com blah blah blah..."
            }
        ],
        # Override the output schema using the `response_format` field
        response_format={"type": "json_schema", "schema": dynamic_output_schema},
        # optional: stream=True,
    )
```
```bash frame="code" title="POST /inference" mark="extract_email" theme={null}
curl -X POST http://localhost:3000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tensorzero::function_name::extract_email",
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant..."
      },
      {
        "role": "user",
        "content": "...blah blah blah hello@tensorzero.com blah blah blah..."
      }
    ],
    "response_format": {
      "type": "json_schema",
      "schema": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
          "email": { "type": "string" },
          "domain": { "type": "string" }
        },
        "required": ["email", "domain"]
      }
    }
    // optional: "stream": true
  }'
```
##### Response
```json frame="code" title="POST /inference" mark="email" theme={null}
{
"id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"model": "extract_email_variant",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"content": "{\"email\": \"hello@tensorzero.com\", \"domain\": \"tensorzero.com\"}"
}
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 100,
"total_tokens": 200
}
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
For example, a chunk might look like this:
```json frame="code" title="POST /inference" mark="email" theme={null}
{
"id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"model": "extract_email_variant",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"content": "{\"email\":" // a JSON content delta
}
}
],
"usage": {
"prompt_tokens": 100,
"completion_tokens": 100,
"total_tokens": 200
}
}
```
# Benchmarks
Source: https://www.tensorzero.com/docs/gateway/benchmarks
Benchmarks for the TensorZero Gateway: sub-millisecond latency overhead under extreme load
The TensorZero Gateway was built from the ground up with performance in mind.
It's written in Rust and designed to handle extreme concurrency with sub-millisecond overhead.
See ["Optimize latency and throughput" guide](/deployment/optimize-latency-and-throughput/) for more details on maximizing performance in production settings.
## TensorZero Gateway vs. LiteLLM
* **TensorZero achieves sub-millisecond latency overhead even at 10,000 QPS.**
* **LiteLLM degrades at hundreds of QPS and fails entirely at 1,000 QPS.**
We benchmarked the TensorZero Gateway against the popular LiteLLM Proxy (LiteLLM Gateway).
On a `c7i.xlarge` instance on AWS (4 vCPUs, 8 GB RAM), LiteLLM fails when the load reaches 1,000 QPS, with the vast majority of requests timing out.
The TensorZero Gateway handles 10,000 QPS on the same instance with a 100% success rate and sub-millisecond latencies.
Even at low loads where LiteLLM is stable (100 QPS), TensorZero achieves significantly lower latencies while running at 10,000 QPS.
Building in Rust (TensorZero) led to consistent sub-millisecond latency overhead under extreme load, whereas Python (LiteLLM) becomes a bottleneck even at moderate loads.
### Latency Comparison
| Latency | LiteLLM Proxy (100 QPS) | LiteLLM Proxy (500 QPS) | LiteLLM Proxy (1,000 QPS) | TensorZero Gateway (10,000 QPS) |
| :-----: | :----------------------------: | :----------------------------: | :------------------------------: | :------------------------------------: |
| Mean | 4.91ms | 7.45ms | Failure | 0.37ms |
| 50% | 4.83ms | 5.81ms | Failure | 0.35ms |
| 90% | 5.26ms | 10.02ms | Failure | 0.50ms |
| 95% | 5.41ms | 13.40ms | Failure | 0.58ms |
| 99% | 5.87ms | 39.69ms | Failure | 0.94ms |
At 1,000 QPS, LiteLLM fails entirely with the vast majority of requests timing out, while TensorZero continues to operate smoothly even at 10x that load.
**Technical Notes:**
* We use a `c7i.xlarge` instance on AWS (4 vCPUs, 8 GB RAM) running Ubuntu 24.04.2 LTS.
* We use a mock OpenAI inference provider for both benchmarks.
* The load generator, both gateways, and the mock inference provider all run on the same instance.
* We configured `observability.enabled = false` (i.e. disabled logging inferences to ClickHouse) in the TensorZero Gateway to make the scenarios comparable. (Even then, the observability features run asynchronously in the background, so they wouldn't materially affect latency given a powerful enough ClickHouse deployment.)
* The most recent benchmark run was conducted on July 30, 2025. It used TensorZero `2025.5.7` and LiteLLM `1.74.9`.
Read more about the technical details and reproduction instructions [here](https://github.com/tensorzero/tensorzero/tree/main/gateway/benchmarks).
# How to call any LLM
Source: https://www.tensorzero.com/docs/gateway/call-any-llm
Learn how to call any LLM with a unified API using the TensorZero Gateway.
This page shows how to:
* **Call any LLM with the same API.** TensorZero unifies every major LLM API (e.g. OpenAI) and inference server (e.g. Ollama).
* **Get started with a few lines of code.** Later, you can optionally add observability, automatic fallbacks, A/B testing, and much more.
* **Use any programming language.** You can use TensorZero with its Python SDK, any OpenAI SDK (Python, Node, Go, etc.), or its HTTP API.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/gateway/call-any-llm) of this guide on GitHub.
The TensorZero Python SDK provides a unified API for calling any LLM.
For example, if you're using OpenAI, you can set the `OPENAI_API_KEY` environment variable with your API key.
```bash theme={null}
export OPENAI_API_KEY="sk-..."
```
See the [Integrations](/integrations/model-providers) page to learn how to set up credentials for other LLM providers.
You can install the TensorZero SDK with a Python package manager like `pip`.
```bash theme={null}
pip install tensorzero
```
Let's initialize the TensorZero Gateway.
For simplicity, we'll use an embedded gateway without observability or custom configuration.
```python theme={null}
from tensorzero import TensorZeroGateway
t0 = TensorZeroGateway.build_embedded()
```
The TensorZero Python SDK includes a synchronous `TensorZeroGateway` client and an asynchronous `AsyncTensorZeroGateway` client.
Both options support running the gateway embedded in your application with `build_embedded` or connecting to a standalone gateway with `build_http`.
See [Clients](/gateway/clients/) for more details.
```python theme={null}
response = t0.inference(
model_name="openai::gpt-5-mini",
# or: model="anthropic::claude-sonnet-4-20250514"
# or: Google, AWS, Azure, xAI, vLLM, Ollama, and many more
input={
"messages": [
{
"role": "user",
"content": "Tell me a fun fact.",
}
]
},
)
```
```python theme={null}
ChatInferenceResponse(
inference_id=UUID('0198d339-be77-74e0-b522-e08ec12d3831'),
episode_id=UUID('0198d339-be77-74e0-b522-e09f578f34d0'),
variant_name='openai::gpt-5-mini',
content=[
Text(
text='Fun fact: Botanically, bananas are berries but strawberries are not. \n\nA true berry develops from a single ovary and has seeds embedded in the flesh—bananas fit that definition. Strawberries are "aggregate accessory fruits": the tiny seeds on the outside are each from a separate ovary.',
arguments=None,
type='text'
)
],
usage=Usage(input_tokens=12, output_tokens=261),
finish_reason=FinishReason.STOP,
raw_response=None
)
```
See the [Inference API Reference](/gateway/api-reference/inference) for more details on the request and response formats.
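As noted above, the SDK also offers an asynchronous client and a standalone gateway mode; here's a minimal sketch combining both (assuming a standalone gateway running at `localhost:3000`):
```python theme={null}
import asyncio

from tensorzero import AsyncTensorZeroGateway

# Sketch: the same inference with the async client against a standalone gateway.
async def main():
    async with await AsyncTensorZeroGateway.build_http(
        gateway_url="http://localhost:3000"
    ) as t0:
        response = await t0.inference(
            model_name="openai::gpt-5-mini",
            input={
                "messages": [{"role": "user", "content": "Tell me a fun fact."}]
            },
        )
        print(response)

asyncio.run(main())
```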
The TensorZero Python SDK integrates with the OpenAI Python SDK to provide a unified API for calling any LLM.
For example, if you're using OpenAI, you can set the `OPENAI_API_KEY` environment variable with your API key.
```bash theme={null}
export OPENAI_API_KEY="sk-..."
```
See the [Integrations](/integrations/model-providers) page to learn how to set up credentials for other LLM providers.
You can install the OpenAI and TensorZero SDKs with a Python package manager like `pip`.
```bash theme={null}
pip install openai tensorzero
```
Let's initialize the TensorZero Gateway and patch the OpenAI client to use it.
For simplicity, we'll use an embedded gateway without observability or custom configuration.
```python theme={null}
from openai import OpenAI
from tensorzero import patch_openai_client
client = OpenAI()
patch_openai_client(client, async_setup=False)
```
The TensorZero Python SDK supports both the synchronous `OpenAI` client and the asynchronous `AsyncOpenAI` client.
Both options support running the gateway embedded in your application with `patch_openai_client` or connecting to a standalone gateway with `base_url`.
The embedded gateway supports synchronous initialization with `async_setup=False` or asynchronous initialization with `async_setup=True`.
See [Clients](/gateway/clients/) for more details.
```python theme={null}
response = client.chat.completions.create(
model="tensorzero::model_name::openai::gpt-5-mini",
# or: model="tensorzero::model_name::anthropic::claude-sonnet-4-20250514"
# or: Google, AWS, Azure, xAI, vLLM, Ollama, and many more
messages=[
{
"role": "user",
"content": "Tell me a fun fact.",
}
],
)
```
```python theme={null}
ChatCompletion(
id='0198d33f-24f6-7cc3-9dd0-62ba627b27db',
choices=[
Choice(
finish_reason='stop',
index=0,
logprobs=None,
message=ChatCompletionMessage(
content='Sure! Did you know that octopuses have three hearts? Two pump blood to the gills, while the third pumps it to the rest of the body. And, when an octopus swims, the heart that delivers blood to the body actually **stops beating**—which is why they prefer to crawl rather than swim!',
refusal=None,
role='assistant',
annotations=None,
audio=None,
function_call=None,
tool_calls=[]
)
)
],
created=1755890789,
model='tensorzero::model_name::openai::gpt-5-mini',
object='chat.completion',
service_tier=None,
system_fingerprint='',
usage=CompletionUsage(
completion_tokens=67,
prompt_tokens=13,
total_tokens=80,
completion_tokens_details=None,
prompt_tokens_details=None
),
episode_id='0198d33f-24f6-7cc3-9dd0-62cd7028c3d7'
)
```
See the [Inference (OpenAI) API Reference](/gateway/api-reference/inference-openai-compatible) for more details on the request and response formats.
You can point the OpenAI Node SDK to a TensorZero Gateway to call any LLM with a unified API.
For example, if you're using OpenAI, you can set the `OPENAI_API_KEY` environment variable with your API key.
```bash theme={null}
export OPENAI_API_KEY="sk-..."
```
See the [Integrations](/integrations/model-providers) page to learn how to set up credentials for other LLM providers.
You can install the OpenAI SDK with a package manager like `npm`.
```bash theme={null}
npm i openai
```
Let's deploy a standalone TensorZero Gateway using Docker.
For simplicity, we'll use the gateway without observability or custom configuration.
```bash theme={null}
docker run \
-e OPENAI_API_KEY \
-p 3000:3000 \
tensorzero/gateway \
--default-config
```
See the [TensorZero Gateway Deployment](/deployment/tensorzero-gateway) page for more details.
Let's initialize the OpenAI SDK and point it to the gateway we just launched.
```ts theme={null}
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:3000/openai/v1",
});
```
```ts theme={null}
const response = await client.chat.completions.create({
model: "tensorzero::model_name::openai::gpt-5-mini",
// or: model: "tensorzero::model_name::anthropic::claude-sonnet-4-20250514",
// or: Google, AWS, Azure, xAI, vLLM, Ollama, and many more
messages: [
{
role: "user",
content: "Tell me a fun fact.",
},
],
});
```
```ts theme={null}
{
id: '0198d345-4bd5-79a2-a235-ebaea8c16d91',
episode_id: '0198d345-4bd5-79a2-a235-ebbf6eb49cb8',
choices: [
{
index: 0,
finish_reason: 'stop',
message: {
content: 'Sure! Did you know that honey never spoils? Archaeologists have found pots of honey in ancient Egyptian tombs that are over 3,000 years old—and still perfectly edible!',
tool_calls: [],
role: 'assistant'
}
}
],
created: 1755891192,
model: 'tensorzero::model_name::openai::gpt-5-mini',
system_fingerprint: '',
service_tier: null,
object: 'chat.completion',
usage: { prompt_tokens: 13, completion_tokens: 37, total_tokens: 50 }
}
```
See the [Inference (OpenAI) API Reference](/gateway/api-reference/inference-openai-compatible) for more details on the request and response formats.
You can call the TensorZero Gateway directly over HTTP to access any LLM with a unified API.
For example, if you're using OpenAI, you can set the `OPENAI_API_KEY` environment variable with your API key.
```bash theme={null}
export OPENAI_API_KEY="sk-..."
```
See the [Integrations](/integrations/model-providers) page to learn how to set up credentials for other LLM providers.
Let's deploy a standalone TensorZero Gateway using Docker.
For simplicity, we'll use the gateway without observability or custom configuration.
```bash theme={null}
docker run \
-e OPENAI_API_KEY \
-p 3000:3000 \
tensorzero/gateway \
--default-config
```
See the [TensorZero Gateway Deployment](/deployment/tensorzero-gateway) page for more details.
You can call the LLM by sending a `POST` request to the `/inference` endpoint of the TensorZero Gateway.
```bash theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"model_name": "openai::gpt-5-mini",
"input": {
"messages": [
{
"role": "user",
"content": "Tell me a fun fact."
}
]
}
}'
```
```json theme={null}
{
"inference_id": "0198d351-b250-70d1-a24a-a255d148d7a6",
"episode_id": "0198d351-b250-70d1-a24a-a2690343bcf0",
"variant_name": "openai::gpt-5-mini",
"content": [
{
"type": "text",
"text": "Fun fact: botanically, bananas are berries but strawberries are not. \n\nIn botanical terms a \"berry\" develops from a single ovary and has seeds embedded in the flesh—bananas fit that definition, while strawberries are aggregate accessory fruits (the little \"seeds\" on the outside are actually separate ovaries). Want another fun fact?"
}
],
"usage": {
"input_tokens": 12,
"output_tokens": 334
},
"finish_reason": "stop"
}
```
See the [Inference API Reference](/gateway/api-reference/inference) for more details on the request and response formats.
See [Configure models and providers](/gateway/configure-models-and-providers) to set up multiple providers with routing and fallbacks and [Configure functions and variants](/gateway/configure-functions-and-variants) to manage your LLM logic with experimentation and observability.
# How to call the OpenAI Responses API
Source: https://www.tensorzero.com/docs/gateway/call-the-openai-responses-api
Learn how to use OpenAI's Responses API with built-in tools like web search.
This page shows how to:
* **Use a unified API.** TensorZero provides the same chat completion format for the Responses API.
* **Access built-in tools.** Enable built-in tools from OpenAI like `web_search`.
* **Enable reasoning models.** Support models with extended thinking capabilities.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/gateway/call-the-openai-responses-api) of this guide on GitHub.
## Call the OpenAI Responses API
The TensorZero Python SDK provides a unified API for calling OpenAI's Responses API.
You can set the `OPENAI_API_KEY` environment variable with your API key.
```bash theme={null}
export OPENAI_API_KEY="sk-..."
```
You can install the TensorZero SDK with a Python package manager like `pip`.
```bash theme={null}
pip install tensorzero
```
Create a configuration file with a model using `api_type = "responses"` and provider tools:
```toml title="tensorzero.toml" theme={null}
[models.gpt-5-mini-responses-web-search]
routing = ["openai"]
[models.gpt-5-mini-responses-web-search.providers.openai]
type = "openai"
model_name = "gpt-5-mini"
api_type = "responses"
include_encrypted_reasoning = true
provider_tools = [{type = "web_search"}] # built-in OpenAI web search tool
# Enable plain-text summaries of encrypted reasoning
extra_body = [
{ pointer = "/reasoning", value = { effort = "low", summary = "auto" } }
]
```
If you don't need to customize the model configuration (e.g. `include_encrypted_reasoning`, `provider_tools`), you can use a shorthand model name like `openai::responses::gpt-5-codex` to call the model directly.
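For example, a minimal sketch using the shorthand (no custom configuration; the prompt is a placeholder):
```python theme={null}
from tensorzero import TensorZeroGateway

# Sketch: call a Responses API model via the shorthand name, without custom config.
with TensorZeroGateway.build_embedded() as t0:
    response = t0.inference(
        model_name="openai::responses::gpt-5-codex",
        input={
            "messages": [{"role": "user", "content": "Tell me a fun fact."}]
        },
    )
    print(response)
```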
Let's deploy a standalone TensorZero Gateway using Docker.
For simplicity, we'll use the gateway with the configuration above.
```bash theme={null}
docker run \
-e OPENAI_API_KEY \
-v $(pwd)/tensorzero.toml:/app/config/tensorzero.toml:ro \
-p 3000:3000 \
tensorzero/gateway \
--config-file /app/config/tensorzero.toml
```
See the [TensorZero Gateway Deployment](/deployment/tensorzero-gateway) page for more details.
Let's initialize the TensorZero Gateway client and point it to the gateway we just launched.
```python theme={null}
from tensorzero import TensorZeroGateway
t0 = TensorZeroGateway.build_http(gateway_url="http://localhost:3000")
```
The TensorZero Python SDK includes a synchronous `TensorZeroGateway` client and an asynchronous `AsyncTensorZeroGateway` client.
Both options support running the gateway embedded in your application with `build_embedded` or connecting to a standalone gateway with `build_http`.
See [Clients](/gateway/clients/) for more details.
OpenAI web search can take up to a minute to complete.
```python theme={null}
response = t0.inference(
model_name="gpt-5-mini-responses-web-search",
input={
"messages": [
{
"role": "user",
"content": "What is the current population of Japan?",
}
]
},
# Thought summaries are enabled in tensorzero.toml via extra_body
)
```
```python theme={null}
ChatInferenceResponse(
inference_id=UUID('0199ff78-6246-7c12-b4b0-6e3a881cc6b9'),
episode_id=UUID('0199ff78-6246-7c12-b4b0-6e4367f949b8'),
variant_name='gpt-5-mini-responses-web-search',
content=[
Thought(
text=None,
type='thought',
signature='gAAAAABo9...',
summary=[
ThoughtSummaryBlock(
text="I need to search for Japan's current population data.",
type='summary_text'
)
],
provider_type='openai'
),
UnknownContentBlock(
data={
'id': 'ws_05489a0b57dc84980168f59fda57d481969c3603df0d675348',
'type': 'web_search_call',
'status': 'completed',
'action': {
'type': 'search',
'query': 'Japan population 2025 October 2025 population estimate Statistics Bureau of Japan'
}
},
model_name='gpt-5-mini-responses-web-search',
provider_name='openai',
type='unknown'
),
Thought(
text=None,
type='thought',
signature='gAAAAABo...',
provider_type=None
),
UnknownContentBlock(
data={
'id': 'ws_05489a0b57dc84980168f59fdf9b988196b36756d639e2b015',
'type': 'web_search_call',
'status': 'completed',
'action': {
'type': 'search',
'query': "Ministry of Internal Affairs and Communications Japan population Oct 1 2024 'total population' 'Japan' 'population estimates' '2024' 'Oct. 1' '総人口' '令和6年' "
}
},
model_name='gpt-5-mini-responses-web-search',
provider_name='openai',
type='unknown'
),
Thought(
text=None,
type='thought',
signature='gAAAAABo...',
provider_type=None
),
UnknownContentBlock(
data={
'id': 'ws_05489a0b57dc84980168f59fe1a388819684971acfdaf4cd44',
'type': 'web_search_call',
'status': 'completed',
'action': {
'type': 'search',
'query': "Ministry of Internal Affairs and Communications population Japan Oct 1 2024 total population 'Oct. 1, 2024' 'population' 'Japan' 'MIC' 'population estimates' '2024' '総人口' "
}
},
model_name='gpt-5-mini-responses-web-search',
provider_name='openai',
type='unknown'
),
Thought(
text=None,
type='thought',
signature='gAAAAABo...',
provider_type=None
),
UnknownContentBlock(
data={
'id': 'ws_05489a0b57dc84980168f59fe439788196911a195c70cc8ca9',
'type': 'web_search_call',
'status': 'completed',
'action': {'type': 'search'}
},
model_name='gpt-5-mini-responses-web-search',
provider_name='openai',
type='unknown'
),
Thought(
text=None,
type='thought',
signature='gAAAAABo...',
provider_type=None
),
UnknownContentBlock(
data={
'id': 'ws_05489a0b57dc84980168f59fe6b140819690a4468d3304fece',
'type': 'web_search_call',
'status': 'completed',
'action': {'type': 'search'}
},
model_name='gpt-5-mini-responses-web-search',
provider_name='openai',
type='unknown'
),
Thought(
text=None,
type='thought',
signature='gAAAAABo...',
provider_type=None
),
UnknownContentBlock(
data={
'id': 'ws_05489a0b57dc84980168f59fe81e408196921b69174f6abaf7',
'type': 'web_search_call',
'status': 'completed',
'action': {'type': 'search'}
},
model_name='gpt-5-mini-responses-web-search',
provider_name='openai',
type='unknown'
),
Thought(
text=None,
type='thought',
signature='gAAAAABo...',
provider_type=None
),
UnknownContentBlock(
data={
'id': 'ws_05489a0b57dc84980168f59feda6188196827a0b5aa01e96a1',
'type': 'web_search_call',
'status': 'completed',
'action': {
'type': 'search',
'query': "United Nations World Population Prospects 2024 Japan 2025 population 'Japan population 2025' 'World Population Prospects 2024' 'Japan' "
}
},
model_name='gpt-5-mini-responses-web-search',
provider_name='openai',
type='unknown'
),
Thought(
text=None,
type='thought',
signature='gAAAAABo...',
provider_type=None
),
UnknownContentBlock(
data={
'id': 'ws_05489a0b57dc84980168f59ff3cc8881968d1c5c9c1bbe4ecc',
'type': 'web_search_call',
'status': 'completed',
'action': {
'type': 'search',
'query': "UN World Population Prospects 2024 Japan population 2025 '123,103,479' 'Japan 2025' 'World Population Prospects' 'Japan' '2025' "
}
},
model_name='gpt-5-mini-responses-web-search',
provider_name='openai',
type='unknown'
),
Thought(
text=None,
type='thought',
signature='gAAAAABo...',
provider_type=None
),
UnknownContentBlock(
data={
'id': 'ws_05489a0b57dc84980168f59ff67ed48196a0054a38e96f8e0c',
'type': 'web_search_call',
'status': 'completed',
'action': {
'type': 'search',
'query': "United Nations population Japan 2025 'World Population Prospects 2024' 'Japan population 2025' site:un.org"
}
},
model_name='gpt-5-mini-responses-web-search',
provider_name='openai',
type='unknown'
),
Thought(
text=None,
type='thought',
signature='gAAAAABo...',
provider_type=None
),
Text(
text="Short answer: about 123–124 million people.\n\nMore precisely:\n- Japan's official estimate (Ministry of Internal Affairs and Communications / e‑Stat) reported a total population of 123,802,000 (including foreign residents) as of October 1, 2024 (release published Apr 14, 2025). ([e-stat.go.jp](https://www.e-stat.go.jp/en/stat-search/files?layout=dataset&page=1&query=Population+Estimates%2C+natural)) \n- The United Nations (WPP 2024, used by sources such as Worldometer) gives a mid‑2025 estimate of about 123.1 million. ([srv1.worldometers.info](https://srv1.worldometers.info/world-population/japan-population/?utm_source=openai))\n\nDo you want a live "right now" estimate for today (Oct 20, 2025) or a breakdown by Japanese nationals vs. foreign residents? I can fetch the latest live or official figures for the exact date you want.",
arguments=None,
type='text'
)
],
usage=Usage(input_tokens=29904, output_tokens=1921),
finish_reason=None,
raw_response=None
)
```
The TensorZero Python SDK integrates with the OpenAI Python SDK to provide access to the Responses API.
You can set the `OPENAI_API_KEY` environment variable with your API key.
```bash theme={null}
export OPENAI_API_KEY="sk-..."
```
You can install the OpenAI and TensorZero SDKs with a Python package manager like `pip`.
```bash theme={null}
pip install openai tensorzero
```
Create a configuration file with a model using `api_type = "responses"` and provider tools:
```toml title="tensorzero.toml" theme={null}
[models.gpt-5-mini-responses-web-search]
routing = ["openai"]
[models.gpt-5-mini-responses-web-search.providers.openai]
type = "openai"
model_name = "gpt-5-mini"
api_type = "responses"
include_encrypted_reasoning = true
provider_tools = [{type = "web_search"}] # built-in OpenAI web search tool
# Enable plain-text summaries of encrypted reasoning
extra_body = [
{ pointer = "/reasoning", value = { effort = "low", summary = "auto" } }
]
```
Let's deploy a standalone TensorZero Gateway using Docker.
For simplicity, we'll use the gateway with the configuration above.
```bash theme={null}
docker run \
-e OPENAI_API_KEY \
-v $(pwd)/tensorzero.toml:/app/config/tensorzero.toml:ro \
-p 3000:3000 \
tensorzero/gateway \
--config-file /app/config/tensorzero.toml
```
See the [TensorZero Gateway Deployment](/deployment/tensorzero-gateway) page for more details.
Let's initialize the OpenAI SDK and point it to the gateway we just launched.
```python theme={null}
from openai import OpenAI
oai = OpenAI(api_key="not-used", base_url="http://localhost:3000/openai/v1")
```
The TensorZero Python SDK supports both the synchronous `OpenAI` client and the asynchronous `AsyncOpenAI` client.
Both options support running the gateway embedded in your application with `patch_openai_client` or connecting to a standalone gateway with `base_url`.
See [Clients](/gateway/clients/) for more details.
OpenAI web search can take up to a minute to complete.
```python theme={null}
response = oai.chat.completions.create(
model="tensorzero::model_name::gpt-5-mini-responses-web-search",
messages=[
{
"role": "user",
"content": "What is the current population of Japan?",
}
],
)
```
The OpenAI SDK does not support additional content blocks (e.g. thoughts) in the chat completions API, so they are omitted.
Please use the TensorZero SDK if you want access to these auxiliary content blocks.
```python theme={null}
ChatCompletion(
id='0199ff78-5bad-7312-ab13-e4c5fa0bde8d',
choices=[
Choice(
finish_reason='stop',
index=0,
logprobs=None,
message=ChatCompletionMessage(
content="Short answer — it depends on the source/date:\n\n- Japan's official demographic survey (Ministry of Internal Affairs and Communications, reported by major Japanese outlets) shows a total population of 124,330,690 as of January 1, 2025 (this includes foreign residents). ([asahi.com](https://www.asahi.com/ajw/articles/15952384?utm_source=openai))\n\n- International mid‑year estimates (United Nations/UNFPA) put Japan's 2025 population at about 123.1 million (mid‑2025 estimate), which uses a different methodology and reference date. ([unfpa.org](https://www.unfpa.org/data/world-population/JP?utm_source=openai))\n\nToday is October 20, 2025 — would you like me to fetch a live or another specific estimate (e.g., UN mid‑year, World Bank, or the latest Japanese government update)?",
refusal=None,
role='assistant',
annotations=None,
audio=None,
function_call=None,
tool_calls=[]
)
)
],
created=1760927745,
model='tensorzero::model_name::gpt-5-mini-responses-web-search',
object='chat.completion',
service_tier=None,
system_fingerprint='',
usage=CompletionUsage(
completion_tokens=2304,
prompt_tokens=21444,
total_tokens=23748,
completion_tokens_details=None,
prompt_tokens_details=None
),
episode_id='0199ff78-5bad-7312-ab13-e4d8708e5b73'
)
```
You can point the OpenAI Node SDK to a TensorZero Gateway to access the Responses API.
You can set the `OPENAI_API_KEY` environment variable with your API key.
```bash theme={null}
export OPENAI_API_KEY="sk-..."
```
You can install the OpenAI SDK with a package manager like `npm`.
```bash theme={null}
npm i openai
```
Create a configuration file with a model using `api_type = "responses"` and provider tools:
```toml title="tensorzero.toml" theme={null}
[models.gpt-5-mini-responses-web-search]
routing = ["openai"]
[models.gpt-5-mini-responses-web-search.providers.openai]
type = "openai"
model_name = "gpt-5-mini"
api_type = "responses"
include_encrypted_reasoning = true
provider_tools = [{type = "web_search"}] # built-in OpenAI web search tool
# Enable plain-text summaries of encrypted reasoning
extra_body = [
{ pointer = "/reasoning", value = { effort = "low", summary = "auto" } }
]
```
Let's deploy a standalone TensorZero Gateway using Docker.
For simplicity, we'll use the gateway with the configuration above.
```bash theme={null}
docker run \
-e OPENAI_API_KEY \
-v $(pwd)/tensorzero.toml:/app/config/tensorzero.toml:ro \
-p 3000:3000 \
tensorzero/gateway \
--config-file /app/config/tensorzero.toml
```
See the [TensorZero Gateway Deployment](/deployment/tensorzero-gateway) page for more details.
Let's initialize the OpenAI SDK and point it to the gateway we just launched.
```ts theme={null}
import OpenAI from "openai";
const oai = new OpenAI({
apiKey: "not-used",
baseURL: "http://localhost:3000/openai/v1",
});
```
OpenAI web search can take up to a minute to complete.
```ts theme={null}
const response = await oai.chat.completions.create({
model: "tensorzero::model_name::gpt-5-mini-responses-web-search",
messages: [
{
role: "user",
content: "What is the current population of Japan?",
},
],
});
```
The OpenAI SDK does not support additional content blocks (e.g. thoughts) in the chat completions API, so they are omitted.
Please use the TensorZero SDK if you want access to these auxiliary content blocks.
```json theme={null}
{
id: '0199ff74-0203-70d1-857a-a52b89291955',
episode_id: '0199ff74-0203-70d1-857a-a53eb122c72f',
choices: [
{
index: 0,
finish_reason: 'stop',
message: {
content: 'According to Japan’s Statistics Bureau, the preliminary population count was 12,317 ten‑thousand (i.e., 123,170,000) as of September 1, 2025. ([stat.go.jp](https://www.stat.go.jp/english/?s=1&vm=r))\n' +
'\n' +
'Would you like a mid‑year UN estimate or the latest monthly update?',
tool_calls: [],
role: 'assistant'
}
}
],
created: 1760927476,
model: 'tensorzero::model_name::gpt-5-mini-responses-web-search',
system_fingerprint: '',
service_tier: null,
object: 'chat.completion',
usage: {
prompt_tokens: 32210,
completion_tokens: 2253,
total_tokens: 34463
}
}
```
You can call the TensorZero Gateway directly over HTTP to access the OpenAI Responses API.
You can set the `OPENAI_API_KEY` environment variable with your API key.
```bash theme={null}
export OPENAI_API_KEY="sk-..."
```
Create a configuration file with a model using `api_type = "responses"` and provider tools:
```toml title="tensorzero.toml" theme={null}
[models.gpt-5-mini-responses-web-search]
routing = ["openai"]
[models.gpt-5-mini-responses-web-search.providers.openai]
type = "openai"
model_name = "gpt-5-mini"
api_type = "responses"
include_encrypted_reasoning = true
provider_tools = [{type = "web_search"}] # built-in OpenAI web search tool
# Enable plain-text summaries of encrypted reasoning
extra_body = [
{ pointer = "/reasoning", value = { effort = "low", summary = "auto" } }
]
```
Let's deploy a standalone TensorZero Gateway using Docker.
For simplicity, we'll use the gateway with the configuration above.
```bash theme={null}
docker run \
-e OPENAI_API_KEY \
-v $(pwd)/tensorzero.toml:/app/config/tensorzero.toml:ro \
-p 3000:3000 \
tensorzero/gateway \
--config-file /app/config/tensorzero.toml
```
See the [TensorZero Gateway Deployment](/deployment/tensorzero-gateway) page for more details.
You can call the LLM by sending a `POST` request to the `/inference` endpoint of the TensorZero Gateway.
OpenAI web search can take up to a minute to complete.
```bash theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"model_name": "gpt-5-mini-responses-web-search",
"input": {
"messages": [
{
"role": "user",
"content": "What is the current population of Japan?"
}
]
}
}'
```
Thought summaries are enabled in `tensorzero.toml` via `extra_body` on the model configuration.
```json theme={null}
{
"inference_id": "0199ff71-33e2-7700-9d5f-43caeb1125ed",
"episode_id": "0199ff71-33e2-7700-9d5f-43d703c41609",
"variant_name": "gpt-5-mini-responses-web-search",
"content": [
{
"type": "thought",
"text": null,
"signature": "gAAAAABo...",
"summary": [
{
"type": "summary_text",
"text": "I need to search for Japan's current population data."
}
],
"provider_type": "openai"
},
{
"type": "unknown",
"data": {
"id": "ws_0dd147cea07b72510168f59e0496608194a85c1b0ff33c6203",
"type": "web_search_call",
"status": "completed",
"action": {
"type": "search",
"query": "Japan population 2025 estimated population October 2025 Japan population"
}
},
"model_name": "gpt-5-mini-responses-web-search",
"provider_name": "openai"
},
{
"type": "thought",
"text": null,
"signature": "gAAAAABo...",
"summary": [],
"provider_type": "openai"
},
{
"type": "unknown",
"data": {
"id": "ws_0dd147cea07b72510168f59e08f80881948b4ad8dbd8003a36",
"type": "web_search_call",
"status": "completed",
"action": {
"type": "search",
"query": "UN World Population Prospects 2024 Japan population 2025 mid-year 'Japan population 2025' 'World Population Prospects 2024' 'Japan' "
}
},
"model_name": "gpt-5-mini-responses-web-search",
"provider_name": "openai"
},
{
"type": "thought",
"text": null,
"signature": "gAAAAABo...",
"summary": [],
"provider_type": "openai"
},
{
"type": "unknown",
"data": {
"id": "ws_0dd147cea07b72510168f59e0c90f88194b1f0cf35f706c756",
"type": "web_search_call",
"status": "completed",
"action": {
"type": "search"
}
},
"model_name": "gpt-5-mini-responses-web-search",
"provider_name": "openai"
},
{
"type": "thought",
"text": null,
"signature": "gAAAAABo...",
"summary": [],
"provider_type": "openai"
},
{
"type": "unknown",
"data": {
"id": "ws_0dd147cea07b72510168f59e0f26a88194aa6a8e82fad8fc7f",
"type": "web_search_call",
"status": "completed",
"action": {
"type": "search",
"query": "Statistics Bureau of Japan population October 1 2025 \"Population Estimates\" \"Japan\" site:stat.go.jp"
}
},
"model_name": "gpt-5-mini-responses-web-search",
"provider_name": "openai"
},
{
"type": "thought",
"text": null,
"signature": "gAAAAABo...",
"summary": [],
"provider_type": "openai"
},
{
"type": "unknown",
"data": {
"id": "ws_0dd147cea07b72510168f59e166aac8194a8913647411512b4",
"type": "web_search_call",
"status": "completed",
"action": {
"type": "search",
"query": "UN World Population Prospects 2024 Japan population 2025 'Japan population 2025 UN WPP' 'United Nations Department of Economic and Social Affairs' 'Japan 2025 population' "
}
},
"model_name": "gpt-5-mini-responses-web-search",
"provider_name": "openai"
},
{
"type": "thought",
"text": null,
"signature": "gAAAAABo...",
"summary": [],
"provider_type": "openai"
},
{
"type": "unknown",
"data": {
"id": "ws_0dd147cea07b72510168f59e1925088194bb9a8f934b1e6bf1",
"type": "web_search_call",
"status": "completed",
"action": {
"type": "search",
"query": "World Population Prospects 2024 Japan population 2025 site:un.org"
}
},
"model_name": "gpt-5-mini-responses-web-search",
"provider_name": "openai"
},
{
"type": "thought",
"text": null,
"signature": "gAAAAABo...",
"summary": [],
"provider_type": "openai"
},
{
"type": "unknown",
"data": {
"id": "ws_0dd147cea07b72510168f59e1ea20081948eb2d81de67d12bb",
"type": "web_search_call",
"status": "completed",
"action": {
"type": "search"
}
},
"model_name": "gpt-5-mini-responses-web-search",
"provider_name": "openai"
},
{
"type": "thought",
"text": null,
"signature": "gAAAAABo...",
"summary": [],
"provider_type": "openai"
},
{
"type": "text",
"text": "Short answer: The most recent official estimate: 123,802,000 people (123.802 million) — this is the Statistics Bureau of Japan’s estimate for the total population as of October 1, 2024. ([stat.go.jp](https://www.stat.go.jp/english/data/jinsui/2024np/index.html?utm_source=openai))\n\nNotes / other common estimates\n- The United Nations' World Population Prospects (mid‑year 2025 estimate, medium variant) and datasets yield a mid‑2025 figure of about 123.1 million (different sources interpolate mid‑year values slightly differently). ([statisticstimes.com](https://statisticstimes.com/demographics/country/japan-population.php?utm_source=openai)) \n- Real‑time aggregators that produce daily \"live\" counters (e.g., Worldometer) show a slightly different number because they extrapolate from different baseline data and update continuously (Worldometer showed ~122.9 million on Oct 19, 2025). ([srv1.worldometers.info](https://srv1.worldometers.info/world-population/japan-population/?utm_source=openai))\n\nWhy numbers differ: sources use different reference dates (e.g., Oct 1 of each year, mid‑year July 1) and methods (census/register‑based counts vs. demographic projections), so small discrepancies are normal.\n\nWould you like me to fetch the very latest live estimate (timestamped to today, Oct 20, 2025) and show the source?"
}
],
"usage": {
"input_tokens": 21229,
"output_tokens": 1889
}
}
```
## Call the OpenAI Responses API with Azure
You can call the OpenAI Responses API with Azure by setting `api_base` in your configuration to your Azure deployment URL.
```toml theme={null}
[models.azure-gpt-5-mini-responses]
routing = ["azure"]
[models.azure-gpt-5-mini-responses.providers.azure]
type = "openai" # CAREFUL: not `azure`!
api_base = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/" # TODO: Insert your API base URL here
api_key_location = "env::AZURE_API_KEY"
model_name = "gpt-5-mini"
api_type = "responses"
```
The `azure` model provider does not support the Responses API.
You must use the `openai` provider with a custom `api_base` instead.
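As a rough sketch, you can then call this Azure-backed model like any other model through the gateway (the gateway URL and prompt are placeholders):
```python theme={null}
from tensorzero import TensorZeroGateway

# Sketch: call the Azure-backed Responses model defined above.
# Assumes a standalone gateway running at localhost:3000 with this configuration.
with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as t0:
    response = t0.inference(
        model_name="azure-gpt-5-mini-responses",
        input={
            "messages": [{"role": "user", "content": "Tell me a fun fact."}]
        },
    )
    print(response)
```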
# TensorZero Gateway Clients
Source: https://www.tensorzero.com/docs/gateway/clients
The TensorZero Gateway can be used with the TensorZero Python client, with OpenAI clients (e.g. Python/Node), or via its HTTP API in any programming language.
The TensorZero Gateway can be used with the **TensorZero Python client**, with **OpenAI clients (e.g. Python/Node)**, or via its **HTTP API in any programming language**.
## Python
### TensorZero Client
The TensorZero client offers the most flexibility.
It can be used with a built-in embedded (in-memory) gateway or a standalone HTTP gateway.
Additionally, it can be used synchronously or asynchronously.
You can install the TensorZero Python client with `pip install tensorzero`.
#### Embedded Gateway
The TensorZero Client includes a built-in embedded (in-memory) gateway, so you don't need to run a separate service.
##### Synchronous
```python theme={null}
from tensorzero import TensorZeroGateway
with TensorZeroGateway.build_embedded(
clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero", # optional: for observability
config_file="config/tensorzero.toml", # optional: for custom functions, models, metrics, etc.
) as client:
response = client.inference(
model_name="openai::gpt-4o-mini", # or: function_name="your_function_name"
input={
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
]
},
)
```
##### Asynchronous
```python theme={null}
from tensorzero import AsyncTensorZeroGateway
async with await AsyncTensorZeroGateway.build_embedded(
clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero", # optional: for observability
config_file="config/tensorzero.toml", # optional: for custom functions, models, metrics, etc.
) as gateway:
inference_response = await gateway.inference(
model_name="openai::gpt-4o-mini", # or: function_name="your_function_name"
input={
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
]
},
)
feedback_response = await gateway.feedback(
inference_id=inference_response.inference_id,
metric_name="task_success", # assuming a `task_success` metric is configured
value=True,
)
```
You can avoid the `await` in `build_embedded` by setting `async_setup=False`.
This is useful for synchronous contexts like `__init__` functions where `await` cannot be used.
However, avoid using it in asynchronous contexts as it blocks the event loop.
For async contexts, use the default `async_setup=True` with await.
For example, it's safe to use `async_setup=False` when initializing a FastAPI server, but not while the server is actively handling requests.
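For instance, a minimal sketch of synchronous initialization inside a constructor (the config path and model name are placeholders):
```python theme={null}
from tensorzero import AsyncTensorZeroGateway

# Sketch: build the embedded async gateway synchronously in a constructor,
# then call it from async code later.
# Avoid `async_setup=False` while an event loop is actively running.
class HaikuService:
    def __init__(self):
        self.t0 = AsyncTensorZeroGateway.build_embedded(
            config_file="config/tensorzero.toml",  # optional
            async_setup=False,  # returns the client directly instead of a coroutine
        )

    async def generate(self):
        return await self.t0.inference(
            model_name="openai::gpt-4o-mini",
            input={
                "messages": [{"role": "user", "content": "Write a haiku about TensorZero."}]
            },
        )
```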
#### Standalone HTTP Gateway
The TensorZero Client can optionally be used with a standalone HTTP Gateway instead.
##### Synchronous
```python theme={null}
from tensorzero import TensorZeroGateway
# Assuming the TensorZero Gateway is running on localhost:3000...
with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
# Same as above...
```
##### Asynchronous
```python theme={null}
from tensorzero import AsyncTensorZeroGateway
# Assuming the TensorZero Gateway is running on localhost:3000...
async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
# Same as above...
```
You can avoid the `await` in `build_http` by setting `async_setup=False`.
See above for more details.
### OpenAI Python Client
You can use the OpenAI Python client to run inference requests with TensorZero.
You need to use the TensorZero Client for feedback requests.
#### Embedded Gateway
You can run an embedded (in-memory) TensorZero Gateway with the OpenAI Python client, which doesn't require a separate service.
```python theme={null}
from openai import OpenAI
from tensorzero import patch_openai_client
client = OpenAI() # or AsyncOpenAI
await patch_openai_client(
client,
config_file="path/to/tensorzero.toml",
clickhouse_url="https://user:password@host:port/database",
)
response = client.chat.completions.create(
model="tensorzero::model_name::openai::gpt-4o-mini",
messages=[
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
],
)
```
You can avoid the `await` in `patch_openai_client` by setting `async_setup=False`.
See above for more details.
#### Standalone HTTP Gateway
You can deploy the TensorZero Gateway as a separate service and configure the OpenAI client to talk to it.
See [Deployment](/deployment/tensorzero-gateway/) for instructions on how to deploy the TensorZero Gateway.
```python "base_url="http://localhost:3000/openai/v1" "tensorzero::model_name::openai::gpt-4o-mini" theme={null}
from openai import OpenAI
# Assuming the TensorZero Gateway is running on localhost:3000...
with OpenAI(base_url="http://localhost:3000/openai/v1") as client:
response = client.chat.completions.create(
model="tensorzero::model_name::openai::gpt-4o-mini",
messages=[
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
],
)
```
#### Usage Details
##### `model`
In the OpenAI client, the `model` parameter should be one of the following:
> **`tensorzero::function_name::`**
>
> For example, if you have a function named `generate_haiku`, you can use `tensorzero::function_name::generate_haiku`.
> **`tensorzero::model_name::`**
>
> For example, if you have a model named `my_model` in the config file, you can use `tensorzero::model_name::my_model`.
> Alternatively, you can use default models like `tensorzero::model_name::openai::gpt-4o-mini`.
##### TensorZero Parameters
You can include optional TensorZero parameters (e.g. `episode_id` and `variant_name`) by prefixing them with `tensorzero::` in the `extra_body` field in OpenAI client requests.
```python theme={null}
response = client.chat.completions.create(
# ...
extra_body={
"tensorzero::episode_id": "00000000-0000-0000-0000-000000000000",
},
)
```
## JavaScript / TypeScript / Node
### OpenAI Node Client
You can use the OpenAI client to run inference requests with TensorZero.
You can deploy the TensorZero Gateway as a separate service and configure the OpenAI client to talk to the TensorZero Gateway.
See [Deployment](/deployment/tensorzero-gateway/) for instructions on how to deploy the TensorZero Gateway.
```ts "base_url="http://localhost:3000/openai/v1" "tensorzero::model_name::openai::gpt-4o-mini" theme={null}
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:3000/openai/v1",
});
const response = await client.chat.completions.create({
model: "tensorzero::model_name::openai::gpt-4o-mini",
messages: [
{
role: "user",
content: "Write a haiku about TensorZero.",
},
],
});
```
See [OpenAI Python Client » Usage Details](#usage-details) above for instructions on how to use the `model` parameter and other technical details.
You can include optional TensorZero parameters (e.g. `episode_id` and `variant_name`) by prefixing them with `tensorzero::` in the body in OpenAI client requests.
```ts theme={null}
const result = await client.chat.completions.create({
// ...
"tensorzero::episode_id": "00000000-0000-0000-0000-000000000000",
});
```
## Other Languages and Platforms
The TensorZero Gateway exposes every feature via its HTTP API.
You can deploy the TensorZero Gateway as a standalone service and interact with it from any programming language by making HTTP requests.
See [Deployment](/deployment/tensorzero-gateway/) for instructions on how to deploy the TensorZero Gateway.
### TensorZero HTTP API
```bash theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"model_name": "openai::gpt-4o-mini",
"input": {
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
}
}'
```
```bash theme={null}
curl -X POST "http://localhost:3000/feedback" \
-H "Content-Type: application/json" \
-d '{
"inference_id": "00000000-0000-0000-0000-000000000000",
"metric_name": "task_success",
"value": true,
}'
```
### OpenAI HTTP API
You can make OpenAI-compatible requests to the TensorZero Gateway.
```bash "http://localhost:3000/openai/v1/chat/completions" "tensorzero::model_name::openai::gpt-4o-mini" theme={null}
curl -X POST "http://localhost:3000/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "tensorzero::model_name::openai::gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
}'
```
See [OpenAI Python Client » Usage Details](#usage-details) above for instructions on how to use the `model` parameter and other technical details.
You can include optional TensorZero parameters (e.g. `episode_id` and `variant_name`) by prefixing them with `tensorzero::` in the body in OpenAI client requests.
```bash theme={null}
curl -X POST "http://localhost:3000/openai/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
// ...
"tensorzero::episode_id": "00000000-0000-0000-0000-000000000000"
}'
```
# Configuration Reference
Source: https://www.tensorzero.com/docs/gateway/configuration-reference
Learn how to configure the TensorZero Gateway.
The configuration file is the backbone of TensorZero.
It defines the behavior of the gateway, including the models and their providers, functions and their variants, tools, metrics, and more.
Developers express the behavior of LLM calls by defining the relevant prompt templates, schemas, and other parameters in this configuration file.
The configuration file is a TOML file with a few major sections (TOML tables): `gateway`, `clickhouse`, `postgres`, `models`, `model_providers`, `functions`, `variants`, `tools`, `metrics`, `rate_limiting`, and `object_storage`.
## `[gateway]`
The `[gateway]` section defines the behavior of the TensorZero Gateway.
### `auth.cache.enabled`
* **Type:** boolean
* **Required:** no (default: `true`)
Enable caching of authentication database queries.
When enabled, the gateway caches authentication results to reduce database load and improve performance.
See [Set up auth for TensorZero](/operations/set-up-auth-for-tensorzero) for more details.
### `auth.cache.ttl_ms`
* **Type:** integer
* **Required:** no (default: `1000`)
The time-to-live (TTL) in milliseconds for cached authentication queries.
By default, authentication results are cached for 1 second (1000 ms).
```toml title="tensorzero.toml" theme={null}
[gateway.auth.cache]
enabled = true
ttl_ms = 60_000 # Cache for one minute
```
See [Set up auth for TensorZero](/operations/set-up-auth-for-tensorzero) for more details.
### `auth.enabled`
* **Type:** boolean
* **Required:** no (default: `false`)
Enable authentication for the TensorZero Gateway.
When enabled, all gateway endpoints except `/status` and `/health` will require a valid API key.
You must set up Postgres to use authentication features.
API keys can be created and managed through the TensorZero UI or CLI.
```toml title="tensorzero.toml" theme={null}
[gateway]
auth.enabled = true
```
See [Set up auth for TensorZero](/operations/set-up-auth-for-tensorzero) for a complete guide.
### `base_path`
* **Type:** string
* **Required:** no (default: `/`)
If set, the gateway will prefix its HTTP endpoints with this base path.
For example, if `base_path` is set to `/custom/prefix`, the inference endpoint will become `/custom/prefix/inference` instead of `/inference`.
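A minimal example using this prefix:
```toml title="tensorzero.toml" theme={null}
[gateway]
# ...
base_path = "/custom/prefix"
# ...
```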
### `bind_address`
* **Type:** string
* **Required:** no (default: `[::]:3000`)
Defines the socket address (including port) to bind the TensorZero Gateway to.
You can bind the gateway to IPv4 and/or IPv6 addresses.
To bind to an IPv6 address, you can set this field to a value like `[::]:3000`.
Depending on the operating system, this value binds only to IPv6 (e.g. on Windows) or to both IPv4 and IPv6 (e.g. on Linux by default).
```toml title="tensorzero.toml" theme={null}
[gateway]
# ...
bind_address = "0.0.0.0:3000"
# ...
```
### `debug`
* **Type:** boolean
* **Required:** no (default: `false`)
Typically, TensorZero will not include inputs and outputs in logs or errors to avoid leaking sensitive data.
It may be helpful during development to be able to see more information about requests and responses.
When this field is set to `true`, the gateway will log more verbose errors to assist with debugging.
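For example, to enable verbose debugging during development:
```toml title="tensorzero.toml" theme={null}
[gateway]
# ...
debug = true
# ...
```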
### `disable_pseudonymous_usage_analytics`
* **Type:** boolean
* **Required:** no (default: `false`)
If set to `true`, TensorZero will not collect or share [pseudonymous usage analytics](/deployment/tensorzero-gateway/#disabling-pseudonymous-usage-analytics).
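For example, to opt out of pseudonymous usage analytics:
```toml title="tensorzero.toml" theme={null}
[gateway]
# ...
disable_pseudonymous_usage_analytics = true
# ...
```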
### `export.otlp.traces.enabled`
* **Type:** boolean
* **Required:** no (default: `false`)
Enable [exporting traces to an external OpenTelemetry-compatible observability system](/operations/export-opentelemetry-traces).
Note that you will still need to set the `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` environment variable. See the above-linked guide for details.
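For example, to enable OTLP trace export in the configuration file:
```toml title="tensorzero.toml" theme={null}
[gateway.export.otlp.traces]
enabled = true
```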
### `export.otlp.traces.extra_headers`
* **Type:** object (map of string to string)
* **Required:** no (default: `{}`)
Static headers to include in all OTLP trace export requests.
This is useful for adding metadata to OTLP exports.
These headers are merged with any dynamic headers sent via HTTP request headers.
When the same header key is present in both static and dynamic headers, the dynamic header value takes precedence.
```toml title="tensorzero.toml" theme={null}
[gateway.export.otlp.traces]
# ...
extra_headers.space_id = "123"
extra_headers."X-Custom-Header" = "custom-value"
# ...
```
Avoid storing sensitive credentials directly in configuration files. See
[Export OpenTelemetry traces](/operations/export-opentelemetry-traces) for
instructions on sending headers dynamically.
### `export.otlp.traces.format`
* **Type:** either "opentelemetry" or "openinference"
* **Required:** no (default: `"opentelemetry"`)
If set to `"opentelemetry"`, TensorZero will set `gen_ai` attributes based on the [OpenTelemetry GenAI semantic conventions](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).
If set to `"openinference"`, TensorZero will set attributes based on the [OpenInference semantic conventions](https://github.com/Arize-ai/openinference/blob/main/spec/llm_spans.md).
### `fetch_and_encode_input_files_before_inference`
* **Type:** boolean
* **Required:** no (default: `false`)
Controls how the gateway handles remote input files (e.g., images, PDFs) during multimodal inference.
If set to `true`, the gateway will fetch remote input files and send them as a base64-encoded payload in the prompt.
This is recommended to ensure that TensorZero and the model providers see identical inputs, which is important for observability and reproducibility.
If set to `false`, TensorZero will forward the input file URLs directly to the model provider (when supported) and fetch them for observability in parallel with inference.
This can be more efficient, but may result in different content being observed if the URL content changes between when the provider fetches it and when TensorZero fetches it for observability.
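For example, to fetch and base64-encode remote input files before inference:
```toml title="tensorzero.toml" theme={null}
[gateway]
# ...
fetch_and_encode_input_files_before_inference = true
# ...
```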
### `global_outbound_http_timeout_ms`
* **Type:** integer
* **Required:** no (default: `900000` = 15 minutes)
Sets the global timeout in milliseconds for all outbound HTTP requests made by TensorZero to external services such as model providers and APIs.
By default, all HTTP requests will timeout after 15 minutes (900,000 ms).
This timeout is intentionally set high to accommodate slow model responses, but you can customize it based on your requirements.
The `global_outbound_http_timeout_ms` acts as an upper bound for all more specific timeout configurations in your system.
Any variant-level timeouts (e.g., `timeouts.non_streaming.total_ms`, `timeouts.streaming.ttft_ms`), provider-level timeouts, or embedding model timeouts must be less than or equal to this global timeout.
Setting this value too low may cause legitimate requests to time out before receiving a response from the model provider.
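For example, to lower the global timeout to 5 minutes:
```toml title="tensorzero.toml" theme={null}
[gateway]
# ...
global_outbound_http_timeout_ms = 300_000 # 5 minutes
# ...
```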
### `metrics.tensorzero_inference_latency_overhead_seconds_histogram_buckets`
* **Type:** array of floats
* **Required:** no (default: disabled)
Enable the `tensorzero_inference_latency_overhead_seconds_histogram` Prometheus metric with the specified histogram buckets.
This metric tracks the latency overhead introduced by TensorZero on HTTP requests.
The buckets must be in strictly ascending order and contain at least one value.
```toml title="tensorzero.toml" theme={null}
[gateway.metrics]
tensorzero_inference_latency_overhead_seconds_histogram_buckets = [0.001, 0.01, 0.1]
```
See [Export Prometheus metrics](/operations/export-prometheus-metrics) for more details.
### `observability.async_writes`
* **Type:** boolean
* **Required:** no (default: `false`)
Enabling this setting will improve the latency of the gateway by offloading the responsibility of writing inferences, feedback, and other data to ClickHouse to a background task, instead of waiting for ClickHouse to complete the writes.
Each database insert is handled immediately in separate background tasks.
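For example, to enable asynchronous writes:
```toml title="tensorzero.toml" theme={null}
[gateway]
# ...
observability.async_writes = true
# ...
```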
See the ["Optimize latency and throughput" guide](/deployment/optimize-latency-and-throughput) for best practices.
You can't enable `async_writes` and `batch_writes` at the same time.
If you enable this setting, make sure that the gateway lives long enough to complete the writes.
This can be problematic in serverless environments that terminate the gateway instance after the response is returned but before the writes are completed.
### `observability.batch_writes`
* **Type:** object
* **Required:** no (default: disabled)
Enabling this setting will improve the latency and throughput of the gateway by offloading the responsibility of writing inferences, feedback, and other data to ClickHouse to a background task, instead of waiting for ClickHouse to complete the writes.
With `batch_writes`, multiple records are collected and written together in batches to improve efficiency.
The `batch_writes` object supports the following fields:
* `enabled` (boolean): Must be set to `true` to enable batch writes
* `flush_interval_ms` (integer, optional): Maximum time in milliseconds to wait before flushing a batch (default: `100`)
* `max_rows` (integer, optional): Maximum number of rows to collect before flushing a batch (default: `1000`)
```toml tensorzero.toml theme={null}
[gateway]
# ...
observability.batch_writes = { enabled = true, flush_interval_ms = 200, max_rows = 500 }
# ...
```
See the ["Optimize latency and throughput" guide](/deployment/optimize-latency-and-throughput) for best practices.
You can't enable `async_writes` and `batch_writes` at the same time.
If you enable this setting, make sure that the gateway lives long enough to complete the writes.
This can be problematic in serverless environments that terminate the gateway instance after the response is returned but before the writes are completed.
### `observability.enabled`
* **Type:** boolean
* **Required:** no (default: `null`)
Enable the observability features of the TensorZero Gateway.
If `true`, the gateway will throw an error on startup if it fails to validate the ClickHouse connection.
If `null`, the gateway will log a warning but continue if ClickHouse is not available, and it will use ClickHouse if available.
If `false`, the gateway will not use ClickHouse.
```toml title="tensorzero.toml" theme={null}
[gateway]
# ...
observability.enabled = true
# ...
```
### `observability.disable_automatic_migrations`
* **Type:** boolean
* **Required:** no (default: `false`)
Disable automatic running of the TensorZero migrations when the TensorZero Gateway launches.
If `true`, then the migrations are not applied upon launch and must instead be applied manually
by running `docker run --rm -e TENSORZERO_CLICKHOUSE_URL=$TENSORZERO_CLICKHOUSE_URL tensorzero/gateway:{version} --run-clickhouse-migrations` or `docker compose run --rm gateway --run-clickhouse-migrations`.
If `false`, then the migrations are run automatically upon launch.
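For example, to disable automatic migrations:
```toml title="tensorzero.toml" theme={null}
[gateway]
# ...
observability.disable_automatic_migrations = true
# ...
```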
### `relay`
Configure gateway relay to forward inference requests through another TensorZero Gateway.
See [Centralize auth, rate limits, and more](/operations/centralize-auth-rate-limits-and-more) for a complete guide.
#### `api_key_location`
* **Type:** string or object
* **Required:** no
Defines the location of the API key for authenticating with the relay gateway.
If unset, no API key will be sent.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE`, `dynamic::ARGUMENT_NAME`, and `none`.
See [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details.
```toml title="tensorzero.toml" theme={null}
[gateway.relay]
gateway_url = "http://relay-gateway:3000"
api_key_location = "env::TENSORZERO_RELAY_API_KEY"
# api_key_location = "dynamic::relay_api_key"
# api_key_location = { default = "dynamic::relay_api_key", fallback = "env::TENSORZERO_RELAY_API_KEY" }
```
#### `gateway_url`
* **Type:** string (URL)
* **Required:** no
The base URL of the relay gateway to forward inference requests to.
When set, all model inference requests will be forwarded to this gateway URL instead of calling the model providers directly.
```toml title="tensorzero.toml" theme={null}
[gateway.relay]
gateway_url = "http://relay-gateway:3000"
```
### `template_filesystem_access.base_path`
* **Type:** string
* **Required:** no (default: disabled)
Set `template_filesystem_access.base_path` to allow MiniJinja templates to load sub-templates using the `{% include %}` and `{% import %}` directives.
The directives will be relative to `base_path` and can only access files within that directory or its subdirectories.
The `base_path` can be absolute or relative to the configuration file's location.
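For example, if your shared sub-templates live in a `templates` directory next to the configuration file (illustrative path):
```toml title="tensorzero.toml" theme={null}
[gateway.template_filesystem_access]
base_path = "./templates" # illustrative path, relative to the configuration file
```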
## `[models.model_name]`
The `[models.model_name]` section defines the behavior of a model.
You can define multiple models by including multiple `[models.model_name]` sections.
A model is provider agnostic, and the relevant providers are defined in the `providers` sub-section (see below).
If your `model_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define `llama-3.1-8b-instruct` as `[models."llama-3.1-8b-instruct"]`.
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5]
# fieldA = ...
# fieldB = ...
# ...
[models."llama-3.1-8b-instruct"]
# fieldA = ...
# fieldB = ...
# ...
```
### `routing`
* **Type:** array of strings
* **Required:** yes
A list of provider names to route requests to.
The providers must be defined in the `providers` sub-section (see below).
The TensorZero Gateway will attempt to route a request to the first provider in the list, and fallback to subsequent providers in order if the request is not successful.
```toml mark="openai" mark="azure" theme={null}
# tensorzero.toml
[models.gpt-4o]
# ...
routing = ["openai", "azure"]
# ...
[models.gpt-4o.providers.openai]
# ...
[models.gpt-4o.providers.azure]
# ...
```
### `skip_relay`
* **Type:** boolean
* **Required:** no (default: `false`)
When set to `true`, this model will bypass the [relay gateway](/operations/centralize-auth-rate-limits-and-more) and call its providers directly.
This is useful when you want certain models to skip centralized controls like rate limits or credential management.
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o-edge]
routing = ["openai"]
skip_relay = true
[models.gpt-4o-edge.providers.openai]
type = "openai"
model_name = "gpt-4o"
```
Models that skip the relay won't benefit from centralized rate limits, auth policies, or credential management enforced by the relay gateway.
The edge gateway must have the necessary provider credentials configured to make direct requests.
### `timeouts`
* **Type:** object
* **Required:** no
The `timeouts` object allows you to set granular timeouts for requests to this model.
You can define timeouts for non-streaming and streaming requests separately: `timeouts.non_streaming.total_ms` corresponds to the total request duration and `timeouts.streaming.ttft_ms` corresponds to the time to first token (TTFT).
For example, the following configuration sets a 15-second timeout for non-streaming requests and a 3-second timeout for streaming requests (TTFT):
```toml theme={null}
[models.model_name]
# ...
timeouts = { non_streaming.total_ms = 15000, streaming.ttft_ms = 3000 }
# ...
```
The specified timeouts apply to the scope of an entire model inference request, including all retries and fallbacks across its providers.
You can also set timeouts at the variant level and provider level.
Multiple timeouts can be active simultaneously.
## `[models.model_name.providers.provider_name]`
The `providers` sub-section defines the behavior of a specific provider for a model.
You can define multiple providers by including multiple `[models.model_name.providers.provider_name]` sections.
If your `provider_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define `vllm.internal` as `[models.model_name.providers."vllm.internal"]`.
```toml mark="gpt-4o" mark="openai" mark="azure" theme={null}
# tensorzero.toml
[models.gpt-4o]
# ...
routing = ["openai", "azure"]
# ...
[models.gpt-4o.providers.openai]
# ...
[models.gpt-4o.providers.azure]
# ...
```
### `extra_body`
* **Type:** array of objects (see below)
* **Required:** no
The `extra_body` field allows you to modify the request body that TensorZero sends to a model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
Each object in the array must have two fields:
* `pointer`: A [JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901) string specifying where to modify the request body
* Use `-` as the final path element to append to an array (e.g., `/messages/-` appends to `messages`)
* One of the following:
* `value`: The value to insert at that location; it can be of any type including nested types
* `delete = true`: Deletes the field at the specified location, if present.
You can also set `extra_body` for a variant entry.
The model provider `extra_body` entries take priority over variant `extra_body` entries.
Additionally, you can set `extra_body` at inference-time.
The values provided at inference-time take priority over the values in the configuration file.
If TensorZero would normally send this request body to the provider...
```json theme={null}
{
"project": "tensorzero",
"safety_checks": {
"no_internet": false,
"no_agi": true
}
}
```
...then the following `extra_body`...
```toml theme={null}
extra_body = [
{ pointer = "/agi", value = true},
{ pointer = "/safety_checks/no_agi", value = { bypass = "on" }}
]
```
...overrides the request body to:
```json theme={null}
{
"agi": true,
"project": "tensorzero",
"safety_checks": {
"no_internet": false,
"no_agi": {
"bypass": "on"
}
}
}
```
### `extra_headers`
* **Type:** array of objects (see below)
* **Required:** no
The `extra_headers` field allows you to set or overwrite the request headers that TensorZero sends to a model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
Each object in the array must have two fields:
* `name` (string): The name of the header to modify (e.g. `anthropic-beta`)
* One of the following:
* `value` (string): The value of the header (e.g. `token-efficient-tools-2025-02-19`)
* `delete = true`: Deletes the header from the request, if present
You can also set `extra_headers` for a variant entry.
The model provider `extra_headers` entries take priority over variant `extra_headers` entries.
If TensorZero would normally send the following request headers to the provider...
```text theme={null}
Safety-Checks: on
```
...then the following `extra_headers`...
```toml theme={null}
extra_headers = [
{ name = "Safety-Checks", value = "off"},
{ name = "Intelligence-Level", value = "AGI"}
]
```
...overrides the request headers to:
```text theme={null}
Safety-Checks: off
Intelligence-Level: AGI
```
### `timeouts`
* **Type:** object
* **Required:** no
The `timeouts` object allows you to set granular timeouts for individual requests to a model provider.
You can define timeouts for non-streaming and streaming requests separately: `timeouts.non_streaming.total_ms` corresponds to the total request duration and `timeouts.streaming.ttft_ms` corresponds to the time to first token (TTFT).
For example, the following configuration sets a 15-second timeout for non-streaming requests and a 3-second timeout for streaming requests (TTFT):
```toml theme={null}
[models.model_name.providers.provider_name]
# ...
timeouts = { non_streaming.total_ms = 15000, streaming.ttft_ms = 3000 }
# ...
```
This setting applies to individual requests to the model provider.
If you're using an advanced variant type that performs multiple requests, the timeout will apply to each request separately.
If you've defined retries and fallbacks, the timeout will apply to each retry and fallback separately.
This setting is particularly useful if you'd like to retry or fallback on a request that's taking too long.
You can also set timeouts at the model level and variant level.
Multiple timeouts can be active simultaneously.
Separately, you can set a global timeout for the entire inference request using the TensorZero client's `timeout` field (or simply killing the request if you're using a different client).
### `type`
* **Type:** string
* **Required:** yes
Defines the type of the provider. See [Integrations » Model Providers](/gateway/api-reference/inference/#content-block) for details.
The supported provider types are `anthropic`, `aws_bedrock`, `aws_sagemaker`, `azure`, `deepseek`, `fireworks`, `gcp_vertex_anthropic`, `gcp_vertex_gemini`, `google_ai_studio_gemini`, `groq`, `hyperbolic`, `mistral`, `openai`, `openrouter`, `sglang`, `tgi`, `together`, `vllm`, and `xai`.
The other fields in the provider sub-section depend on the provider type.
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o.providers.azure]
# ...
type = "azure"
# ...
```
#### Anthropic
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the Anthropic API.
See Anthropic's documentation for the list of available model names.
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.anthropic]
# ...
type = "anthropic"
model_name = "claude-haiku-4-5"
# ...
```
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::ANTHROPIC_API_KEY` unless set otherwise in `provider_type.anthropic.defaults.api_key_location`)
Defines the location of the API key for the Anthropic provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE`, `dynamic::ARGUMENT_NAME`, and `none`.
See [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details.
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.anthropic]
# ...
type = "anthropic"
api_key_location = "dynamic::anthropic_api_key"
# api_key_location = "env::ALTERNATE_ANTHROPIC_API_KEY"
# api_key_location = { default = "dynamic::anthropic_api_key", fallback = "env::ANTHROPIC_API_KEY" }
# ...
```
##### `api_base`
* **Type:** string
* **Required:** no (default: `https://api.anthropic.com/v1/messages`)
Overrides the base URL used for Anthropic Messages API requests. The value should include the full endpoint path (for example `https://example.com/v1/messages`).
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.anthropic]
# ...
type = "anthropic"
api_base = "https://example.com/v1/messages"
# ...
```
##### `beta_structured_outputs`
* **Type:** boolean
* **Required:** no (default: `false`)
Enables strict validation for tool parameters when using tools with `strict = true`.
When enabled:
* Adds the `anthropic-beta: structured-outputs-2025-11-13` header to requests
* For tools with `strict = true`, forwards the `strict` parameter to enable strict validation
For JSON functions with `json_mode = "strict"`, TensorZero automatically uses Anthropic's structured outputs feature without this setting. This setting is only needed for strict tool parameter validation.
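For example, to enable strict tool parameter validation for an Anthropic provider:
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.anthropic]
# ...
type = "anthropic"
beta_structured_outputs = true
# ...
```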
#### AWS Bedrock
##### `access_key_id`
* **Type:** string
* **Required:** no
AWS access key ID for authentication. If not specified, uses AWS SDK default credential chain.
The supported locations are:
* `env::VAR_NAME` - read from environment variable at startup
* `dynamic::key_name` - resolve at request time from `credentials` field
* `sdk` - use AWS SDK default credential chain
When specified, `secret_access_key` must also be provided. Both fields must use the same source type (both `env::`, both `dynamic::`, or both `sdk`).
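For example, a minimal sketch that reads both credentials from environment variables (the variable names below are illustrative):
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.aws_bedrock]
# ...
type = "aws_bedrock"
access_key_id = "env::AWS_ACCESS_KEY_ID"         # illustrative environment variable name
secret_access_key = "env::AWS_SECRET_ACCESS_KEY" # must use the same source type as access_key_id
# or: access_key_id = "sdk" and secret_access_key = "sdk" to use the AWS SDK default credential chain
# ...
```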
##### `endpoint_url`
* **Type:** string
* **Required:** no
Custom endpoint URL for the AWS Bedrock API. Useful for AWS PrivateLink, FIPS endpoints, AWS China/GovCloud regions, or local testing (e.g., LocalStack).
The supported locations are:
* Static URLs (e.g., `"https://bedrock-runtime.us-east-1.amazonaws.com"`)
* `env::VAR_NAME` - read from environment variable at startup
* `path::/path/to/file` - read from file at startup
* `dynamic::key_name` - resolve at request time from `credentials` field
* `none` - treat as unspecified
See the [AWS Bedrock service endpoints documentation](https://docs.aws.amazon.com/general/latest/gr/bedrock.html) for available endpoints.
AWS China regions (`cn-north-1`, `cn-northwest-1`) and AWS GovCloud regions use different DNS suffixes than standard AWS regions.
For these partitions, you must specify the full `endpoint_url`.
For example:
```toml theme={null}
endpoint_url = "https://bedrock-runtime.cn-north-1.amazonaws.com.cn"
```
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.aws_bedrock]
type = "aws_bedrock"
model_id = "anthropic.claude-haiku-4-5-v1:0"
region = "us-east-1"
endpoint_url = "https://bedrock-runtime.us-east-1.amazonaws.com"
# or: endpoint_url = "env::BEDROCK_ENDPOINT_URL"
# or: endpoint_url = "path::/etc/secrets/bedrock-endpoint"
# or: endpoint_url = "dynamic::bedrock-endpoint-url"
```
When using `dynamic::` endpoints, untrusted clients can specify arbitrary
endpoints, potentially enabling credential exfiltration. Only use dynamic
endpoints when all clients are trusted.
##### `model_id`
* **Type:** string
* **Required:** yes
Defines the model ID to use with the AWS Bedrock API.
See AWS Bedrock's documentation for the list of available model IDs.
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.aws_bedrock]
# ...
type = "aws_bedrock"
model_id = "anthropic.claude-haiku-4-5-v1:0"
# ...
```
Many AWS Bedrock models are only available through cross-region inference profiles.
For those models, the `model_id` requires a special prefix (e.g. the `us.` prefix in `us.anthropic.claude-sonnet-4-5-20250929-v1:0`).
See the [AWS documentation on inference profiles](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-profiles-support.html).
##### `region`
* **Type:** string
* **Required:** yes
Defines the AWS region to use with the AWS Bedrock API.
The supported locations are:
* Static values (e.g., `"us-east-1"`)
* `env::VAR_NAME` - read from environment variable at startup
* `path::/path/to/file` - read from file at startup
* `dynamic::key_name` - resolve at inference time from `credentials` field
* `sdk` - use AWS SDK auto-detection (may slow down initialization in non-AWS environments)
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.aws_bedrock]
# ...
type = "aws_bedrock"
region = "us-east-2"
# or: region = "env::AWS_BEDROCK_REGION"
# or: region = "path::/etc/secrets/aws-region"
# or: region = "dynamic::bedrock-region"
# or: region = "sdk" # auto-detect using AWS SDK
# ...
```
##### `secret_access_key`
* **Type:** string
* **Required:** no (required if `access_key_id` is specified)
AWS secret access key for authentication.
The supported locations are:
* `env::VAR_NAME` - read from environment variable at startup
* `dynamic::key_name` - resolve at request time from `credentials` field
* `sdk` - use AWS SDK default credential chain
##### `session_token`
* **Type:** string
* **Required:** no
AWS session token for temporary credentials (e.g., when using IAM roles or STS).
The supported locations are:
* `env::VAR_NAME` - read from environment variable at startup
* `dynamic::key_name` - resolve at request time from `credentials` field
* `sdk` - use AWS SDK default credential chain
Must use the same source type as `access_key_id` and `secret_access_key`.
#### AWS SageMaker
##### `access_key_id`
* **Type:** string
* **Required:** no
AWS access key ID for authentication. If not specified, uses AWS SDK default credential chain.
The supported locations are:
* `env::VAR_NAME` - read from environment variable at startup
* `dynamic::key_name` - resolve at request time from `credentials` field
* `sdk` - use AWS SDK default credential chain
When specified, `secret_access_key` must also be provided. Both fields must use the same source type (both `env::`, both `dynamic::`, or both `sdk`).
##### `endpoint_name`
* **Type:** string
* **Required:** yes
Defines the endpoint name to use with the AWS SageMaker API.
##### `endpoint_url`
* **Type:** string
* **Required:** no
Custom endpoint URL for the AWS SageMaker API. Useful for AWS PrivateLink, FIPS endpoints, AWS China/GovCloud regions, or local testing (e.g., LocalStack).
The supported locations are:
* Static URLs (e.g., `"https://runtime.sagemaker.us-east-1.amazonaws.com"`)
* `env::VAR_NAME` - read from environment variable at startup
* `path::/path/to/file` - read from file at startup
* `dynamic::key_name` - resolve at request time from `credentials` field
* `none` - treat as unspecified
See the [AWS SageMaker service endpoints documentation](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) for available endpoints.
AWS China regions (`cn-north-1`, `cn-northwest-1`) and AWS GovCloud regions use different DNS suffixes than standard AWS regions.
For these partitions, you must specify the full `endpoint_url`.
For example:
```toml theme={null}
endpoint_url = "https://bedrock-runtime.cn-north-1.amazonaws.com.cn"
```
```toml title="tensorzero.toml" theme={null}
[models.my-model.providers.aws_sagemaker]
type = "aws_sagemaker"
endpoint_name = "my-endpoint"
model_name = "gemma3:1b"
hosted_provider = "openai"
region = "us-east-1"
endpoint_url = "https://runtime.sagemaker.us-east-1.amazonaws.com"
# or: endpoint_url = "env::SAGEMAKER_ENDPOINT_URL"
# or: endpoint_url = "path::/etc/secrets/sagemaker-endpoint"
# or: endpoint_url = "dynamic::sagemaker-endpoint-url"
```
When using `dynamic::` endpoints, untrusted clients can specify arbitrary
endpoints, potentially enabling credential exfiltration. Only use dynamic
endpoints when all clients are trusted.
##### `hosted_provider`
* **Type:** string
* **Required:** yes
Defines the underlying model provider to use with the SageMaker API.
The `aws_sagemaker` provider is a wrapper on other providers.
Currently, the only supported `hosted_provider` options are:
* `openai` (including any OpenAI-compatible server e.g. Ollama)
* `tgi`
For example, if you're using Ollama, you can set:
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.aws_sagemaker]
# ...
type = "aws_sagemaker"
hosted_provider = "openai"
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the AWS SageMaker API.
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.aws_sagemaker]
# ...
type = "aws_sagemaker"
model_name = "gemma3:1b"
# ...
```
##### `region`
* **Type:** string
* **Required:** yes
Defines the AWS region to use with the AWS SageMaker API.
The supported locations are:
* Static values (e.g., `"us-east-1"`)
* `env::VAR_NAME` - read from environment variable at startup
* `path::/path/to/file` - read from file at startup
* `dynamic::key_name` - resolve at request time from `credentials` field
* `sdk` - use AWS SDK auto-detection (may slow down initialization in non-AWS environments)
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.aws_sagemaker]
# ...
type = "aws_sagemaker"
region = "us-east-2"
# or: region = "env::AWS_SAGEMAKER_REGION"
# or: region = "path::/etc/secrets/aws-region"
# or: region = "dynamic::sagemaker-region"
# or: region = "sdk" # auto-detect using AWS SDK
# ...
```
##### `secret_access_key`
* **Type:** string
* **Required:** no (required if `access_key_id` is specified)
AWS secret access key for authentication.
The supported locations are:
* `env::VAR_NAME` - read from environment variable at startup
* `dynamic::key_name` - resolve at request time from `credentials` field
* `sdk` - use AWS SDK default credential chain
##### `session_token`
* **Type:** string
* **Required:** no
AWS session token for temporary credentials (e.g., when using IAM roles or STS).
The supported locations are:
* `env::VAR_NAME` - read from environment variable at startup
* `dynamic::key_name` - resolve at request time from `credentials` field
* `sdk` - use AWS SDK default credential chain
Must use the same source type as `access_key_id` and `secret_access_key`.
#### Azure OpenAI
The TensorZero Gateway handles the API version under the hood (currently `2025-04-01-preview`).
You only need to set the `deployment_id` and `endpoint` fields.
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::AZURE_API_KEY` unless set otherwise in `provider_type.azure.defaults.api_key_location`)
Defines the location of the API key for the Azure OpenAI provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE`, `dynamic::ARGUMENT_NAME`, and `none`.
See [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details.
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o-mini.providers.azure]
# ...
type = "azure"
api_key_location = "dynamic::azure_api_key"
# api_key_location = "env::ALTERNATE_AZURE_API_KEY"
# api_key_location = { default = "dynamic::azure_api_key", fallback = "env::AZURE_API_KEY" }
# ...
```
##### `deployment_id`
* **Type:** string
* **Required:** yes
Defines the deployment ID of the Azure OpenAI deployment.
See Azure OpenAI's documentation for the list of available models.
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o-mini.providers.azure]
# ...
type = "azure"
deployment_id = "gpt4o-mini-20240718"
# ...
```
##### `endpoint`
* **Type:** string
* **Required:** yes
Defines the endpoint of the Azure OpenAI deployment (protocol and hostname).
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o-mini.providers.azure]
# ...
type = "azure"
endpoint = "https://.openai.azure.com"
# ...
```
If the endpoint starts with `env::`, the succeeding value will be treated as an environment variable name and the gateway will attempt to retrieve the value from the environment on startup.
If the endpoint starts with `dynamic::`, the succeeding value will be treated as a dynamic credential name and the gateway will attempt to retrieve the value from the `dynamic_credentials` field on each inference where it is needed.
#### DeepSeek
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::DEEPSEEK_API_KEY` unless set otherwise in `provider_type.deepseek.defaults.api_key_location`)
Defines the location of the API key for the DeepSeek provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models.deepseek_chat.providers.deepseek]
# ...
type = "deepseek"
api_key_location = "dynamic::deepseek_api_key"
# api_key_location = "env::ALTERNATE_DEEPSEEK_API_KEY"
# api_key_location = { default = "dynamic::deepseek_api_key", fallback = "env::DEEPSEEK_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the DeepSeek API.
Currently supported models are `deepseek-chat` (DeepSeek-v3) and `deepseek-reasoner` (R1).
```toml title="tensorzero.toml" theme={null}
[models.deepseek_chat.providers.deepseek]
# ...
type = "deepseek"
model_name = "deepseek-chat"
# ...
```
#### Fireworks AI
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::FIREWORKS_API_KEY` unless set otherwise in `provider_type.fireworks.defaults.api_key_location`)
Defines the location of the API key for the Fireworks provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models."llama-3.1-8b-instruct".providers.fireworks]
# ...
type = "fireworks"
api_key_location = "dynamic::fireworks_api_key"
# api_key_location = "env::ALTERNATE_FIREWORKS_API_KEY"
# api_key_location = { default = "dynamic::fireworks_api_key", fallback = "env::FIREWORKS_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the Fireworks API.
See Fireworks' documentation for the list of available model names.
You can also deploy your own models on Fireworks AI.
```toml title="tensorzero.toml" theme={null}
[models."llama-3.1-8b-instruct".providers.fireworks]
# ...
type = "fireworks"
model_name = "accounts/fireworks/models/llama-v3p3-70b-instruct"
# ...
```
#### GCP Vertex AI Anthropic
##### `credential_location`
* **Type:** string or object
* **Required:** no (default: `path_from_env::GCP_VERTEX_CREDENTIALS_PATH` unless otherwise set in `provider_type.gcp_vertex_anthropic.defaults.credential_location`)
Defines the location of the credentials for the GCP Vertex Anthropic provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::PATH_TO_CREDENTIALS_FILE`, `path_from_env::ENVIRONMENT_VARIABLE`, `dynamic::CREDENTIALS_ARGUMENT_NAME`, `path::PATH_TO_CREDENTIALS_FILE`, and `sdk` (use Google Cloud SDK to auto-discover credentials).
See [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details.
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.gcp_vertex]
# ...
type = "gcp_vertex_anthropic"
credential_location = "dynamic::gcp_credentials_path"
# credential_location = "path_from_env::GCP_VERTEX_CREDENTIALS_PATH"
# credential_location = "path::/etc/secrets/gcp-key.json"
# credential_location = "sdk"
# credential_location = { default = "sdk", fallback = "path::/etc/secrets/gcp-key.json" }
# ...
```
##### `endpoint_id`
* **Type:** string
* **Required:** no (exactly one of `endpoint_id` or `model_id` must be set)
Defines the endpoint ID of the GCP Vertex AI Anthropic model.
Use `model_id` for off-the-shelf models and `endpoint_id` for fine-tuned models and custom endpoints.
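For example, to point the provider at a fine-tuned model's endpoint (the endpoint ID below is hypothetical):
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.gcp_vertex]
# ...
type = "gcp_vertex_anthropic"
endpoint_id = "1234567890123456789" # hypothetical endpoint ID
# ...
```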
##### `location`
* **Type:** string
* **Required:** yes
Defines the location (region) of the GCP Vertex AI Anthropic model.
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.gcp_vertex]
# ...
type = "gcp_vertex_anthropic"
location = "us-central1"
# ...
```
##### `model_id`
* **Type:** string
* **Required:** no (exactly one of `model_id` or `endpoint_id` must be set)
Defines the model ID of the GCP Vertex AI model.
See Anthropic's GCP documentation for the list of available model IDs.
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.gcp_vertex]
# ...
type = "gcp_vertex_anthropic"
model_id = "claude-haiku-4-5@20251001"
# ...
```
Use `model_id` for off-the-shelf models and `endpoint_id` for fine-tuned models and custom endpoints.
##### `project_id`
* **Type:** string
* **Required:** yes
Defines the project ID of the GCP Vertex AI model.
```toml title="tensorzero.toml" theme={null}
[models.claude-haiku-4-5.providers.gcp_vertex]
# ...
type = "gcp_vertex_anthropic"
project_id = "your-project-id"
# ...
```
#### GCP Vertex AI Gemini
##### `credential_location`
* **Type:** string or object
* **Required:** no (default: `path_from_env::GCP_VERTEX_CREDENTIALS_PATH` unless otherwise set in `provider_type.gcp_vertex_gemini.defaults.credential_location`)
Defines the location of the credentials for the GCP Vertex Gemini provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::PATH_TO_CREDENTIALS_FILE`, `path_from_env::ENVIRONMENT_VARIABLE`, `dynamic::CREDENTIALS_ARGUMENT_NAME`, `path::PATH_TO_CREDENTIALS_FILE`, and `sdk` (use Google Cloud SDK to auto-discover credentials).
See [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details.
```toml title="tensorzero.toml" theme={null}
[models."gemini-2.5-flash".providers.gcp_vertex]
# ...
type = "gcp_vertex_gemini"
credential_location = "dynamic::gcp_credentials_path"
# credential_location = "path_from_env::GCP_VERTEX_CREDENTIALS_PATH"
# credential_location = "path::/etc/secrets/gcp-key.json"
# credential_location = "sdk"
# credential_location = { default = "sdk", fallback = "path::/etc/secrets/gcp-key.json" }
# ...
```
##### `endpoint_id`
* **Type:** string
* **Required:** no (exactly one of `endpoint_id` or `model_id` must be set)
Defines the endpoint ID of the GCP Vertex AI Gemini model.
Use `model_id` for off-the-shelf models and `endpoint_id` for fine-tuned models and custom endpoints.
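For example, to point the provider at a fine-tuned model's endpoint (the endpoint ID below is hypothetical):
```toml title="tensorzero.toml" theme={null}
[models."gemini-2.5-flash".providers.gcp_vertex]
# ...
type = "gcp_vertex_gemini"
endpoint_id = "1234567890123456789" # hypothetical endpoint ID
# ...
```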
##### `location`
* **Type:** string
* **Required:** yes
Defines the location (region) of the GCP Vertex Gemini model.
```toml title="tensorzero.toml" theme={null}
[models."gemini-2.5-flash".providers.gcp_vertex]
# ...
type = "gcp_vertex_gemini"
location = "us-central1"
# ...
```
##### `model_id`
* **Type:** string
* **Required:** no (exactly one of `model_id` or `endpoint_id` must be set)
Defines the model ID of the GCP Vertex AI model.
See [GCP Vertex AI's documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions) for the list of available model IDs.
```toml title="tensorzero.toml" theme={null}
[models."gemini-2.5-flash".providers.gcp_vertex]
# ...
type = "gcp_vertex_gemini"
model_id = "gemini-2.5-flash"
# ...
```
##### `project_id`
* **Type:** string
* **Required:** yes
Defines the project ID of the GCP Vertex AI model.
```toml title="tensorzero.toml" theme={null}
[models."gemini-2.5-flash".providers.gcp_vertex]
# ...
type = "gcp_vertex_gemini"
project_id = "your-project-id"
# ...
```
#### Google AI Studio Gemini
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::GOOGLE_AI_STUDIO_API_KEY` unless otherwise set in `provider_type.google_ai_studio.defaults.credential_location`)
Defines the location of the API key for the Google AI Studio Gemini provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models."gemini-2.5-flash".providers.google_ai_studio_gemini]
# ...
type = "google_ai_studio_gemini"
api_key_location = "dynamic::google_ai_studio_api_key"
# api_key_location = "env::ALTERNATE_GOOGLE_AI_STUDIO_API_KEY"
# api_key_location = { default = "dynamic::google_ai_studio_api_key", fallback = "env::GOOGLE_AI_STUDIO_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the Google AI Studio Gemini API.
See [Google AI Studio's documentation](https://ai.google.dev/gemini-api/docs/models/gemini) for the list of available model names.
```toml title="tensorzero.toml" theme={null}
[models."gemini-2.5-flash".providers.google_ai_studio_gemini]
# ...
type = "google_ai_studio_gemini"
model_name = "gemini-2.5-flash"
# ...
```
#### Groq
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::GROQ_API_KEY` unless otherwise set in `provider_type.groq.defaults.credential_location`)
Defines the location of the API key for the Groq provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models.llama4_scout_17b_16e_instruct.providers.groq]
# ...
type = "groq"
api_key_location = "dynamic::groq_api_key"
# api_key_location = "env::ALTERNATE_GROQ_API_KEY"
# api_key_location = { default = "dynamic::groq_api_key", fallback = "env::GROQ_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the Groq API.
See [Groq's documentation](https://groq.com/pricing) for the list of available model names.
```toml title="tensorzero.toml" theme={null}
[models.llama4_scout_17b_16e_instruct.providers.groq]
# ...
type = "groq"
model_name = "meta-llama/llama-4-scout-17b-16e-instruct"
# ...
```
#### Hyperbolic
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::HYPERBOLIC_API_KEY` unless otherwise set in `provider_type.hyperbolic.defaults.api_key_location`)
Defines the location of the API key for the Hyperbolic provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models."openai/gpt-oss-20b".providers.hyperbolic]
# ...
type = "hyperbolic"
api_key_location = "dynamic::hyperbolic_api_key"
# api_key_location = "env::ALTERNATE_HYPERBOLIC_API_KEY"
# api_key_location = { default = "dynamic::hyperbolic_api_key", fallback = "env::HYPERBOLIC_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the Hyperbolic API.
See [Hyperbolic's documentation](https://app.hyperbolic.xyz/models) for the list of available model names.
```toml title="tensorzero.toml" theme={null}
[models."openai/gpt-oss-20b".providers.hyperbolic]
# ...
type = "hyperbolic"
model_name = "openai/gpt-oss-20b"
# ...
```
#### Mistral
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::MISTRAL_API_KEY` unless otherwise set in `provider_type.mistral.defaults.api_key_location`)
Defines the location of the API key for the Mistral provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models."open-mistral-nemo".providers.mistral]
# ...
type = "mistral"
api_key_location = "dynamic::mistral_api_key"
# api_key_location = "env::ALTERNATE_MISTRAL_API_KEY"
# api_key_location = { default = "dynamic::mistral_api_key", fallback = "env::MISTRAL_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the Mistral API.
See [Mistral's documentation](https://docs.mistral.ai/getting-started/models/) for the list of available model names.
```toml title="tensorzero.toml" theme={null}
[models."open-mistral-nemo".providers.mistral]
# ...
type = "mistral"
model_name = "open-mistral-nemo-2407"
# ...
```
#### OpenAI
##### `api_base`
* **Type:** string
* **Required:** no (default: `https://api.openai.com/v1/`)
Defines the base URL of the OpenAI API.
You can use the `api_base` field to use an API provider that is compatible with the OpenAI API.
However, many providers are only "approximately compatible" with the OpenAI API, so you might need to use a specialized model provider in those cases.
```toml title="tensorzero.toml" theme={null}
[models."gpt-4o".providers.openai]
# ...
type = "openai"
api_base = "https://api.openai.com/v1/"
# ...
```
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::OPENAI_API_KEY` unless otherwise set in `provider_types.openai.defaults.api_key_location`)
Defines the location of the API key for the OpenAI provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE`, `dynamic::ARGUMENT_NAME`, and `none` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o-mini.providers.openai]
# ...
type = "openai"
api_key_location = "dynamic::openai_api_key"
# api_key_location = "env::ALTERNATE_OPENAI_API_KEY"
# api_key_location = "none"
# api_key_location = { default = "dynamic::openai_api_key", fallback = "env::OPENAI_API_KEY" }
# ...
```
##### `api_type`
* **Type:** string
* **Required:** no (default: `chat_completions`)
Determines which OpenAI API endpoint to use.
The default value is `chat_completions` for the standard Chat Completions API.
Set to `responses` to use the Responses API, which provides access to built-in tools like web search and reasoning capabilities.
```toml title="tensorzero.toml" theme={null}
[models.gpt-5-mini-responses.providers.openai]
# ...
type = "openai"
api_type = "responses"
# ...
```
##### `include_encrypted_reasoning`
* **Type:** boolean
* **Required:** no (default: `false`)
Enables encrypted reasoning (thought blocks) when using the Responses API.
This parameter allows the model to show its internal reasoning process before generating the final response.
**Only available when `api_type = "responses"`.**
```toml title="tensorzero.toml" theme={null}
[models.gpt-5-mini-responses.providers.openai]
# ...
type = "openai"
api_type = "responses"
include_encrypted_reasoning = true
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the OpenAI API.
See [OpenAI's documentation](https://platform.openai.com/docs/models) for the list of available model names.
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o-mini.providers.openai]
# ...
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
# ...
```
##### `provider_tools`
* **Type:** array of objects
* **Required:** no (default: `[]`)
Defines provider-specific built-in tools that are available for this model provider.
These are tools that run server-side on the provider's infrastructure (e.g., OpenAI's web search tool).
Each object in the array should contain the provider-specific tool configuration as defined by the provider's API.
For example, OpenAI's Responses API supports a `web_search` tool that enables the model to search the web for information.
This field can be set statically in the configuration file or dynamically at inference time via the `provider_tools` parameter in the `/inference` endpoint or `tensorzero::provider_tools` in the OpenAI-compatible endpoint.
See the [Inference API Reference](/gateway/api-reference/inference/#provider_tools) for more details on dynamic usage.
```toml title="tensorzero.toml" theme={null}
[models.gpt-5-mini-responses-web-search.providers.openai]
# ...
type = "openai"
api_type = "responses"
provider_tools = [{type = "web_search"}] # Enable OpenAI's built-in web search tool
# ...
```
#### OpenRouter
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::OPENROUTER_API_KEY` unless otherwise set in `provider_types.openrouter.defaults.api_key_location`)
Defines the location of the API key for the OpenRouter provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models.gpt4_turbo.providers.openrouter]
# ...
type = "openrouter"
api_key_location = "dynamic::openrouter_api_key"
# api_key_location = "env::ALTERNATE_OPENROUTER_API_KEY"
# api_key_location = { default = "dynamic::openrouter_api_key", fallback = "env::OPENROUTER_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the OpenRouter API.
See [OpenRouter's documentation](https://openrouter.ai/models) for the list of available model names.
```toml title="tensorzero.toml" theme={null}
[models.gpt4_turbo.providers.openrouter]
# ...
type = "openrouter"
model_name = "openai/gpt4.1"
# ...
```
#### SGLang
##### `api_base`
* **Type:** string
* **Required:** yes
Defines the base URL of the SGLang API.
```toml title="tensorzero.toml" theme={null}
[models.llama.providers.sglang]
# ...
type = "sglang"
api_base = "http://localhost:8080/v1/"
# ...
```
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `none`)
Defines the location of the API key for the SGLang provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE`, `dynamic::ARGUMENT_NAME`, and `none` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models.llama.providers.sglang]
# ...
type = "sglang"
api_key_location = "dynamic::sglang_api_key"
# api_key_location = "env::ALTERNATE_SGLANG_API_KEY"
# api_key_location = "none" # if authentication is disabled
# api_key_location = { default = "dynamic::sglang_api_key", fallback = "env::SGLANG_API_KEY" }
# ...
```
#### Together AI
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::TOGETHER_API_KEY` unless otherwise set in `provider_types.together.defaults.api_key_location`)
Defines the location of the API key for the Together AI provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models.llama3_3_70b_instruct_turbo.providers.together]
# ...
type = "together"
api_key_location = "dynamic::together_api_key"
# api_key_location = "env::ALTERNATE_TOGETHER_API_KEY"
# api_key_location = { default = "dynamic::together_api_key", fallback = "env::TOGETHER_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the Together API.
See [Together's documentation](https://docs.together.ai/docs/chat-models) for the list of available model names.
You can also deploy your own models on Together AI.
```toml title="tensorzero.toml" theme={null}
[models.llama3_3_70b_instruct_turbo.providers.together]
# ...
type = "together"
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
# ...
```
#### vLLM
##### `api_base`
* **Type:** string
* **Required:** no (default: `http://localhost:8000/v1/`)
Defines the base URL of the vLLM API.
```toml title="tensorzero.toml" theme={null}
[models."phi-3.5-mini-instruct".providers.vllm]
# ...
type = "vllm"
api_base = "http://localhost:8000/v1/"
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the vLLM API.
```toml title="tensorzero.toml" theme={null}
[models."phi-3.5-mini-instruct".providers.vllm]
# ...
type = "vllm"
model_name = "microsoft/Phi-3.5-mini-instruct"
# ...
```
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::VLLM_API_KEY`)
Defines the location of the API key for the vLLM provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE`, `dynamic::ARGUMENT_NAME`, and `none` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models."phi-3.5-mini-instruct".providers.vllm]
# ...
type = "vllm"
api_key_location = "dynamic::vllm_api_key"
# api_key_location = "env::ALTERNATE_VLLM_API_KEY"
# api_key_location = "none"
# api_key_location = { default = "dynamic::vllm_api_key", fallback = "env::VLLM_API_KEY" }
# ...
```
#### xAI
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::XAI_API_KEY` unless otherwise set in `provider_types.xai.defaults.api_key_location`)
Defines the location of the API key for the xAI provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models.grok_4_1_fast_non_reasoning.providers.xai]
# ...
type = "xai"
api_key_location = "dynamic::xai_api_key"
# api_key_location = "env::ALTERNATE_XAI_API_KEY"
# api_key_location = { default = "dynamic::xai_api_key", fallback = "env::XAI_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the xAI API.
See [xAI's documentation](https://docs.x.ai/docs/models) for the list of available model names.
```toml title="tensorzero.toml" theme={null}
[models.grok_4_1_fast_non_reasoning.providers.xai]
# ...
type = "xai"
model_name = "grok-4-1-fast-non-reasoning"
# ...
```
##### `api_base`
* **Type:** string
* **Required:** yes
Defines the base URL of the TGI API.
```toml title="tensorzero.toml" theme={null}
[models.phi_4.providers.tgi]
# ...
type = "tgi"
api_base = "http://localhost:8080/v1/"
# ...
```
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `none`)
Defines the location of the API key for the TGI provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE`, `dynamic::ARGUMENT_NAME`, and `none` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[models.phi_4.providers.tgi]
# ...
type = "tgi"
api_key_location = "dynamic::tgi_api_key"
# api_key_location = "env::ALTERNATE_TGI_API_KEY"
# api_key_location = "none" # if authentication is disabled
# api_key_location = { default = "dynamic::tgi_api_key", fallback = "env::TGI_API_KEY" }
# ...
```
## `[embedding_models.model_name]`
The `[embedding_models.model_name]` section defines the behavior of an embedding model.
You can define multiple models by including multiple `[embedding_models.model_name]` sections.
A model is provider agnostic, and the relevant providers are defined in the `providers` sub-section (see below).
If your `model_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define `embedding-0.1` as `[embedding_models."embedding-0.1"]`.
```toml title="tensorzero.toml" theme={null}
[embedding_models.openai-text-embedding-3-small]
# fieldA = ...
# fieldB = ...
# ...
[embedding_models."t0-text-embedding-3.5-massive"]
# fieldA = ...
# fieldB = ...
# ...
```
### `routing`
* **Type:** array of strings
* **Required:** yes
A list of provider names to route requests to.
The providers must be defined in the `providers` sub-section (see below).
The TensorZero Gateway will attempt to route a request to the first provider in the list, and fallback to subsequent providers in order if the request is not successful.
```toml mark="openai" mark="azure" theme={null}
# tensorzero.toml
[embedding_models.model-name]
# ...
routing = ["openai", "alternative-provider"]
# ...
[embedding_models.model-name.providers.openai]
# ...
[embedding_models.model-name.providers.alternative-provider]
# ...
```
### `timeout_ms`
* **Type:** integer
* **Required:** no
The total time allowed (in milliseconds) for the embedding model to complete the request.
This timeout applies to the entire request, including all provider attempts in the routing list.
If a provider times out, the next provider in the routing list will be attempted.
If all providers timeout or the model-level timeout is reached, an error will be returned.
```toml title="tensorzero.toml" theme={null}
[embedding_models.model-name]
routing = ["openai"]
timeout_ms = 5000 # 5 second timeout
# ...
```
## `[embedding_models.model_name.providers.provider_name]`
The `providers` sub-section defines the behavior of a specific provider for a model.
You can define multiple providers by including multiple `[embedding_models.model_name.providers.provider_name]` sections.
If your `provider_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define `vllm.internal` as `[embedding_models.model_name.providers."vllm.internal"]`.
```toml mark="openai" mark="azure" theme={null}
# tensorzero.toml
[embedding_models.model-name]
# ...
routing = ["openai", "alternative-provider"]
# ...
[embedding_models.model-name.providers.openai]
# ...
[embedding_models.model-name.providers.alternative-provider]
# ...
```
### `extra_body`
* **Type:** array of objects (see below)
* **Required:** no
The `extra_body` field allows you to modify the request body that TensorZero sends to the embedding model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
Each object in the array must have two fields:
* `pointer`: A [JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901) string specifying where to modify the request body
* Use `-` as the final path element to append to an array (e.g., `/messages/-` appends to `messages`)
* One of the following:
* `value`: The value to insert at that location; it can be of any type including nested types
* `delete = true`: Deletes the field at the specified location, if present.
You can also set `extra_body` at inference-time.
The values provided at inference-time take priority over the values in the configuration file.
```toml title="tensorzero.toml" theme={null}
[embedding_models.openai-text-embedding-3-small.providers.openai]
type = "openai"
extra_body = [
{ pointer = "/dimensions", value = 1536 }
]
```
### `timeout_ms`
* **Type:** integer
* **Required:** no
The total time allowed (in milliseconds) for this specific provider to complete the embedding request.
If the provider times out, the next provider in the routing list will be attempted (if any).
```toml title="tensorzero.toml" theme={null}
[embedding_models.model-name.providers.openai]
type = "openai"
timeout_ms = 3000 # 3 second timeout for this provider
# ...
```
### `type`
* **Type:** string
* **Required:** yes
Defines the type of the provider. See [Integrations » Model Providers](/integrations/model-providers) for details.
The other fields in the provider sub-section depend on the provider type.
```toml title="tensorzero.toml" theme={null}
[embedding_models.model-name.providers.openai]
# ...
type = "openai"
# ...
```
##### `api_base`
* **Type:** string
* **Required:** no (default: `https://api.openai.com/v1/`)
Defines the base URL of the OpenAI API.
You can use the `api_base` field to use an API provider that is compatible with the OpenAI API.
However, many providers are only "approximately compatible" with the OpenAI API, so you might need to use a specialized model provider in those cases.
```toml title="tensorzero.toml" theme={null}
[embedding_models.openai-text-embedding-3-small.providers.openai]
# ...
type = "openai"
api_base = "https://api.openai.com/v1/"
# ...
```
##### `api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::OPENAI_API_KEY`)
Defines the location of the API key for the OpenAI provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE`, `dynamic::ARGUMENT_NAME`, and `none` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[embedding_models.openai-text-embedding-3-small.providers.openai]
# ...
type = "openai"
api_key_location = "dynamic::openai_api_key"
# api_key_location = "env::ALTERNATE_OPENAI_API_KEY"
# api_key_location = "none"
# api_key_location = { default = "dynamic::openai_api_key", fallback = "env::OPENAI_API_KEY" }
# ...
```
##### `model_name`
* **Type:** string
* **Required:** yes
Defines the model name to use with the OpenAI API.
See [OpenAI's documentation](https://platform.openai.com/docs/models/embeddings) for the list of available model names.
```toml title="tensorzero.toml" theme={null}
[embedding_models.openai-text-embedding-3-small.providers.openai]
# ...
type = "openai"
model_name = "text-embedding-3-small"
# ...
```
## `[provider_types]`
The `provider_types` section of the configuration specifies global settings for a particular inference provider type (like `"openai"` or `"anthropic"`), such as where to look for credentials by default.
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::ANTHROPIC_API_KEY`)
Defines the default location of the API key for Anthropic models.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.anthropic.defaults]
# ...
api_key_location = "dynamic::anthropic_api_key"
# api_key_location = "env::ALTERNATE_ANTHROPIC_API_KEY"
# api_key_location = { default = "dynamic::anthropic_api_key", fallback = "env::ANTHROPIC_API_KEY" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::AZURE_API_KEY`)
Defines the default location of the API key for Azure models.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.azure.defaults]
# ...
api_key_location = "dynamic::azure_api_key"
# api_key_location = { default = "dynamic::azure_api_key", fallback = "env::AZURE_API_KEY" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::DEEPSEEK_API_KEY`)
Defines the location of the API key for the DeepSeek provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.deepseek.defaults]
# ...
api_key_location = "dynamic::deepseek_api_key"
# api_key_location = { default = "dynamic::deepseek_api_key", fallback = "env::DEEPSEEK_API_KEY" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::FIREWORKS_API_KEY`)
Defines the location of the API key for the Fireworks provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.fireworks.defaults]
# ...
api_key_location = "dynamic::fireworks_api_key"
# api_key_location = { default = "dynamic::fireworks_api_key", fallback = "env::FIREWORKS_API_KEY" }
# ...
```
#### `sft`
* **Type:** object
* **Required:** no (default: `null`)
The `sft` object configures supervised fine-tuning for Fireworks models.
##### `account_id`
* **Type:** string
* **Required:** yes
Your Fireworks account ID, used for fine-tuning job management.
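For example, a minimal sketch of this configuration (assuming the `sft` object lives under `[provider_types.fireworks.sft]`, mirroring the `defaults` table above, and using a placeholder account ID):
```toml title="tensorzero.toml" theme={null}
[provider_types.fireworks.sft]
# Placeholder account ID for illustration
account_id = "my-fireworks-account"
```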
##### `defaults.credential_location`
* **Type:** string or object
* **Required:** no (default: `path_from_env::GCP_VERTEX_CREDENTIALS_PATH`)
Defines the location of the credentials for the GCP Vertex Anthropic provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::PATH_TO_CREDENTIALS_FILE`, `dynamic::CREDENTIALS_ARGUMENT_NAME`, `path::PATH_TO_CREDENTIALS_FILE`, and `path_from_env::ENVIRONMENT_VARIABLE` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.gcp_vertex_anthropic.defaults]
# ...
credential_location = "dynamic::gcp_credentials_path"
# credential_location = "path::/etc/secrets/gcp-key.json"
# credential_location = { default = "sdk", fallback = "path::/etc/secrets/gcp-key.json" }
# ...
```
#### `batch`
* **Type:** object
* **Required:** no (default: `null`)
The `batch` object allows you to configure batch processing for GCP Vertex models.
Today, we support batch inference through GCP Vertex using Google Cloud Storage, as documented [here](https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-batch-predictions#api:-cloud-storage).
To use this feature, you must also configure `object_storage` with GCP (see the [`object_storage`](#object_storage) section).
```toml title="tensorzero.toml" theme={null}
[provider_types.gcp_vertex_gemini.batch]
storage_type = "cloud_storage"
input_uri_prefix = "gs://my-bucket/batch-inputs/"
output_uri_prefix = "gs://my-bucket/batch-outputs/"
```
The `batch` object supports the following configuration:
##### `storage_type`
* **Type:** string
* **Required:** no (default: `"none"`)
Defines the storage type for batch processing. Currently, only `"cloud_storage"` and `"none"` are supported.
##### `input_uri_prefix`
* **Type:** string
* **Required:** yes when `storage_type` is `"cloud_storage"`
Defines the Google Cloud Storage URI prefix where batch input files will be stored.
##### `output_uri_prefix`
* **Type:** string
* **Required:** yes when `storage_type` is `"cloud_storage"`
Defines the Google Cloud Storage URI prefix where batch output files will be stored.
#### `sft`
* **Type:** object
* **Required:** no (default: `null`)
The `sft` object configures supervised fine-tuning for GCP Vertex Gemini models.
##### `bucket_name`
* **Type:** string
* **Required:** yes
The Google Cloud Storage bucket name for storing fine-tuning data.
##### `bucket_path_prefix`
* **Type:** string
* **Required:** no
Optional path prefix within the bucket for organizing fine-tuning data.
##### `kms_key_name`
* **Type:** string
* **Required:** no
Optional Cloud KMS key name for encrypting fine-tuning data.
##### `project_id`
* **Type:** string
* **Required:** yes
The GCP project ID where fine-tuning jobs will run.
##### `region`
* **Type:** string
* **Required:** yes
The GCP region for fine-tuning operations (e.g., `"us-central1"`).
##### `service_account`
* **Type:** string
* **Required:** no
Optional service account email for fine-tuning operations.
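For example, a minimal sketch of this configuration (assuming the `sft` object lives under `[provider_types.gcp_vertex_gemini.sft]`, with placeholder values):
```toml title="tensorzero.toml" theme={null}
[provider_types.gcp_vertex_gemini.sft]
bucket_name = "my-finetuning-bucket"  # required
project_id = "my-gcp-project"         # required
region = "us-central1"                # required
# bucket_path_prefix = "tensorzero/sft"                                  # optional
# service_account = "sft-runner@my-gcp-project.iam.gserviceaccount.com"  # optional
```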
##### `defaults.credential_location`
* **Type:** string or object
* **Required:** no (default: `path_from_env::GCP_VERTEX_CREDENTIALS_PATH`)
Defines the location of the credentials for the GCP Vertex Gemini provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::PATH_TO_CREDENTIALS_FILE`, `dynamic::CREDENTIALS_ARGUMENT_NAME`, `path::PATH_TO_CREDENTIALS_FILE`, and `path_from_env::ENVIRONMENT_VARIABLE` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.gcp_vertex_gemini.defaults]
# ...
credential_location = "dynamic::gcp_credentials_path"
# credential_location = "path::/etc/secrets/gcp-key.json"
# credential_location = { default = "sdk", fallback = "path::/etc/secrets/gcp-key.json" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::GOOGLE_AI_STUDIO_API_KEY`)
Defines the location of the API key for the Google AI Studio provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.google_ai_studio.defaults]
# ...
api_key_location = "dynamic::google_ai_studio_api_key"
# api_key_location = { default = "dynamic::google_ai_studio_api_key", fallback = "env::GOOGLE_AI_STUDIO_API_KEY" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::GROQ_API_KEY`)
Defines the location of the API key for the Groq provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.groq.defaults]
# ...
api_key_location = "dynamic::groq_api_key"
# api_key_location = { default = "dynamic::groq_api_key", fallback = "env::GROQ_API_KEY" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::HYPERBOLIC_API_KEY`)
Defines the location of the API key for the Hyperbolic provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.hyperbolic.defaults]
# ...
api_key_location = "dynamic::hyperbolic_api_key"
# api_key_location = { default = "dynamic::hyperbolic_api_key", fallback = "env::HYPERBOLIC_API_KEY" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::MISTRAL_API_KEY`)
Defines the location of the API key for the Mistral provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.mistral.defaults]
# ...
api_key_location = "dynamic::mistral_api_key"
# api_key_location = { default = "dynamic::mistral_api_key", fallback = "env::MISTRAL_API_KEY" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::OPENAI_API_KEY`)
Defines the location of the API key for the OpenAI provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.openai.defaults]
# ...
api_key_location = "dynamic::openai_api_key"
# api_key_location = { default = "dynamic::openai_api_key", fallback = "env::OPENAI_API_KEY" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::OPENROUTER_API_KEY`)
Defines the location of the API key for the OpenRouter provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.openrouter.defaults]
# ...
api_key_location = "dynamic::openrouter_api_key"
# api_key_location = { default = "dynamic::openrouter_api_key", fallback = "env::OPENROUTER_API_KEY" }
# ...
```
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::TOGETHER_API_KEY`)
Defines the location of the API key for the Together provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.together.defaults]
# ...
api_key_location = "dynamic::together_api_key"
# api_key_location = { default = "dynamic::together_api_key", fallback = "env::TOGETHER_API_KEY" }
# ...
```
#### `sft`
* **Type:** object
* **Required:** no (default: `null`)
The `sft` object configures supervised fine-tuning for Together models.
##### `hf_api_token`
* **Type:** string
* **Required:** no
Hugging Face API token for pushing fine-tuned models to the Hugging Face Hub.
##### `wandb_api_key`
* **Type:** string
* **Required:** no
Weights & Biases API key for experiment tracking during fine-tuning.
##### `wandb_base_url`
* **Type:** string
* **Required:** no
Custom Weights & Biases API base URL (for self-hosted instances).
##### `wandb_project_name`
* **Type:** string
* **Required:** no
Weights & Biases project name for organizing fine-tuning experiments.
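For example, a minimal sketch of this configuration (assuming the `sft` object lives under `[provider_types.together.sft]`, with placeholder values):
```toml title="tensorzero.toml" theme={null}
[provider_types.together.sft]
# All fields are optional; the values below are placeholders
wandb_api_key = "my-wandb-api-key"
wandb_project_name = "tensorzero-fine-tuning"
# wandb_base_url = "https://wandb.example.internal"
# hf_api_token = "my-hf-api-token"
```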
##### `defaults.api_key_location`
* **Type:** string or object
* **Required:** no (default: `env::XAI_API_KEY`)
Defines the location of the API key for the xAI provider.
This field can be either a string for a single credential location, or an object with `default` and `fallback` fields for credential fallback support.
The supported locations are `env::ENVIRONMENT_VARIABLE` and `dynamic::ARGUMENT_NAME` (see [the API reference](/gateway/api-reference/inference/#credentials) and [Credential Management](/operations/manage-credentials/#configure-credential-fallbacks) for more details).
```toml title="tensorzero.toml" theme={null}
[provider_types.xai.defaults]
# ...
api_key_location = "dynamic::xai_api_key"
# api_key_location = { default = "dynamic::xai_api_key", fallback = "env::XAI_API_KEY" }
# ...
```
## `[functions.function_name]`
The `[functions.function_name]` section defines the behavior of a function.
You can define multiple functions by including multiple `[functions.function_name]` sections.
A function can have multiple variants, and each variant is defined in the `variants` sub-section (see below).
A function expresses the abstract behavior of an LLM call (e.g. the schemas for the messages), and its variants express concrete instantiations of that LLM call (e.g. specific templates and models).
If your `function_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define `summarize-2.0` as `[functions."summarize-2.0"]`.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
# fieldA = ...
# fieldB = ...
# ...
[functions.summarize-email]
# fieldA = ...
# fieldB = ...
# ...
```
### `assistant_schema`
* **Type:** string (path)
* **Required:** no
Defines the path to the assistant schema file.
The path is relative to the configuration file.
If provided, the assistant schema file should contain a JSON Schema for the assistant messages.
The variables in the schema are used for templating the assistant messages.
If a schema is provided, all function variants must also provide an assistant template (see below).
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
# ...
assistant_schema = "./functions/draft-email/assistant_schema.json"
# ...
[functions.draft-email.variants.prompt-v1]
# ...
assistant_template = "./functions/draft-email/prompt-v1/assistant_template.minijinja"
# ...
```
### `description`
* **Type:** string
* **Required:** no
Defines a description of the function.
In the future, this description will inform automated optimization recipes.
```toml title="tensorzero.toml" theme={null}
[functions.extract_data]
# ...
description = "Extract the sender's name (e.g. 'John Doe'), email address (e.g. 'john.doe@example.com'), and phone number (e.g. '+1234567890') from a customer's email."
# ...
```
### `system_schema`
* **Type:** string (path)
* **Required:** no
Defines the path to the system schema file.
The path is relative to the configuration file.
If provided, the system schema file should contain a JSON Schema for the system message.
The variables in the schema are used for templating the system message.
If a schema is provided, all function variants must also provide a system template (see below).
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
# ...
system_schema = "./functions/draft-email/system_schema.json"
# ...
[functions.draft-email.variants.prompt-v1]
# ...
system_template = "./functions/draft-email/prompt-v1/system_template.minijinja"
# ...
```
### `type`
* **Type:** string
* **Required:** yes
Defines the type of the function.
The supported function types are `chat` and `json`.
Most other fields in the function section depend on the function type.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
# ...
type = "chat"
# ...
```
##### `parallel_tool_calls`
* **Type:** boolean
* **Required:** no
Determines whether the function should be allowed to call multiple tools in a single conversation turn.
If not set, TensorZero will default to the model provider's default behavior.
Most model providers do not support this feature. In those cases, this field will be ignored.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
# ...
type = "chat"
parallel_tool_calls = true
# ...
```
##### `tool_choice`
* **Type:** string
* **Required:** no (default: `auto`)
Determines the tool choice strategy for the function.
The supported tool choice strategies are:
* `none`: The function should not use any tools.
* `auto`: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
* `required`: The model should use a tool. If multiple tools are available, the model decides which tool to use.
* `{ specific = "tool_name" }`: The model should use a specific tool. The tool must be defined in the `tools` field (see below).
```toml mark="run-python" theme={null}
# tensorzero.toml
[functions.solve-math-problem]
# ...
type = "chat"
tool_choice = "auto"
tools = [
# ...
"run-python"
# ...
]
# ...
[tools.run-python]
# ...
```
```toml mark="query-database" theme={null}
# tensorzero.toml
[functions.generate-query]
# ...
type = "chat"
tool_choice = { specific = "query-database" }
tools = [
# ...
"query-database"
# ...
]
# ...
[tools.query-database]
# ...
```
##### `tools`
* **Type:** array of strings
* **Required:** no (default: `[]`)
Determines the tools that the function can use.
The supported tools are defined in `[tools.tool_name]` sections (see below).
```toml mark="query-database" theme={null}
# tensorzero.toml
[functions.draft-email]
# ...
type = "chat"
tools = [
# ...
"query-database"
# ...
]
# ...
[tools.query-database]
# ...
```
##### `output_schema`
* **Type:** string (path)
* **Required:** no (default: `{}`, the empty JSON schema that accepts any valid JSON output)
Defines the path to the output schema file, which should contain a JSON Schema for the output of the function.
The path is relative to the configuration file.
This schema is used for validating the output of the function.
```toml title="tensorzero.toml" theme={null}
[functions.extract-customer-info]
# ...
type = "json"
output_schema = "./functions/extract-customer-info/output_schema.json"
# ...
```
See [Generate structured outputs](/gateway/generate-structured-outputs) for a comprehensive guide with examples.
### `user_schema`
* **Type:** string (path)
* **Required:** no
Defines the path to the user schema file.
The path is relative to the configuration file.
If provided, the user schema file should contain a JSON Schema for the user messages.
The variables in the schema are used for templating the user messages.
If a schema is provided, all function variants must also provide a user template (see below).
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
# ...
user_schema = "./functions/draft-email/user_schema.json"
# ...
[functions.draft-email.variants.prompt-v1]
# ...
user_template = "./functions/draft-email/prompt-v1/user_template.minijinja"
# ...
```
## `[functions.function_name.variants.variant_name]`
The `variants` sub-section defines the behavior of a specific variant of a function.
You can define multiple variants by including multiple `[functions.function_name.variants.variant_name]` sections.
If your `variant_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define `llama-3.1-8b-instruct` as `[functions.function_name.variants."llama-3.1-8b-instruct"]`.
```toml mark="draft-email" theme={null}
# tensorzero.toml
[functions.draft-email]
# ...
[functions.draft-email.variants."llama-3.1-8b-instruct"]
# ...
[functions.draft-email.variants.claude-haiku-4-5]
# ...
```
### `type`
* **Type:** string
* **Required:** yes
Defines the type of the variant.
TensorZero currently supports the following variant types:
| Type | Description |
| :----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `chat_completion` | Uses a chat completion model to generate responses by processing a series of messages in a conversational format. This is typically what you use out of the box with most LLMs. |
| `experimental_best_of_n` | Generates multiple response candidates with other variants, and selects the best one using an evaluator model. |
| `experimental_dynamic_in_context_learning` | Selects similar high-quality examples using an embedding of the input, and incorporates them into the prompt to enhance context and improve response quality. |
| `experimental_mixture_of_n` | Generates multiple response candidates with other variants, and combines the responses using a fuser model. |
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
type = "chat_completion"
# ...
```
##### `assistant_template`
* **Type:** string (path)
* **Required:** no
Defines the path to the assistant template file.
The path is relative to the configuration file.
This file should contain a MiniJinja template for the assistant messages.
If the template uses any variables, the variables should be defined in the function's `assistant_schema` field.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
# ...
assistant_schema = "./functions/draft-email/assistant_schema.json"
# ...
[functions.draft-email.variants.prompt-v1]
# ...
assistant_template = "./functions/draft-email/prompt-v1/assistant_template.minijinja"
# ...
```
##### `extra_body`
* **Type:** array of objects (see below)
* **Required:** no
The `extra_body` field allows you to modify the request body that TensorZero sends to a variant's model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
Each object in the array must have two fields:
* `pointer`: A [JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901) string specifying where to modify the request body
* Use `-` as the final path element to append to an array (e.g., `/messages/-` appends to `messages`)
* One of the following:
* `value`: The value to insert at that location; it can be of any type including nested types
* `delete = true`: Deletes the field at the specified location, if present.
You can also set `extra_body` for a model provider entry.
The model provider `extra_body` entries take priority over variant `extra_body` entries.
Additionally, you can set `extra_body` at inference-time.
The values provided at inference-time take priority over the values in the configuration file.
If TensorZero would normally send this request body to the provider...
```json theme={null}
{
  "project": "tensorzero",
  "safety_checks": {
    "no_internet": false,
    "no_agi": true
  }
}
```
...then the following `extra_body`...
```toml theme={null}
extra_body = [
  { pointer = "/agi", value = true },
  { pointer = "/safety_checks/no_agi", value = { bypass = "on" } }
]
```
...overrides the request body to:
```json theme={null}
{
  "agi": true,
  "project": "tensorzero",
  "safety_checks": {
    "no_internet": false,
    "no_agi": {
      "bypass": "on"
    }
  }
}
```
##### `extra_headers`
* **Type:** array of objects (see below)
* **Required:** no
The `extra_headers` field allows you to set or overwrite the request headers that TensorZero sends to a model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
Each object in the array must have two fields:
* `name` (string): The name of the header to modify (e.g. `anthropic-beta`)
* One of the following:
* `value` (string): The value of the header (e.g. `token-efficient-tools-2025-02-19`)
* `delete = true`: Deletes the header from the request, if present
You can also set `extra_headers` for a model provider entry.
The model provider `extra_headers` entries take priority over variant `extra_headers` entries.
If TensorZero would normally send the following request headers to the provider...
```text theme={null}
Safety-Checks: on
```
...then the following `extra_headers`...
```toml theme={null}
extra_headers = [
  { name = "Safety-Checks", value = "off" },
  { name = "Intelligence-Level", value = "AGI" }
]
```
...overrides the request headers to:
```text theme={null}
Safety-Checks: off
Intelligence-Level: AGI
```
##### `frequency_penalty`
* **Type:** float
* **Required:** no (default: `null`)
Penalizes new tokens based on their frequency in the text so far if positive, encourages them if negative.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
frequency_penalty = 0.2
# ...
```
##### `json_mode`
* **Type:** string
* **Required:** yes for `json` functions, forbidden for `chat` functions
Defines the strategy for generating JSON outputs.
The supported modes are:
* `off`: Make a chat completion request without any special JSON handling (not recommended).
* `on`: Make a chat completion request with JSON mode (if supported by the provider).
* `strict`: Make a chat completion request with strict JSON mode (if supported by the provider). For example, the TensorZero Gateway uses Structured Outputs for OpenAI.
* `tool`: Make a special-purpose tool use request under the hood, and convert the tool call into a JSON response.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
json_mode = "strict"
# ...
```
See [Generate structured outputs](/gateway/generate-structured-outputs) for a comprehensive guide with examples.
##### `max_tokens`
* **Type:** integer
* **Required:** no (default: `null`)
Defines the maximum number of tokens to generate.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
max_tokens = 100
# ...
```
##### `model`
* **Type:** string
* **Required:** yes
The name of the model to call.
| To call... | Use this format... |
| :--- | :--- |
| A model defined as `[models.my_model]` in your `tensorzero.toml` configuration file | `model = "my_model"` |
| A model offered by a model provider, without defining it in your `tensorzero.toml` configuration file (if supported, see below) | `model = "{provider_type}::{model_name}"` |
The following model providers support short-hand model names: `anthropic`, `deepseek`, `fireworks`, `google_ai_studio_gemini`, `gcp_vertex_gemini`, `gcp_vertex_anthropic`, `hyperbolic`, `groq`, `mistral`, `openai`, `openrouter`, `together`, and `xai`.
For example, if you have the following configuration:
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o]
routing = ["openai", "azure"]
[models.gpt-4o.providers.openai]
# ...
[models.gpt-4o.providers.azure]
# ...
```
Then:
* `model = "gpt-4o"` calls the `gpt-4o` model in your configuration, which supports fallback from `openai` to `azure`. See [Retries & Fallbacks](/gateway/guides/retries-fallbacks/) for details.
* `model = "openai::gpt-4o"` calls the OpenAI API directly for the `gpt-4o` model using the Chat Completions API, ignoring the `gpt-4o` model defined above.
* `model = "openai::responses::gpt-5-codex"` calls the OpenAI Responses API directly for the `gpt-5-codex` model. See [OpenAI Responses API](/gateway/call-the-openai-responses-api/) for details.
##### `presence_penalty`
* **Type:** float
* **Required:** no (default: `null`)
Penalizes new tokens that have already appeared in the text so far if positive, encourages them if negative.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
presence_penalty = 0.5
# ...
```
##### `reasoning_effort`
* **Type:** string
* **Required:** no (default: `null`)
Controls the reasoning effort level for reasoning models.
For Gemini, this value corresponds to `generationConfig.thinkingConfig.thinkingLevel`.
Only some model providers support this parameter. TensorZero will warn and ignore it if unsupported.
Some providers (e.g. Anthropic) support `thinking_budget_tokens` instead.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
reasoning_effort = "medium"
# ...
```
##### `retries`
* **Type:** object with optional keys `num_retries` and `max_delay_s`
* **Required:** no (defaults to `num_retries = 0` and `max_delay_s = 10`)
TensorZero's retry strategy is truncated exponential backoff with jitter.
The `num_retries` parameter defines the number of retries (not including the initial request).
The `max_delay_s` parameter defines the maximum delay between retries.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
retries = { num_retries = 3, max_delay_s = 10 }
# ...
```
##### `seed`
* **Type:** integer
* **Required:** no (default: `null`)
Defines the seed to use for the variant.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
seed = 42
```
##### `service_tier`
* **Type:** string
* **Required:** no (default: `"auto"`)
Controls the priority and latency characteristics of inference requests.
The supported values are:
* `auto`: Let the provider automatically select the appropriate service tier (default).
* `default`: Use the provider's standard service tier.
* `priority`: Use a higher-priority service tier with lower latency (may have higher costs).
* `flex`: Use a lower-priority service tier optimized for cost efficiency (may have higher latency).
Only some model providers support this parameter.
TensorZero will warn and ignore it if unsupported.
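For example, the following sketch opts a variant into the cost-optimized tier:
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
service_tier = "flex"
# ...
```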
##### `stop_sequences`
* **Type:** array of strings
* **Required:** no (default: `null`)
Defines a list of sequences where the model will stop generating further tokens.
When the model encounters any of these sequences in its output, it will immediately stop generation.
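For example, the following sketch stops generation at either of two illustrative sequences:
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
stop_sequences = ["\n\n", "END"]
# ...
```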
##### `system_template`
* **Type:** string (path)
* **Required:** no
Defines the path to the system template file.
The path is relative to the configuration file.
This file should contain a MiniJinja template for the system messages.
If the template uses any variables, the variables should be defined in the function's `system_schema` field.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
# ...
system_schema = "./functions/draft-email/system_schema.json"
# ...
[functions.draft-email.variants.prompt-v1]
# ...
system_template = "./functions/draft-email/prompt-v1/system_template.minijinja"
# ...
```
##### `temperature`
* **Type:** float
* **Required:** no (default: `null`)
Defines the temperature to use for the variant.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
temperature = 0.5
# ...
```
##### `thinking_budget_tokens`
* **Type:** integer
* **Required:** no (default: `null`)
Controls the thinking budget in tokens for reasoning models.
For Anthropic, this value corresponds to `thinking.budget_tokens`.
For Gemini, this value corresponds to `generationConfig.thinkingConfig.thinkingBudget`.
Only some model providers support this parameter. TensorZero will warn and ignore it if unsupported.
Some providers (e.g. OpenAI) support `reasoning_effort` instead.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
thinking_budget_tokens = 10000
# ...
```
##### `timeouts`
* **Type:** object
* **Required:** no
The `timeouts` object allows you to set granular timeouts for requests using this variant.
You can define timeouts for non-streaming and streaming requests separately: `timeouts.non_streaming.total_ms` corresponds to the total request duration and `timeouts.streaming.ttft_ms` corresponds to the time to first token (TTFT).
For example, the following configuration sets a 15-second timeout for non-streaming requests and a 3-second timeout for streaming requests (TTFT):
```toml theme={null}
[functions.function_name.variants.variant_name]
# ...
timeouts = { non_streaming.total_ms = 15000, streaming.ttft_ms = 3000 }
# ...
```
The specified timeouts apply to the scope of an entire variant inference request, including all retries and fallbacks across its model's providers.
You can also set timeouts at the model level and provider level.
Multiple timeouts can be active simultaneously.
##### `top_p`
* **Type:** float, between 0 and 1
* **Required:** no (default: `null`)
Defines the `top_p` to use for the variant during [nucleus sampling](https://en.wikipedia.org/wiki/Top-p_sampling).
Typically at most one of `top_p` and `temperature` is set.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
top_p = 0.3
# ...
```
##### `verbosity`
* **Type:** string
* **Required:** no (default: `null`)
Controls the verbosity level of model outputs.
Only some model providers support this parameter. TensorZero will warn and ignore it if unsupported.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
verbosity = "low"
# ...
```
##### `user_template`
* **Type:** string (path)
* **Required:** no
Defines the path to the user template file.
The path is relative to the configuration file.
This file should contain a MiniJinja template for the user messages.
If the template uses any variables, the variables should be defined in the function's `user_schema` field.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
# ...
user_schema = "./functions/draft-email/user_schema.json"
# ...
[functions.draft-email.variants.prompt-v1]
# ...
user_template = "./functions/draft-email/prompt-v1/user_template.minijinja"
# ...
```
##### `candidates`
* **Type:** list of strings
* **Required:** yes
This inference strategy generates N candidate responses, and an evaluator model selects the best one.
This approach allows you to leverage multiple prompts or variants to increase the likelihood of getting a high-quality response.
The `candidates` parameter specifies a list of variant names used to generate candidate responses.
For example, if you have two variants defined (`promptA` and `promptB`), you could set up the `candidates` list to generate two responses with `promptA` and one with `promptB`, as in the snippet below.
The evaluator would then choose the best response from these three candidates.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.promptA]
type = "chat_completion"
# ...
[functions.draft-email.variants.promptB]
type = "chat_completion"
# ...
[functions.draft-email.variants.best-of-n]
type = "experimental_best_of_n"
candidates = ["promptA", "promptA", "promptB"] # 3 candidate generations
# ...
```
##### `evaluator`
* **Type:** object
* **Required:** yes
The `evaluator` parameter specifies the configuration for the model that will evaluate and select the best response from the generated candidates.
The evaluator is configured similarly to a `chat_completion` variant for a JSON function, but without the `type` field.
The prompts here should be the prompts you would use to solve the original problem; the gateway applies special-purpose handling and templates to adapt them for the evaluator.
The evaluator can optionally include a `json_mode` parameter (see the `json_mode` documentation under `chat_completion` variants). If not specified, it defaults to `strict`.
```toml theme={null}
[functions.draft-email.variants.best-of-n]
type = "experimental_best_of_n"
# ...
[functions.draft-email.variants.best-of-n.evaluator]
# Same fields as a `chat_completion` variant (excl. `type`), e.g.:
# user_template = "functions/draft-email/best-of-n/user.minijinja"
# ...
```
##### `timeouts`
* **Type:** object
* **Required:** no
The `timeouts` object allows you to set granular timeouts for requests using this variant.
You can define timeouts for non-streaming and streaming requests separately: `timeouts.non_streaming.total_ms` corresponds to the total request duration and `timeouts.streaming.ttft_ms` corresponds to the time to first token (TTFT).
For example, the following configuration sets a 15-second timeout for non-streaming requests and a 3-second timeout for streaming requests (TTFT):
```toml theme={null}
[functions.function_name.variants.variant_name]
# ...
timeouts = { non_streaming.total_ms = 15000, streaming.ttft_ms = 3000 }
# ...
```
The specified timeouts apply to the scope of an entire variant inference request, including all inference requests to candidates and the evaluator.
You can also set timeouts at the model level and provider level.
Multiple timeouts can be active simultaneously.
##### `candidates`
* **Type:** list of strings
* **Required:** yes
This inference strategy generates N candidate responses, and a fuser model combines them to produce a final answer.
This approach allows you to leverage multiple prompts or variants to increase the likelihood of getting a high-quality response.
The `candidates` parameter specifies a list of variant names used to generate candidate responses.
For example, if you have two variants defined (`promptA` and `promptB`), you could set up the `candidates` list to generate two responses with `promptA` and one with `promptB`, as in the snippet below.
The fuser would then combine the three responses.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.promptA]
type = "chat_completion"
# ...
[functions.draft-email.variants.promptB]
type = "chat_completion"
# ...
[functions.draft-email.variants.mixture-of-n]
type = "experimental_mixture_of_n"
candidates = ["promptA", "promptA", "promptB"] # 3 candidate generations
# ...
```
##### `fuser`
* **Type:** object
* **Required:** yes
The `fuser` parameter specifies the configuration for the model that will evaluate and combine the elements.
The fuser is configured similarly to a `chat_completion` variant, but without the `type` field.
The prompts here should be the prompts you would use to solve the original problem; the gateway applies special-purpose handling and templates to adapt them for the fuser.
```toml theme={null}
[functions.draft-email.variants.mixture-of-n]
type = "experimental_mixture_of_n"
# ...
[functions.draft-email.variants.mixture-of-n.fuser]
# Same fields as a `chat_completion` variant (excl. `type`), e.g.:
# user_template = "functions/draft-email/mixture-of-n/user.minijinja"
# ...
```
##### `timeouts`
* **Type:** object
* **Required:** no
The `timeouts` object allows you to set granular timeouts for requests using this variant.
You can define timeouts for non-streaming and streaming requests separately: `timeouts.non_streaming.total_ms` corresponds to the total request duration and `timeouts.streaming.ttft_ms` corresponds to the time to first token (TTFT).
For example, the following configuration sets a 15-second timeout for non-streaming requests and a 3-second timeout for streaming requests (TTFT):
```toml theme={null}
[functions.function_name.variants.variant_name]
# ...
timeouts = { non_streaming.total_ms = 15000, streaming.ttft_ms = 3000 }
# ...
```
The specified timeouts apply to the scope of an entire variant inference request, including all inference requests to candidates and the fuser.
You can also set timeouts at the model level and provider level.
Multiple timeouts can be active simultaneously.
##### `embedding_model`
* **Type:** string
* **Required:** yes
The name of the embedding model to call.
| To call... | Use this format... |
| :--- | :--- |
| An embedding model defined as `[embedding_models.my_model]` in your `tensorzero.toml` configuration file | `embedding_model = "my_model"` |
| An embedding model offered by a model provider, without defining it in your `tensorzero.toml` configuration file (if supported, see below) | `embedding_model = "{provider_type}::{model_name}"` |
The following model providers support short-hand model names: `anthropic`, `deepseek`, `fireworks`, `google_ai_studio_gemini`, `gcp_vertex_gemini`, `gcp_vertex_anthropic`, `hyperbolic`, `groq`, `mistral`, `openai`, `openrouter`, `together`, and `xai`.
For example, if you have the following configuration:
```toml title="tensorzero.toml" theme={null}
[embedding_models.text-embedding-3-small]
#...
[embedding_models.text-embedding-3-small.providers.openai]
# ...
[embedding_models.text-embedding-3-small.providers.azure]
# ...
```
Then:
* `embedding_model = "text-embedding-3-small"` calls the `text-embedding-3-small` model in your configuration.
* `embedding_model = "openai::text-embedding-3-small"` calls the OpenAI API directly for the `text-embedding-3-small` model, ignoring the `text-embedding-3-small` model defined above.
##### `extra_body`
* **Type:** array of objects (see below)
* **Required:** no
The `extra_body` field allows you to modify the request body that TensorZero sends to a variant's model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
For `experimental_dynamic_in_context_learning` variants, `extra_body` only applies to the chat completion request.
Each object in the array must have two fields:
* `pointer`: A [JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901) string specifying where to modify the request body
* Use `-` as the final path element to append to an array (e.g., `/messages/-` appends to `messages`)
* One of the following:
* `value`: The value to insert at that location; it can be of any type including nested types
* `delete = true`: Deletes the field at the specified location, if present.
You can also set `extra_body` for a model provider entry.
The model provider `extra_body` entries take priority over variant `extra_body` entries.
Additionally, you can set `extra_body` at inference-time.
The values provided at inference-time take priority over the values in the configuration file.
If TensorZero would normally send this request body to the provider...
```json theme={null}
{
  "project": "tensorzero",
  "safety_checks": {
    "no_internet": false,
    "no_agi": true
  }
}
```
...then the following `extra_body`...
```toml theme={null}
extra_body = [
  { pointer = "/agi", value = true },
  { pointer = "/safety_checks/no_agi", value = { bypass = "on" } }
]
```
...overrides the request body to:
```json theme={null}
{
  "agi": true,
  "project": "tensorzero",
  "safety_checks": {
    "no_internet": false,
    "no_agi": {
      "bypass": "on"
    }
  }
}
```
##### `extra_headers`
* **Type:** array of objects (see below)
* **Required:** no
The `extra_headers` field allows you to set or overwrite the request headers that TensorZero sends to a model provider.
This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.
Each object in the array must have two fields:
* `name` (string): The name of the header to modify (e.g. `anthropic-beta`)
* One of the following:
* `value` (string): The value of the header (e.g. `token-efficient-tools-2025-02-19`)
* `delete = true`: Deletes the header from the request, if present
You can also set `extra_headers` for a model provider entry.
The model provider `extra_headers` entries take priority over variant `extra_headers` entries.
If TensorZero would normally send the following request headers to the provider...
```text theme={null}
Safety-Checks: on
```
...then the following `extra_headers`...
```toml theme={null}
extra_headers = [
  { name = "Safety-Checks", value = "off" },
  { name = "Intelligence-Level", value = "AGI" }
]
```
...overrides the request headers to:
```text theme={null}
Safety-Checks: off
Intelligence-Level: AGI
```
##### `json_mode`
* **Type:** string
* **Required:** yes for `json` functions, forbidden for `chat` functions
Defines the strategy for generating JSON outputs.
The supported modes are:
* `off`: Make a chat completion request without any special JSON handling (not recommended).
* `on`: Make a chat completion request with JSON mode (if supported by the provider).
* `strict`: Make a chat completion request with strict JSON mode (if supported by the provider). For example, the TensorZero Gateway uses Structured Outputs for OpenAI.
* `tool`: Make a special-purpose tool use request under the hood, and convert the tool call into a JSON response.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
json_mode = "strict"
# ...
```
##### `k`
* **Type:** non-negative integer
* **Required:** yes
Defines the number of examples to retrieve for the inference.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.dicl]
# ...
k = 10
# ...
```
##### `max_distance`
* **Type:** non-negative float
* **Required:** no (default: none)
Filters retrieved examples based on their cosine distance from the input embedding.
Only examples with a cosine distance less than or equal to the specified threshold are included in the prompt.
If all examples are filtered out due to this threshold, the variant falls back to default chat completion behavior.
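For example, the following sketch only includes retrieved examples within an illustrative cosine distance of 0.5 (tune this threshold for your embeddings):
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.dicl]
# ...
max_distance = 0.5
# ...
```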
##### `max_tokens`
* **Type:** integer
* **Required:** no (default: `null`)
Defines the maximum number of tokens to generate.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
max_tokens = 100
# ...
```
##### `model`
* **Type:** string
* **Required:** yes
The name of the model to call.
| To call...                                                                                                                      | Use this format...                        |
| :------------------------------------------------------------------------------------------------------------------------------ | :---------------------------------------- |
| A model defined as `[models.my_model]` in your `tensorzero.toml` configuration file                                              | `model = "my_model"`                      |
| A model offered by a model provider, without defining it in your `tensorzero.toml` configuration file (if supported, see below)  | `model = "{provider_type}::{model_name}"` |
The following model providers support short-hand model names: `anthropic`, `deepseek`, `fireworks`, `google_ai_studio_gemini`, `gcp_vertex_gemini`, `gcp_vertex_anthropic`, `hyperbolic`, `groq`, `mistral`, `openai`, `openrouter`, `together`, and `xai`.
For example, if you have the following configuration:
```toml title="tensorzero.toml" theme={null}
[models.gpt-4o]
routing = ["openai", "azure"]
[models.gpt-4o.providers.openai]
# ...
[models.gpt-4o.providers.azure]
# ...
```
Then:
* `model = "gpt-4o"` calls the `gpt-4o` model in your configuration, which supports fallback from `openai` to `azure`. See [Retries & Fallbacks](/gateway/guides/retries-fallbacks/) for details.
* `model = "openai::gpt-4o"` calls the OpenAI API directly for the `gpt-4o` model using the Chat Completions API, ignoring the `gpt-4o` model defined above.
* `model = "openai::responses::gpt-5-codex"` calls the OpenAI Responses API directly for the `gpt-5-codex` model. See [OpenAI Responses API](/gateway/call-the-openai-responses-api/) for details.
##### `retries`
* **Type:** object with optional keys `num_retries` and `max_delay_s`
* **Required:** no (defaults to `num_retries = 0` and `max_delay_s = 10`)
TensorZero's retry strategy is truncated exponential backoff with jitter.
The `num_retries` parameter defines the number of retries (not including the initial request).
The `max_delay_s` parameter defines the maximum delay between retries.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
retries = { num_retries = 3, max_delay_s = 10 }
# ...
```
##### `seed`
* **Type:** integer
* **Required:** no (default: `null`)
Defines the seed to use for the variant.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
seed = 42
```
##### `system_instructions`
* **Type:** string (path)
* **Required:** no
Defines the path to the system instructions file.
The path is relative to the configuration file.
The system instructions are provided as a text file whose contents are added to the variant's system prompt.
Unlike `system_template`, it doesn't support variables.
This file contains static instructions that define the behavior and role of the AI assistant for the specific function variant.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.dicl]
# ...
system_instructions = "./functions/draft-email/prompt-v1/system_template.txt"
# ...
```
##### `temperature`
* **Type:** float
* **Required:** no (default: `null`)
Defines the temperature to use for the variant.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.variants.prompt-v1]
# ...
temperature = 0.5
# ...
```
##### `timeouts`
* **Type:** object
* **Required:** no
The `timeouts` object allows you to set granular timeouts for requests using this variant.
You can define timeouts for non-streaming and streaming requests separately: `timeouts.non_streaming.total_ms` corresponds to the total request duration and `timeouts.streaming.ttft_ms` corresponds to the time to first token (TTFT).
For example, the following configuration sets a 15-second timeout for non-streaming requests and a 3-second timeout for streaming requests (TTFT):
```toml theme={null}
[functions.function_name.variants.variant_name]
# ...
timeouts = { non_streaming.total_ms = 15000, streaming.ttft_ms = 3000 }
# ...
```
The specified timeouts apply to the entire variant inference request, including inference requests to both the embedding model and the generation model.
You can also set timeouts at the model level and provider level.
Multiple timeouts can be active simultaneously.
## `[functions.function_name.experimentation]`
This section configures experimentation (A/B testing) over a set of variants in a function.
At inference time, the gateway will sample a variant from the function to complete the request.
By default, the gateway will sample a variant uniformly at random (`type = "uniform"`).
TensorZero supports multiple types of experiments that can help you learn about the relative performance of the variants.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
# fieldA = ...
# fieldB = ...
# ...
```
### `type`
* **Type:** string
* **Required:** yes
Determines the experiment type.
TensorZero currently supports the following experiment types:
| Type | Description |
| :--------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `uniform` | Samples variants uniformly at random. For example, if there are three candidate variants, each will be sampled with probability `1/3`. |
| `static_weights` | Samples variants according to user-specified weights. Weights must be nonnegative and are normalized to sum to 1. See the `candidate_variants` documentation below for how to specify weights. |
| `track_and_stop` | Samples variants according to probabilities that dynamically update based on accumulating feedback data. Designed to maximize experiment efficiency by minimizing the number of inferences needed to identify the best variant. |
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
# ...
type = "track_and_stop"
# ...
```
The `uniform` type samples variants uniformly at random.
This is the default behavior when no `[functions.function_name.experimentation]` section is specified.
By default, all variants defined in the function are sampled with equal probability.
You can optionally specify `candidate_variants` to sample uniformly from a subset of variants, and `fallback_variants` for sequential fallback behavior.
The behavior depends on which fields are specified:
| Configuration | Behavior |
| :------------------------ | :-------------------------------------------------------------------------- |
| No fields specified | Samples uniformly from all variants in the function |
| Only `candidate_variants` | Samples uniformly from specified candidates |
| Only `fallback_variants` | Uses fallback variants sequentially (no uniform sampling) |
| Both specified | Samples uniformly from candidates; if all fail, uses fallbacks sequentially |
### `candidate_variants`
* **Type:** array of strings
* **Required:** no
An optional set of variants to sample uniformly from.
Each variant must be defined via `[functions.function_name.variants.variant_name]` in the `variants` sub-section.
If not specified (and `fallback_variants` is also not specified), all variants are sampled uniformly.
If `fallback_variants` is specified but `candidate_variants` is not, no candidates are used (fallback-only mode).
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
type = "uniform"
candidate_variants = ["variant-a", "variant-b"]
```
### `fallback_variants`
* **Type:** array of strings
* **Required:** no
An optional set of function variants to use as fallback options.
Each variant must be defined via `[functions.function_name.variants.variant_name]` in the `variants` sub-section.
If all candidate variants fail during inference, the gateway will select variants sequentially from `fallback_variants` (in order, not uniformly).
This behaves like a ranked list where the first active fallback variant is always selected.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
type = "uniform"
candidate_variants = ["variant-a", "variant-b"]
fallback_variants = ["fallback-variant"]
```
### Examples
**Default uniform sampling (all variants):**
```toml title="tensorzero.toml" theme={null}
[functions.draft-email]
type = "chat"
[functions.draft-email.variants.variant-a] # 1/3 chance
# ...
[functions.draft-email.variants.variant-b] # 1/3 chance
# ...
[functions.draft-email.variants.variant-c] # 1/3 chance
# ...
```
**Explicit candidate variants:**
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
type = "uniform"
candidate_variants = ["variant-a", "variant-b"] # each has 1/2 probability
# `variant-c` will not be sampled
```
**With fallback variants:**
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
type = "uniform"
candidate_variants = ["variant-a", "variant-b"] # try these first, uniformly
fallback_variants = ["variant-c"] # use if both candidates fail
```
**Fallback-only mode:**
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
type = "uniform"
fallback_variants = ["variant-a", "variant-b", "variant-c"] # sequential
```
The `static_weights` type samples variants according to user-specified weights.
This allows you to control the distribution of traffic across variants with fixed probabilities.
### `candidate_variants`
* **Type:** map of strings to floats
* **Required:** yes
A map from variant names to their sampling weights.
Each variant must be defined via `[functions.function_name.variants.variant_name]` in the `variants` sub-section.
Weights must be non-negative.
The gateway automatically normalizes the weights to sum to 1.0.
For example, weights of `{"variant-a" = 5.0, "variant-b" = 1.0}` result in sampling probabilities of `5/6` and `1/6` respectively.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
type = "static_weights"
candidate_variants = {"prompt-v1" = 5.0, "prompt-v2" = 1.0}
# ...
```
### `fallback_variants`
* **Type:** array of strings
* **Required:** no
An optional set of function variants to use as fallback options.
Each variant must be defined via `[functions.function_name.variants.variant_name]` in the `variants` sub-section.
If all candidate variants fail during inference, or if the total weight of active candidate variants is zero, the gateway will sample uniformly at random from `fallback_variants`.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
type = "static_weights"
candidate_variants = {"prompt-v1" = 2.0, "prompt-v2" = 1.0, "prompt-v3" = 0.5}
fallback_variants = ["fallback-prompt-a", "fallback-prompt-b"]
```
The `track_and_stop` type samples variants according to probabilities that dynamically update based on accumulating feedback data, and stops sampling once it identifies a winning variant.
### `candidate_variants`
* **Type:** array of strings
* **Required:** yes
The set of function variants to include in the experiment.
Each variant must be defined via `[functions.function_name.variants.variant_name]` in the `variants` sub-section (see above).
Variants that are not included in `candidate_variants` will not be sampled.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
# ...
candidate_variants = ["prompt-v1", "prompt-v2", "prompt-v3"]
# ...
```
### `delta`
* **Type:** float
* **Required:** no (default: 0.05)
This field is for advanced users. The default value is sensible for most use cases.
The error tolerance.
The value of `delta` must be a probability in the `(0, 1)` range.
In simple terms, `delta` is the probability that the algorithm will incorrectly identify a variant as the winner.
A commonly used value in experimentation settings is `0.05`, which caps the probability that an epsilon-best variant is not chosen as the winner at 5%.
The `track_and_stop` algorithm aims to identify a "winner" variant that has the best average value for the chosen metric, or nearly the best (where "best" means highest if `optimize = "max"` or lowest if `optimize = "min"` for the chosen metric, and "nearly" is determined by a tolerance `epsilon`, defined below).
Once this variant is identified, random sampling ceases and the winner variant is used exclusively going forward.
The value of `delta` controls a trade-off between the speed of identification and confidence in the identified variant.
The smaller the value of `delta`, the higher the chance that the algorithm will correctly identify an epsilon-best variant, and the more data required to do so.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
# ...
delta = 0.05
# ...
```
### `epsilon`
* **Type:** float
* **Required:** no (default: 0.0)
This field is for advanced users. The default value is sensible for most use cases.
The sub-optimality tolerance.
The value must be nonnegative.
The `track_and_stop` algorithm aims to identify a "winner" variant whose average metric value is either the highest, or within epsilon of the highest.
Larger values of `epsilon` allow the algorithm to label a winner more quickly.
As an example, consider an experiment over three function variants with underlying (unknown) mean metric values of `[0.6, 0.8, 0.85]` for a metric with `optimize = "max"`.
If `delta = 0.05` and `epsilon = 0.05`, then the algorithm will label either the second or third variant as the winner with probability at least `1 - delta = 95%`.
If `delta = 0.05` and `epsilon = 0`, then the experiment will run longer and the algorithm will label the third variant as the winner with probability at least `95%`.
If `delta = 0.01` and `epsilon = 0`, then the experiment will run for even longer, and the algorithm will label the third variant as the winner with probability at least 99%.
It is always possible to set `epsilon = 0` to insist on identifying the strictly best variant with high probability.
Reasonable nonzero values of `epsilon` depend on the scale of the chosen metric.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
# ...
epsilon = 0.03
# ...
```
### `fallback_variants`
* **Type:** array of string
* **Required:** no
An optional set of function variants to use as fallback options.
Each variant must be defined via `[functions.function_name.variants.variant_name]` in the `variants` sub-section (see above).
If inference fails with all of the `candidate_variants`, then variants will be sampled uniformly at random from `fallback_variants`.
Feedback for these variants will not be used in the experiment itself; for example, if the experiment type is `track_and_stop`, the sampling probabilities will be dynamically updated based only on feedback for the `candidate_variants`.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
candidate_variants = ["prompt-v1", "prompt-v2", "prompt-v3"]
fallback_variants = ["fallback-prompt-a", "fallback-prompt-b"]
# ...
```
### `metric`
* **Type:** string
* **Required:** yes
The metric that should be tracked during the experiment.
The metric is used to dynamically update the sampling probabilities for the variants in a way that is designed to quickly identify high performing variants.
This must be one of the metrics defined in the `[metrics]` section.
`track_and_stop` can handle both inference-level and episode-level metrics.
Plots based on the chosen metric are displayed in the `Experimentation` section of the `Functions` tab in the TensorZero UI.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
# ...
metric = "task-completed"
# ...
```
### `min_prob`
* **Type:** float
* **Required:** no (default: `0`)
This field is for advanced users. The default value is sensible for most use cases.
The minimum sampling probability for each candidate variant.
The value must be nonnegative.
Note that `min_prob` times the number of `candidate_variants` must not exceed 1.0, since the minimum probabilities for all candidate variants must sum to at most 1.0.
A `track_and_stop` experiment aims to identify an epsilon-best variant without necessarily differentiating among sub-optimal variants, so the primary use for this field is to ensure that sufficient data is gathered to learn about the performance of sub-optimal variants.
Note that this field has no effect once `track_and_stop` picks a winner variant, since at that point random sampling ceases and the winner variant is used exclusively.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
# ...
min_prob = 0.05
# ...
```
### `min_samples_per_variant`
* **Type:** integer
* **Required:** no (default: 10)
This field is for advanced users. The default value is sensible for most use cases.
The minimum number of samples per variant required before random sampling begins.
The value must be greater than or equal to 1.
Sampling from the `candidate_variants` will proceed round-robin (deterministically) until each variant has at least `min_samples_per_variant` feedback data points, at which point random sampling will begin.
It is strongly recommended to set this value to at least 10 so that the feedback sample statistics can stabilize before they are used to guide the sampling probabilities.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
# ...
min_samples_per_variant = 10
# ...
```
### `update_period_s`
* **Type:** integer
* **Required:** no (default: 300)
This field is for advanced users. The default value is sensible for most use cases.
The frequency, in seconds, with which sampling probabilities are updated.
Lower values will lead to faster experiment convergence but will consume more computational resources.
Updating the sampling probabilities requires reading the latest feedback data from ClickHouse.
This is accomplished by a background task that interacts with the gateway instance.
More frequent updates (smaller values of `update_period_s`) relative to the feedback throughput enable the algorithm to more quickly guide the sampling probabilities toward their theoretical optimum, which allows it to more quickly label the "winner" variant.
For example, updating the sampling probabilities every ~100 inferences should lead to faster convergence than updating them every ~500 inferences.
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
# ...
update_period_s = 300
# ...
```
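Putting the `track_and_stop` fields together, a complete sketch might look like the following; the numeric values shown are the documented defaults, and the variant and metric names are reused from the snippets above:
```toml title="tensorzero.toml" theme={null}
[functions.draft-email.experimentation]
type = "track_and_stop"
candidate_variants = ["prompt-v1", "prompt-v2", "prompt-v3"]
fallback_variants = ["fallback-prompt-a", "fallback-prompt-b"]
metric = "task-completed"     # must be defined in the [metrics] section
delta = 0.05                  # error tolerance
epsilon = 0.0                 # sub-optimality tolerance
min_samples_per_variant = 10  # round-robin warm-up before random sampling begins
update_period_s = 300         # how often sampling probabilities are refreshed
```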
## `[metrics]`
The `[metrics]` section defines the behavior of a metric.
You can define multiple metrics by including multiple `[metrics.metric_name]` sections.
The metric name can't be `comment` or `demonstration`, as those names are reserved for internal use.
If your `metric_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define `beats-gpt-4.1` as `[metrics."beats-gpt-4.1"]`.
```toml title="tensorzero.toml" theme={null}
[metrics.task-completed]
# fieldA = ...
# fieldB = ...
# ...
[metrics.user-rating]
# fieldA = ...
# fieldB = ...
# ...
```
### `level`
* **Type:** string
* **Required:** yes
Defines whether the metric applies to individual inferences or to entire episodes.
The supported levels are `inference` and `episode`.
```toml title="tensorzero.toml" theme={null}
[metrics.valid-output]
# ...
level = "inference"
# ...
[metrics.task-completed]
# ...
level = "episode"
# ...
```
### `optimize`
* **Type:** string
* **Required:** yes
Defines whether the metric should be maximized or minimized.
The supported values are `max` and `min`.
```toml title="tensorzero.toml" theme={null}
[metrics.mistakes-made]
# ...
optimize = "min"
# ...
[metrics.user-rating]
# ...
optimize = "max"
# ...
```
### `type`
* **Type:** string
* **Required:** yes
Defines the type of the metric.
The supported metric types are `boolean` and `float`.
```toml title="tensorzero.toml" theme={null}
[metrics.user-rating]
# ...
type = "float"
# ...
[metrics.task-completed]
# ...
type = "boolean"
# ...
```
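Putting the metric fields together, the metrics from the snippets above could be defined as follows; the `optimize` value for `task-completed` and the `level` value for `user-rating` are illustrative assumptions:
```toml title="tensorzero.toml" theme={null}
[metrics.task-completed]
type = "boolean"
optimize = "max"    # illustrative assumption
level = "episode"
[metrics.user-rating]
type = "float"
optimize = "max"
level = "inference" # illustrative assumption
```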
## `[tools.tool_name]`
The `[tools.tool_name]` section defines the behavior of a tool.
You can define multiple tools by including multiple `[tools.tool_name]` sections.
If your `tool_name` is not a basic string, it can be escaped with quotation marks.
For example, periods are not allowed in basic strings, so you can define `run-python-3.10` as `[tools."run-python-3.10"]`.
You can enable a tool for a function by adding it to the function's `tools` field.
```toml title="tensorzero.toml" mark="get-temperature" theme={null}
[functions.weather-chatbot]
# ...
type = "chat"
tools = [
# ...
"get-temperature"
# ...
]
# ...
[tools.get-temperature]
# ...
```
### `description`
* **Type:** string
* **Required:** yes
Defines the description of the tool provided to the model.
You can typically materially improve the quality of responses by providing a detailed description of the tool.
```toml title="tensorzero.toml" theme={null}
[tools.get-temperature]
# ...
description = "Get the current temperature in a given location (e.g. \"Tokyo\") using the specified unit (must be \"celsius\" or \"fahrenheit\")."
# ...
```
### `parameters`
* **Type:** string (path)
* **Required:** yes
Defines the path to the parameters file.
The path is relative to the configuration file.
This file should contain a JSON Schema for the parameters of the tool.
```toml title="tensorzero.toml" theme={null}
[tools.get-temperature]
# ...
parameters = "./tools/get-temperature.json"
# ...
```
### `strict`
* **Type:** boolean
* **Required:** no (default: `false`)
If set to `true`, the TensorZero Gateway attempts to use strict JSON generation for the tool parameters.
This typically improves the quality of responses.
Only a few providers support strict JSON generation.
For example, the TensorZero Gateway uses Structured Outputs for OpenAI.
If the provider does not support strict mode, the TensorZero Gateway ignores this field.
```toml title="tensorzero.toml" theme={null}
[tools.get-temperature]
# ...
strict = true
# ...
```
### `name`
* **Type:** string
* **Required:** no (defaults to the tool ID)
Defines the tool name to be sent to model providers.
By default, TensorZero will use the tool ID in the configuration as the tool name sent to model providers.
For example, if you define a tool as `[tools.my_tool]` but don't specify the `name`, the name will be `my_tool`.
This field allows you to specify a different name to be sent.
This field is particularly useful if you want to define multiple tools that share the same name (e.g. for different functions).
At inference time, the gateway ensures that an inference request doesn't have multiple tools with the same name.
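For example, here is a sketch of two tools that share the provider-facing name `get_temperature` but are defined under different IDs; the tool IDs, descriptions, and paths below are hypothetical:
```toml title="tensorzero.toml" theme={null}
[tools.get-temperature-indoor]
name = "get_temperature"
description = "Get the current indoor temperature."                       # hypothetical
parameters = "./tools/get-temperature-indoor.json"                        # hypothetical
[tools.get-temperature-outdoor]
name = "get_temperature"
description = "Get the current outdoor temperature in a given location."  # hypothetical
parameters = "./tools/get-temperature-outdoor.json"                       # hypothetical
```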
## `[object_storage]`
The `[object_storage]` section defines the behavior of object storage, which is used for storing images used during multimodal inference.
### `type`
* **Type:** string
* **Required:** yes
Defines the type of object storage to use.
The supported types are:
* `s3_compatible`: Use an S3-compatible object storage service.
* `filesystem`: Store images in a local directory.
* `disabled`: Disable object storage.
See the following sections for more details on each type.
If you set `type = "s3_compatible"`, TensorZero will use an S3-compatible object storage service to store and retrieve images.
The TensorZero Gateway will attempt to retrieve credentials from the following resources in order of priority:
1. `S3_ACCESS_KEY_ID` and `S3_SECRET_ACCESS_KEY` environment variables
2. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables
3. Credentials from the AWS SDK (default profile)
If you set `type = "s3_compatible"`, the following fields are available.
##### `endpoint`
* **Type:** string
* **Required:** no (defaults to AWS S3)
Defines the endpoint of the object storage service.
You can use this field to specify a custom endpoint for the object storage service (e.g. GCP Cloud Storage, Cloudflare R2, and many more).
##### `bucket_name`
* **Type:** string
* **Required:** no
Defines the name of the bucket to use for object storage.
You should provide a bucket name unless it's specified in the `endpoint` field.
##### `region`
* **Type:** string
* **Required:** no
Defines the region of the object storage service (if applicable).
This is required for some providers (e.g. AWS S3).
If the provider does not require a region, this field can be omitted.
##### `allow_http`
* **Type:** boolean
* **Required:** no (defaults to `false`)
Normally, the TensorZero Gateway will require HTTPS to access the object storage service.
If set to `true`, the TensorZero Gateway will instead use HTTP to access the object storage service.
This is useful for local development (e.g. a local MinIO deployment), but not recommended for production environments.
For production environments, we strongly recommend you disable the `allow_http` setting and use a secure method of authentication in combination with a production-grade object storage service.
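As an illustrative sketch, an S3-compatible configuration for a local MinIO deployment might look like this; the endpoint and bucket name are hypothetical:
```toml title="tensorzero.toml" theme={null}
[object_storage]
type = "s3_compatible"
endpoint = "http://localhost:9000"  # hypothetical local MinIO endpoint
bucket_name = "tensorzero-images"   # hypothetical bucket name
allow_http = true                   # only for local development
# Credentials are read from S3_ACCESS_KEY_ID / S3_SECRET_ACCESS_KEY,
# then AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, then the AWS SDK.
```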
If you set `type = "filesystem"`, the following field is available.
##### `path`
* **Type:** string
* **Required:** yes
Defines the path to the directory to use for object storage.
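A minimal sketch of the `filesystem` type; the directory path is illustrative:
```toml title="tensorzero.toml" theme={null}
[object_storage]
type = "filesystem"
path = "/var/lib/tensorzero/object-storage"  # illustrative local directory
```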
If you set `type = "disabled"`, the TensorZero Gateway will not store or retrieve images.
There are no additional fields available for this type.
## `[postgres]`
The `[postgres]` section defines the configuration for PostgreSQL connectivity.
PostgreSQL is required for certain TensorZero features including [adaptive experimentation](/experimentation/run-adaptive-ab-tests/) and [authentication](/operations/set-up-auth-for-tensorzero).
PostgreSQL can also be used for [rate limiting](/operations/enforce-custom-rate-limits/), though [Valkey](/deployment/valkey-redis) is recommended for high-throughput deployments.
You can connect to PostgreSQL by setting the `TENSORZERO_POSTGRES_URL` environment variable.
### `connection_pool_size`
* **Type:** integer
* **Required:** no (default: `20`)
Defines the maximum number of connections in the PostgreSQL connection pool.
### `enabled`
* **Type:** boolean
* **Required:** no (default: `null`)
Enable PostgreSQL connectivity.
If `true`, the gateway will throw an error on startup if it fails to connect to PostgreSQL (requires `TENSORZERO_POSTGRES_URL` environment variable).
If `false`, the gateway will not use PostgreSQL even if the `TENSORZERO_POSTGRES_URL` environment variable is set.
If omitted, the gateway will connect to PostgreSQL if the `TENSORZERO_POSTGRES_URL` environment variable is set, otherwise it will disable PostgreSQL with a warning.
If you have features that require PostgreSQL (e.g. adaptive experimentation, authentication) configured but set `postgres.enabled = false` or don't provide the `TENSORZERO_POSTGRES_URL` environment variable, the gateway will fail to start with a configuration error.
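For example, here is a minimal sketch of the `[postgres]` section (the `TENSORZERO_POSTGRES_URL` environment variable must also be set):
```toml title="tensorzero.toml" theme={null}
[postgres]
enabled = true            # fail on startup if PostgreSQL is unreachable
connection_pool_size = 20 # documented default
```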
## `[rate_limiting]`
The `[rate_limiting]` section allows you to configure granular rate limits for your TensorZero Gateway.
Rate limits help you control usage, manage costs, and prevent abuse.
See [Enforce Custom Rate Limits](/operations/enforce-custom-rate-limits/) for a comprehensive guide on rate limiting.
### `enabled`
* **Type:** boolean
* **Required:** no (default: `true`)
Enable or disable rate limiting enforcement.
When set to `false`, rate limiting rules will not be enforced even if they are defined.
```toml theme={null}
[rate_limiting]
enabled = true
```
### `[[rate_limiting.rules]]`
Rate limiting rules are defined as an array of rule configurations.
Each rule specifies rate limits for specific resources (model inferences, tokens), time windows, scopes, and priorities.
#### Rate Limit Fields
You can set rate limits for different resources and time windows using the following field formats:
* `model_inferences_per_second`
* `model_inferences_per_minute`
* `model_inferences_per_hour`
* `model_inferences_per_day`
* `model_inferences_per_week`
* `model_inferences_per_month`
* `tokens_per_second`
* `tokens_per_minute`
* `tokens_per_hour`
* `tokens_per_day`
* `tokens_per_week`
* `tokens_per_month`
Each rate limit field can be specified in two formats:
**Simple Format:** A single integer value that sets both the capacity and refill rate to the same value.
```toml theme={null}
[[rate_limiting.rules]]
model_inferences_per_minute = 100
tokens_per_hour = 10000
```
**Bucket Format:** An object with explicit `capacity` and `refill_rate` fields for fine-grained control over the token bucket algorithm.
```toml theme={null}
[[rate_limiting.rules]]
tokens_per_minute = { capacity = 1000, refill_rate = 500 }
```
The simple format is equivalent to setting `capacity` and `refill_rate` to the same value.
The bucket format allows you to configure burst capacity independently from the sustained rate.
#### `priority`
* **Type:** integer
* **Required:** yes (unless `always` is set to `true`)
Defines the priority of the rule.
When multiple rules match a request, only the rules with the highest priority value are applied.
```toml theme={null}
[[rate_limiting.rules]]
model_inferences_per_minute = 10
priority = 1
```
#### `always`
* **Type:** boolean
* **Required:** no (mutually exclusive with `priority`)
When set to `true`, this rule will always be applied regardless of priority.
This is useful for global fallback limits.
You cannot specify both `always` and `priority` in the same rule.
```toml theme={null}
[[rate_limiting.rules]]
tokens_per_hour = 1000000
always = true
```
#### `scope`
* **Type:** array of scope objects
* **Required:** no (default: `[]`)
Defines the scope to which the rate limit applies.
Scopes allow you to apply rate limits to specific subsets of requests based on tags or API keys.
The following scopes are supported:
* Tags:
* `tag_key` (string): The tag key to match against.
* `tag_value` (string): The tag value to match against. This can be:
* `tensorzero::each`: Apply the limit separately to each unique value of the tag.
* `tensorzero::total`: Apply the limit to the aggregate of all requests with this tag, regardless of the tag's value.
* Any other string: Apply the limit only when the tag has this specific value.
* API Key Public ID (requires authentication to be enabled):
* `api_key_public_id` (string): The API key public ID to match against. This can be:
* `tensorzero::each`: Apply the limit separately to each API key.
* A specific 12-character public ID: Apply the limit only to requests authenticated with this API key.
For example:
```toml theme={null}
# Each individual user can make a maximum of 1 model inference per minute
[[rate_limiting.rules]]
priority = 0
model_inferences_per_minute = 1
scope = [
{ tag_key = "user_id", tag_value = "tensorzero::each" }
]
# But override the individual limit for the CEO
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 5
scope = [
{ tag_key = "user_id", tag_value = "ceo" }
]
# Each API key can make a maximum of 100 model inferences per hour
[[rate_limiting.rules]]
priority = 0
model_inferences_per_hour = 100
scope = [
{ api_key_public_id = "tensorzero::each" }
]
# But override the limit for a specific API key
[[rate_limiting.rules]]
priority = 1
model_inferences_per_hour = 1000
scope = [
{ api_key_public_id = "xxxxxxxxxxxx" }
]
```
# How to configure functions & variants
Source: https://www.tensorzero.com/docs/gateway/configure-functions-and-variants
Learn how to configure functions and variants to define your LLM application logic with TensorZero.
* A **function** represents a task or agent in your application (e.g. "write a product description" or "answer a customer question").
* A **variant** is a specific way to accomplish it: a choice of model, prompt, inference parameters, etc.
You can call models directly when getting started, but functions and variants unlock powerful capabilities as your application matures.
Some of the benefits include:
* **[Collect metrics and feedback](/gateway/guides/metrics-feedback):** Track performance and gather feedback for optimization.
* **[Run A/B tests](/experimentation/run-adaptive-ab-tests):** Experiment with different models, prompts, and parameters.
* **[Create prompt templates](/gateway/create-a-prompt-template):** Decouple prompts from application code for easier iteration.
* **[Configure retries & fallbacks](/gateway/guides/retries-fallbacks):** Build systems that handle provider downtime gracefully.
* **[Use advanced inference strategies](/gateway/guides/inference-time-optimizations):** Easily implement advanced inference-time optimizations like dynamic in-context-learning and best-of-N sampling.
## Configure functions & variants
TensorZero supports two function types:
* **`chat`** is the typical chat interface used by most LLMs. It returns unstructured text responses.
* **`json`** is for structured outputs. It returns responses that conform to a JSON schema. See [Generate structured outputs (JSON)](/gateway/generate-structured-outputs).
The skeleton of a function configuration looks like this:
```toml title="tensorzero.toml" theme={null}
[functions.my_function_name]
type = "..." # "chat" or "json"
# ... other fields depend on the function type ...
```
A variant is a particular implementation of a function.
It specifies the model to use, prompt templates, decoding strategy, hyperparameters, and other settings.
The skeleton of a variant configuration looks like this:
```toml title="tensorzero.toml" theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "..." # e.g. "chat_completion"
model = "..." # e.g. "openai::gpt-5" or "my_gpt_5"
# ... other fields (e.g. prompt templates, inference parameters) ...
```
The simplest variant type is **`chat_completion`**, which is the typical chat completion format used by OpenAI and many other LLM providers.
TensorZero supports other variant types that implement [inference-time optimizations](/gateway/guides/inference-time-optimizations/).
You can define prompt templates in your variant configuration rather than sending prompts directly in your inference requests.
This decouples prompts from application code and enables easier experimentation and optimization.
See [Create a prompt template](/gateway/create-a-prompt-template) for more details.
If you define multiple variants, TensorZero will randomly sample one of them at inference time.
You can define more advanced experimentation strategies (e.g. [Run adaptive A/B tests](/experimentation/run-adaptive-ab-tests/)), fallback-only variants (e.g. [Retries & Fallbacks](/gateway/guides/retries-fallbacks/)), and more.
### Example
Let's create a function called `answer_customer` with two variants: GPT-5 and Claude Sonnet 4.5.
```toml title="tensorzero.toml" theme={null}
[functions.answer_customer]
type = "chat"
[functions.answer_customer.variants.gpt_5_baseline]
type = "chat_completion"
model = "openai::gpt-5"
[functions.answer_customer.variants.claude_sonnet_4_5]
type = "chat_completion"
model = "anthropic::claude-sonnet-4-5"
```
You can now call the `answer_customer` function and TensorZero will randomly select one of the two variants for each request.
## Make inference requests
Once you've configured a function and its variants, you can make inference requests to the TensorZero Gateway.
```python theme={null}
result = t0.inference(
function_name="answer_customer",
input={
"messages": [
{"role": "user", "content": "What is your return policy?"},
],
},
)
```
```python theme={null}
result = client.chat.completions.create(
model="tensorzero::function_name::answer_customer",
messages=[
{"role": "user", "content": "What is your return policy?"},
],
)
```
```ts theme={null}
const response = await client.chat.completions.create({
model: "tensorzero::function_name::answer_customer",
messages: [
{
role: "user",
content: "What is your return policy?",
},
],
});
```
```bash theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"function_name": "answer_customer",
"input": {
"messages": [
{
"role": "user",
"content": "What is your return policy?"
}
]
}
}'
```
See [Call any LLM](/gateway/call-any-llm) for complete examples including setup and sample responses.
# How to configure models & providers
Source: https://www.tensorzero.com/docs/gateway/configure-models-and-providers
Learn how to configure models and model providers to access LLMs with TensorZero.
* A **model** specifies a particular LLM (e.g. GPT-5 or your fine-tuned Llama 3).
* A **model provider** specifies how you can access a given model (e.g. GPT-5 is available through both OpenAI and Azure).
You can call models directly using the inference endpoint or use them with [functions and variants](/gateway/configure-functions-and-variants) in TensorZero.
## Configure a model & model provider
A model has an arbitrary name and a list of providers.
Each provider has an arbitrary name, a type, and other fields that depend on the provider type.
The skeleton of a model and provider configuration looks like this:
```toml title="tensorzero.toml" "my_model_name" "my_provider_name" /"([.][.][.])"/ theme={null}
[models.my_model_name]
routing = ["my_provider_name"]
[models.my_model_name.providers.my_provider_name]
type = "..." # e.g. "openai"
# ... other fields depend on the provider type ...
```
TensorZero supports proprietary models (e.g. OpenAI, Anthropic), inference services (e.g. Fireworks AI, Together AI), and self-hosted LLMs (e.g. vLLM), including your own fine-tuned models on each of these.
See [Integrations](/integrations/model-providers/) for a complete list of supported providers and the [Configuration Reference](/gateway/configuration-reference/#modelsmodel_nameprovidersprovider_name) for all available configuration parameters.
### Example: GPT-5 + OpenAI
Let's configure a provider for GPT-5 from OpenAI.
We'll call our model `my_gpt_5` and our provider `my_openai_provider` with type `openai`.
The only required field for the `openai` provider is `model_name`.
```toml title="tensorzero.toml" ins="my_gpt_5" ins="my_openai_provider" ins="openai" ins="model_name = "gpt-5"" theme={null}
[models.my_gpt_5]
routing = ["my_openai_provider"]
[models.my_gpt_5.providers.my_openai_provider]
type = "openai"
model_name = "gpt-5"
```
You can now reference the model `my_gpt_5` when calling the inference endpoint or when configuring functions and variants.
## Configure multiple providers for fallback & routing
You can configure multiple providers for the same model to enable automatic fallbacks.
The gateway will try each provider in the `routing` field in order until one succeeds.
This helps mitigate provider downtime and rate limiting.
For example, you might configure both OpenAI and Azure as providers for GPT-5:
```toml title="tensorzero.toml" mark="routing = ["my_openai_provider", "my_azure_provider"]" theme={null}
[models.my_gpt_5]
routing = ["my_openai_provider", "my_azure_provider"]
[models.my_gpt_5.providers.my_openai_provider]
type = "openai"
model_name = "gpt-5"
[models.my_gpt_5.providers.my_azure_provider]
type = "azure"
deployment_id = "gpt-5"
endpoint = "https://your-resource.openai.azure.com"
```
See [Retries & Fallbacks](/gateway/guides/retries-fallbacks/) for more details on configuring robust routing strategies.
## Use short-hand model names
If you don't need advanced functionality like fallback routing or custom credentials, you can use shorthand model names directly in your variant configuration.
TensorZero supports shorthand names like:
* `openai::gpt-5`
* `anthropic::claude-haiku-4-5`
* `google::gemini-2.5-flash`
You can use these directly in a variant's `model` field without defining a separate model configuration block.
```toml title="tensorzero.toml" mark="model = "openai::gpt-5"" theme={null}
[functions.my_function.variants.my_variant]
type = "chat_completion"
model = "openai::gpt-5"
# ...
```
# How to create a prompt template
Source: https://www.tensorzero.com/docs/gateway/create-a-prompt-template
Learn how to use prompt templates and schemas to manage complexity in your prompts.
## Why create a prompt template?
Prompt templates and schemas simplify engineering iteration, experimentation, and optimization, especially as application complexity and team size grow.
Notably, they enable you to:
1. **Decouple prompts from application code.**
As you iterate on your prompts over time (or [A/B test different prompts](/experimentation/run-adaptive-ab-tests)), you'll be able to manage them in a centralized way without making changes to the application code.
2. **Collect a structured inference dataset.**
Imagine down the road you want to [fine-tune a model](/recipes/) using your historical data.
If you had only stored prompts as strings, you'd be stuck with the outdated prompts that were actually used at inference time.
However, if you had access to the input variables in a structured dataset, you'd easily be able to counterfactually swap new prompts into your training data before fine-tuning.
This is particularly important when experimenting with new models, because prompts don't always translate well between them.
3. **Implement model-specific prompts.**
We often find that the best prompt for one model is different from the best prompt for another.
As you try out different models, you'll need to be able to independently vary the prompt and the model and try different combinations thereof.
This is commonly challenging to implement in application code, but trivial in TensorZero.
You can also find a complete runnable example for this guide on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/gateway/create-a-prompt-template).
## Set up a prompt template
Create a file with your MiniJinja template:
```minijinja title="config/functions/fun_fact/gpt_5_mini/fun_fact_topic_template.minijinja" theme={null}
Share a fun fact about: {{ topic }}
```
TensorZero uses the [MiniJinja templating language](https://docs.rs/minijinja/latest/minijinja/syntax/index.html).
MiniJinja is [mostly compatible with Jinja2](https://github.com/mitsuhiko/minijinja/blob/main/COMPATIBILITY.md), which is used by many popular projects like Flask and Django.
MiniJinja provides a [browser playground](https://mitsuhiko.github.io/minijinja-playground/) where you can test your templates.
Next, you must declare the template in the [variant configuration](/gateway/configure-functions-and-variants).
You can do this by adding the field `templates.your_template_name.path` to your variant with a path to your template file.
For example, let's configure a template called `fun_fact_topic` for our variant:
```toml title="config/tensorzero.toml" theme={null}
[functions.fun_fact]
type = "chat"
[functions.fun_fact.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"
templates.fun_fact_topic.path = "functions/fun_fact/gpt_5_mini/fun_fact_topic_template.minijinja" # relative to this file
```
You can configure multiple templates for a variant.
Use your template during inference by sending a content block with the template name and arguments.
```python theme={null}
result = t0.inference(
function_name="fun_fact",
input={
"messages": [
{
"role": "user",
"content": [
{
"type": "template",
"name": "fun_fact_topic",
"arguments": {"topic": "artificial intelligence"},
}
],
}
],
},
)
```
Use your template during inference by sending a `tensorzero::template` content block with the template name and arguments.
```python theme={null}
result = client.chat.completions.create(
model="tensorzero::function_name::fun_fact",
messages=[
{
"role": "user",
"content": [
{
"type": "tensorzero::template", # type: ignore
"name": "fun_fact_topic",
"arguments": {"topic": "artificial intelligence"},
}
],
},
],
)
```
Use your template during inference by sending a `template` content block with the template name and arguments.
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "fun_fact",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"type": "template",
"name": "fun_fact_topic",
"arguments": {
"topic": "artificial intelligence"
}
}
]
}
]
}
}'
```
## Set up a template schema
When you have multiple variants for a function, it becomes challenging to ensure all templates use consistent variable names and types.
Schemas solve this by defining a contract that validates template variables and catches configuration errors before they reach production.
Defining a schema is optional but recommended.
Create a [JSON Schema](https://json-schema.org/) for the variables used by your templates.
Let's define a schema for our previous example, which includes only a single variable `topic`:
```json title="config/functions/fun_fact/fun_fact_topic_schema.json" theme={null}
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"topic": {
"type": "string"
}
},
"required": ["topic"],
"additionalProperties": false
}
```
LLMs are great at generating JSON Schemas.
For example, the schema above was generated with the following request:
```txt theme={null}
Generate a JSON schema with a single field: `topic`.
The `topic` field is required. No additional fields are allowed.
```
You can also export JSON Schemas from [Pydantic models](https://docs.pydantic.dev/latest/concepts/json_schema/) and [Zod schemas](https://www.npmjs.com/package/zod-to-json-schema).
Then, declare your schema in your function definition using `schemas.your_schema_name.path`.
This will ensure that every variant for the function has a template named `your_schema_name`.
In our example above, this would mean updating the function definition to:
```toml theme={null}
[functions.fun_fact]
type = "chat"
schemas.fun_fact_topic.path = "functions/fun_fact/fun_fact_topic_schema.json" # relative to this file // [!code ++]
[functions.fun_fact.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"
templates.fun_fact_topic.path = "functions/fun_fact/gpt_5_mini/fun_fact_topic_template.minijinja" # relative to this file
```
## Re-use prompt snippets
You can enable template file system access to reuse shared snippets in your prompts.
To use the MiniJinja directives `{% include %}` and `{% import %}`, set `gateway.template_filesystem_access.base_path` in your configuration.
See [Organize your configuration](/operations/organize-your-configuration#enable-template-file-system-access-to-reuse-shared-snippets) for details.
## Migrate from legacy prompt templates
In earlier versions of TensorZero, prompt templates were defined as `system_template`, `user_template`, and `assistant_template`.
Similarly, template schemas were defined as `system_schema`, `user_schema`, and `assistant_schema`.
This legacy approach limited the flexibility of prompt templates by restricting functions to a single template per role.
As you create new functions and templates, you should use the new `templates.your_template_name.path` format.
Historical observability data stored in your ClickHouse database still uses the legacy format.
If you want to keep this data forward-compatible (e.g. for fine-tuning), you can update your configuration as follows:
| Legacy Configuration | Updated Configuration |
| -------------------- | -------------------------- |
| `system_template` | `templates.system.path` |
| `system_schema` | `schemas.system.path` |
| `user_template` | `templates.user.path` |
| `user_schema` | `schemas.user.path` |
| `assistant_template` | `templates.assistant.path` |
| `assistant_schema` | `schemas.assistant.path` |
As we deprecate the legacy format, TensorZero will automatically look for templates and schemas in the new format for your historical data.
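As a sketch, migrating a variant with a legacy system template to the new format looks like this; the file path is illustrative:
```toml title="tensorzero.toml" theme={null}
[functions.fun_fact.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"
# Legacy format:
# system_template = "functions/fun_fact/gpt_5_mini/system_template.minijinja"
# Updated format:
templates.system.path = "functions/fun_fact/gpt_5_mini/system_template.minijinja"
```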
# Data Model
Source: https://www.tensorzero.com/docs/gateway/data-model
Learn more about the data model used by TensorZero.
The TensorZero Gateway stores inference and feedback data in ClickHouse.
This data can be used for observability, experimentation, and optimization.
## `ChatInference`
The `ChatInference` table stores information about inference requests for Chat Functions made to the TensorZero Gateway.
A `ChatInference` row can be associated with one or more `ModelInference` rows, depending on the variant's `type`.
For `chat_completion`, there will be a one-to-one relationship between rows in the two tables.
For other variant types, there might be more associated rows.
| Column | Type | Notes |
| :------------------- | :-----------------: | -------------------------------------------------------------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `function_name` | String | |
| `variant_name` | String | |
| `episode_id` | UUID | Must be a UUIDv7 |
| `input` | String (JSON) | `input` field in the `/inference` request body |
| `output` | String (JSON) | Array of content blocks |
| `tool_params` | String (JSON) | Object with any tool parameters (e.g. `tool_choice`, `tools_available`) used for the inference |
| `inference_params` | String (JSON) | Object with any inference parameters per variant type (e.g. `{"chat_completion": {"temperature": 0.5}}`) |
| `processing_time_ms` | UInt32 | |
| `timestamp` | DateTime | Materialized from `id` (using `UUIDv7ToDateTime` function) |
| `tags` | Map(String, String) | User-assigned tags (e.g. `{"user_id": "123"}`) |
## `JsonInference`
The `JsonInference` table stores information about inference requests for JSON Functions made to the TensorZero Gateway.
A `JsonInference` row can be associated with one or more `ModelInference` rows, depending on the variant's `type`.
For `chat_completion`, there will be a one-to-one relationship between rows in the two tables.
For other variant types, there might be more associated rows.
| Column | Type | Notes |
| :------------------- | :-----------------: | -------------------------------------------------------------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `function_name` | String | |
| `variant_name` | String | |
| `episode_id` | UUID | Must be a UUIDv7 |
| `input` | String (JSON) | `input` field in the `/inference` request body |
| `output` | String (JSON) | Object with `parsed` and `raw` fields |
| `output_schema` | String (JSON) | Schema that the output must conform to |
| `inference_params` | String (JSON) | Object with any inference parameters per variant type (e.g. `{"chat_completion": {"temperature": 0.5}}`) |
| `processing_time_ms` | UInt32 | |
| `timestamp` | DateTime | Materialized from `id` (using `UUIDv7ToDateTime` function) |
| `tags` | Map(String, String) | User-assigned tags (e.g. `{"user_id": "123"}`) |
## `ModelInference`
The `ModelInference` table stores information about each inference request to a model provider.
This is the inference request you'd make if you had called the model provider directly.
| Column | Type | Notes |
| :-------------------- | :-------------------: | ---------------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `inference_id` | UUID | Must be a UUIDv7 |
| `raw_request` | String | Raw request as sent to the model provider (varies) |
| `raw_response` | String | Raw response from the model provider (varies) |
| `model_name` | String | Name of the model used for the inference |
| `model_provider_name` | String | Name of the model provider used for the inference |
| `input_tokens` | Nullable(UInt32) | |
| `output_tokens` | Nullable(UInt32) | |
| `response_time_ms` | Nullable(UInt32) | |
| `ttft_ms` | Nullable(UInt32) | Only available in streaming inferences |
| `timestamp` | DateTime | Materialized from `id` (using `UUIDv7ToDateTime` function) |
| `system` | Nullable(String) | The `system` input to the model |
| `input_messages` | Array(RequestMessage) | The user and assistant messages input to the model |
| `output` | Array(ContentBlock) | The output of the model |
A `RequestMessage` is an object with shape `{role: "user" | "assistant", content: List[ContentBlock]}` (content blocks are defined [here](/gateway/api-reference/inference/#content-block)).
## `DynamicInContextLearningExample`
The `DynamicInContextLearningExample` table stores examples for dynamic in-context learning variants.
| Column | Type | Notes |
| :-------------- | :------------: | ---------------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `function_name` | String | |
| `variant_name` | String | |
| `namespace` | String | |
| `input` | String (JSON) | |
| `output` | String | |
| `embedding` | Array(Float32) | |
| `timestamp` | DateTime | Materialized from `id` (using `UUIDv7ToDateTime` function) |
## `BooleanMetricFeedback`
The `BooleanMetricFeedback` table stores feedback for metrics of `type = "boolean"`.
| Column | Type | Notes |
| :------------ | :-----------------: | ---------------------------------------------------------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `target_id` | UUID | Must be a UUIDv7 that is either `inference_id` or `episode_id` depending on `level` in metric config |
| `metric_name` | String | |
| `value` | Bool | |
| `timestamp` | DateTime | Materialized from `id` (using `UUIDv7ToDateTime` function) |
| `tags` | Map(String, String) | User-assigned tags (e.g. `{"author": "Alice"}`) |
## `FloatMetricFeedback`
The `FloatMetricFeedback` table stores feedback for metrics of `type = "float"`.
| Column | Type | Notes |
| :------------ | :-----------------: | ---------------------------------------------------------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `target_id` | UUID | Must be a UUIDv7 that is either `inference_id` or `episode_id` depending on `level` in metric config |
| `metric_name` | String | |
| `value` | Float32 | |
| `timestamp` | DateTime | Materialized from `id` (using `UUIDv7ToDateTime` function) |
| `tags` | Map(String, String) | User-assigned tags (e.g. `{"author": "Alice"}`) |
## `CommentFeedback`
The `CommentFeedback` table stores feedback provided with `metric_name` of `"comment"`.
Comments are free-form text feedback.
| Column | Type | Notes |
| :------------ | :--------------------------: | ---------------------------------------------------------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `target_id` | UUID | Must be a UUIDv7 that is either `inference_id` or `episode_id` depending on `level` in metric config |
| `target_type` | `"inference"` or `"episode"` | |
| `value` | String | |
| `timestamp` | DateTime | Materialized from `id` (using `UUIDv7ToDateTime` function) |
| `tags` | Map(String, String) | User-assigned tags (e.g. `{"author": "Alice"}`) |
## `DemonstrationFeedback`
The `DemonstrationFeedback` table stores feedback in the form of demonstrations.
Demonstrations are examples of good behaviors.
| Column | Type | Notes |
| :------------- | :-----------------: | ------------------------------------------------------------------------------ |
| `id` | UUID | Must be a UUIDv7 |
| `inference_id` | UUID | Must be a UUIDv7 |
| `value` | String | The demonstration or example provided as feedback (must match function output) |
| `timestamp` | DateTime | Materialized from `id` (using `UUIDv7ToDateTime` function) |
| `tags` | Map(String, String) | User-assigned tags (e.g. `{"author": "Alice"}`) |
## `ModelInferenceCache`
The `ModelInferenceCache` table stores cached model inference results to avoid duplicate requests.
| Column | Type | Notes |
| :---------------- | :-------------: | ---------------------------------------------------- |
| `short_cache_key` | UInt64 | First part of composite key for fast lookups |
| `long_cache_key` | FixedString(64) | Hex-encoded 256-bit key for full cache validation |
| `timestamp` | DateTime | When this cache entry was created, defaults to now() |
| `output` | String | The cached model output |
| `raw_request` | String | Raw request that was sent to the model provider |
| `raw_response` | String | Raw response received from the model provider |
| `is_deleted` | Bool | Soft deletion flag, defaults to false |
The table uses the `ReplacingMergeTree` engine with `timestamp` and `is_deleted` columns for deduplication.
It is partitioned by month and ordered by the composite cache key `(short_cache_key, long_cache_key)`.
The `short_cache_key` serves as the primary key for performance, while a bloom filter index on `long_cache_key`
helps optimize point queries.
## `ChatInferenceDatapoint`
The `ChatInferenceDatapoint` table stores chat inference examples organized into datasets.
| Column | Type | Notes |
| :-------------- | :---------------------: | ---------------------------------------------------------------------------------------------- |
| `dataset_name` | LowCardinality(String) | Name of the dataset this example belongs to |
| `function_name` | LowCardinality(String) | Name of the function this example is for |
| `id` | UUID | Must be a UUIDv7, often the inference ID if generated from an inference |
| `episode_id` | UUID | Must be a UUIDv7 |
| `input` | String (JSON) | `input` field in the `/inference` request body |
| `output` | Nullable(String) (JSON) | Array of content blocks |
| `tool_params` | String (JSON) | Object with any tool parameters (e.g. `tool_choice`, `tools_available`) used for the inference |
| `tags` | Map(String, String) | User-assigned tags (e.g. `{"user_id": "123"}`) |
| `auxiliary` | String | Additional JSON data (unstructured) |
| `is_deleted` | Bool | Soft deletion flag, defaults to false |
| `updated_at` | DateTime | When this dataset entry was updated, defaults to now() |
The table uses the `ReplacingMergeTree` engine with `updated_at` and `is_deleted` columns for deduplication.
It is ordered by `dataset_name`, `function_name`, and `id` to optimize queries filtering by dataset and function.
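For example, here's a rough sketch of pulling a dataset's examples directly from ClickHouse. It assumes the `clickhouse-connect` Python client (any ClickHouse client works), the credentials from the example Docker Compose deployment shown later in these docs, and a hypothetical dataset named `my_dataset`; `FINAL` collapses duplicate rows from the `ReplacingMergeTree` engine.
```python theme={null}
# Sketch: fetch dataset examples from ChatInferenceDatapoint.
# Assumptions: clickhouse-connect client, example Docker Compose credentials,
# and a hypothetical dataset named "my_dataset".
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username="chuser",
    password="chpassword",
    database="tensorzero",
)

result = client.query(
    """
    SELECT id, input, output
    FROM ChatInferenceDatapoint FINAL
    WHERE dataset_name = %(dataset)s
      AND function_name = %(function)s
      AND is_deleted = false
    """,
    parameters={"dataset": "my_dataset", "function": "generate_haiku"},
)

for row in result.result_rows:
    print(row)
```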
## `JsonInferenceDatapoint`
The `JsonInferenceDatapoint` table stores JSON inference examples organized into datasets.
| Column | Type | Notes |
| :-------------- | :--------------------: | ----------------------------------------------------------------------- |
| `dataset_name` | LowCardinality(String) | Name of the dataset this example belongs to |
| `function_name` | LowCardinality(String) | Name of the function this example is for |
| `id` | UUID | Must be a UUIDv7, often the inference ID if generated from an inference |
| `episode_id` | UUID | Must be a UUIDv7 |
| `input` | String (JSON) | `input` field in the `/inference` request body |
| `output` | String (JSON) | Object with `parsed` and `raw` fields |
| `output_schema` | String (JSON) | Schema that the output must conform to |
| `tags` | Map(String, String) | User-assigned tags (e.g. `{"user_id": "123"}`) |
| `auxiliary` | String | Additional JSON data (unstructured) |
| `is_deleted` | Bool | Soft deletion flag, defaults to false |
| `updated_at` | DateTime | When this dataset entry was updated, defaults to now() |
The table uses the `ReplacingMergeTree` engine with `updated_at` and `is_deleted` columns for deduplication.
It is ordered by `dataset_name`, `function_name`, and `id` to optimize queries filtering by dataset and function.
## `BatchRequest`
The `BatchRequest` table stores information about batch requests made to model providers. We update it every time a particular `batch_id` is created or polled.
| Column | Type | Notes |
| :-------------------- | :-----------: | ---------------------------------------------------------- |
| `batch_id` | UUID | Must be a UUIDv7 |
| `id` | UUID | Must be a UUIDv7 |
| `batch_params` | String | Parameters used for the batch request |
| `model_name` | String | Name of the model used |
| `model_provider_name` | String | Name of the model provider |
| `status` | String | One of: 'pending', 'completed', 'failed' |
| `errors` | Array(String) | Array of error messages if status is 'failed' |
| `timestamp` | DateTime | Materialized from `id` (using `UUIDv7ToDateTime` function) |
| `raw_request` | String | Raw request sent to the model provider |
| `raw_response` | String | Raw response received from the model provider |
| `function_name` | String | Name of the function being called |
| `variant_name` | String | Name of the function variant |
## `BatchModelInference`
The `BatchModelInference` table stores information about inferences made as part of a batch request.
Once the request succeeds, we use this information to populate the `ChatInference`, `JsonInference`, and `ModelInference` tables.
| Column | Type | Notes |
| :-------------------- | :-------------------: | -------------------------------------------------------------------------------------------------------- |
| `inference_id` | UUID | Must be a UUIDv7 |
| `batch_id` | UUID | Must be a UUIDv7 |
| `function_name` | String | Name of the function being called |
| `variant_name` | String | Name of the function variant |
| `episode_id` | UUID | Must be a UUIDv7 |
| `input` | String (JSON) | `input` field in the `/inference` request body |
| `system` | String | The `system` input to the model |
| `input_messages` | Array(RequestMessage) | The user and assistant messages input to the model |
| `tool_params` | String (JSON) | Object with any tool parameters (e.g. `tool_choice`, `tools_available`) used for the inference |
| `inference_params` | String (JSON) | Object with any inference parameters per variant type (e.g. `{"chat_completion": {"temperature": 0.5}}`) |
| `raw_request` | String | Raw request sent to the model provider |
| `model_name` | String | Name of the model used |
| `model_provider_name` | String | Name of the model provider |
| `output_schema` | String | Optional schema for JSON outputs |
| `tags` | Map(String, String) | User-assigned tags (e.g. `{"author": "Alice"}`) |
| `timestamp`           | DateTime              | Materialized from `inference_id` (using `UUIDv7ToDateTime` function)                                      |
[Materialized views](https://clickhouse.com/docs/en/materialized-view) in columnar databases like ClickHouse pre-compute alternative indexings of data, dramatically improving query performance compared to computing results on-the-fly.
In TensorZero's case, we store denormalized data about inferences and feedback in the materialized views below to support efficient queries for common downstream use cases.
## `FeedbackTag`
The `FeedbackTag` table stores tags associated with various feedback types. Tags are used to categorize and add metadata to feedback entries, allowing for user-defined filtering later on. Data is inserted into this table by materialized views reading from the `BooleanMetricFeedback`, `CommentFeedback`, `DemonstrationFeedback`, and `FloatMetricFeedback` tables.
| Column | Type | Notes |
| ------------- | ------ | ------------------------------------------------------------------------------- |
| `metric_name` | String | Name of the metric the tag is associated with. |
| `key` | String | Key of the tag. |
| `value` | String | Value of the tag. |
| `feedback_id` | UUID | UUID referencing the related feedback entry (e.g., `BooleanMetricFeedback.id`). |
## `InferenceById`
The `InferenceById` table is a materialized view that combines data from `ChatInference` and `JsonInference`.
Notably, it is keyed by `id_uint`, which the gateway uses for fast lookups when validating feedback requests.
We store `id_uint` as a `UInt128` so that rows sort in natural chronological order, since ClickHouse sorts UUIDs in little-endian order (see the sketch after the table below).
| Column | Type | Notes |
| :-------------- | :-----: | -------------------------------------------------- |
| `id_uint` | UInt128 | Integer representation of UUIDv7 for sorting order |
| `function_name` | String | |
| `variant_name` | String | |
| `episode_id` | UUID | Must be a UUIDv7 |
| `function_type` | String | Either `'chat'` or `'json'` |
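As a rough illustration of the conversion (a sketch only; the gateway performs this internally and its exact implementation may differ):
```python theme={null}
from uuid import UUID

def uuid7_to_uint128(u: str) -> int:
    # UUIDv7 places a millisecond timestamp in its most significant bits,
    # so the big-endian 128-bit integer value of the UUID sorts chronologically.
    return int(UUID(u))

# e.g. the value you'd compare against `id_uint` when querying this view directly
print(uuid7_to_uint128("01920c75-d114-7aa1-aadb-26a31bb3c7a0"))
```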
## `InferenceByEpisodeId`
The `InferenceByEpisodeId` table is a materialized view that indexes inferences by their episode ID, enabling efficient lookup of all inferences within an episode.
We store `episode_id_uint` as a `UInt128` so that rows sort in natural chronological order, since ClickHouse sorts UUIDs in little-endian order.
| Column | Type | Notes |
| :---------------- | :------------------: | -------------------------------------------------- |
| `episode_id_uint` | UInt128 | Integer representation of UUIDv7 for sorting order |
| `id_uint` | UInt128 | Integer representation of UUIDv7 for sorting order |
| `function_name` | String | Name of the function being called |
| `variant_name` | String | Name of the function variant |
| `function_type` | Enum('chat', 'json') | Type of function (chat or json) |
## `InferenceTag`
The `InferenceTag` table stores tags associated with inferences. Tags are used to categorize and add metadata to inferences, allowing for user-defined filtering later on. Data is inserted into this table by materialized views reading from the `ChatInference` and `JsonInference` tables.
| Column | Type | Notes |
| --------------- | ------ | ------------------------------------------------------------------ |
| `function_name` | String | Name of the function the tag is associated with. |
| `key` | String | Key of the tag. |
| `value` | String | Value of the tag. |
| `inference_id` | UUID | UUID referencing the related inference (e.g., `ChatInference.id`). |
## `BatchIdByInferenceId`
The `BatchIdByInferenceId` table maps inference IDs to batch IDs, allowing for efficient lookup of which batch an inference belongs to.
| Column | Type | Notes |
| :------------- | :--: | ---------------- |
| `inference_id` | UUID | Must be a UUIDv7 |
| `batch_id` | UUID | Must be a UUIDv7 |
## `BooleanMetricFeedbackByTargetId`
The `BooleanMetricFeedbackByTargetId` table indexes boolean metric feedback by target ID, enabling efficient lookup of feedback for a specific target.
| Column | Type | Notes |
| :------------ | :-----------------: | ---------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `target_id` | UUID | Must be a UUIDv7 |
| `metric_name` | String | Name of the metric (stored as LowCardinality) |
| `value` | Bool | The boolean feedback value |
| `tags` | Map(String, String) | Key-value pairs of tags associated with the feedback |
## `CommentFeedbackByTargetId`
The `CommentFeedbackByTargetId` table stores text feedback associated with inferences or episodes, enabling efficient lookup of comments by their target ID.
| Column | Type | Notes |
| :------------ | :--------------------------: | ---------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `target_id` | UUID | Must be a UUIDv7 |
| `target_type` | Enum('inference', 'episode') | Type of entity this feedback is for |
| `value` | String | The text feedback content |
| `tags` | Map(String, String) | Key-value pairs of tags associated with the feedback |
## `DemonstrationFeedbackByInferenceId`
The `DemonstrationFeedbackByInferenceId` table stores demonstration feedback associated with inferences, enabling efficient lookup of demonstrations by inference ID.
| Column | Type | Notes |
| :------------- | :-----------------: | ---------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `inference_id` | UUID | Must be a UUIDv7 |
| `value` | String | The demonstration feedback content |
| `tags` | Map(String, String) | Key-value pairs of tags associated with the feedback |
## `FloatMetricFeedbackByTargetId`
The `FloatMetricFeedbackByTargetId` table indexes float metric feedback by target ID, enabling efficient lookup of feedback for a specific target.
| Column | Type | Notes |
| :------------ | :-----------------: | ---------------------------------------------------- |
| `id` | UUID | Must be a UUIDv7 |
| `target_id` | UUID | Must be a UUIDv7 |
| `metric_name` | String | Name of the metric (stored as LowCardinality) |
| `value` | Float32 | The float feedback value |
| `tags` | Map(String, String) | Key-value pairs of tags associated with the feedback |
# How to generate embeddings
Source: https://www.tensorzero.com/docs/gateway/generate-embeddings
Learn how to generate embeddings from many model providers using the TensorZero Gateway with a unified API.
This page shows how to:
* **Generate embeddings with a unified API.** TensorZero unifies many LLM APIs (e.g. OpenAI) and inference servers (e.g. Ollama).
* **Use any programming language.** You can use any OpenAI SDK (Python, Node, Go, etc.) or the OpenAI-compatible HTTP API.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/embeddings) of this guide on GitHub.
## Generate embeddings from OpenAI
Our example uses the OpenAI Python SDK, but you can use any OpenAI SDK or call the OpenAI-compatible HTTP API.
See [Call any LLM](/gateway/call-any-llm) for an example using the OpenAI Node SDK.
The TensorZero Python SDK doesn't have a standalone embedding endpoint at the moment.
Instead, it integrates with the OpenAI Python SDK to provide a unified API for calling any LLM.
For example, if you're using OpenAI, you can set the `OPENAI_API_KEY` environment variable with your API key.
```bash theme={null}
export OPENAI_API_KEY="sk-..."
```
See the [Integrations](/integrations/model-providers) page to learn how to set up credentials for other LLM providers.
You can install the OpenAI and TensorZero SDKs with a Python package manager like `pip`.
```bash theme={null}
pip install openai tensorzero
```
Let's initialize the TensorZero Gateway and patch the OpenAI client to use it.
For simplicity, we'll use an embedded gateway without observability or custom configuration.
```python theme={null}
from openai import OpenAI
from tensorzero import patch_openai_client
client = OpenAI()
patch_openai_client(client, async_setup=False)
```
The TensorZero Python SDK supports both the synchronous `OpenAI` client and the asynchronous `AsyncOpenAI` client.
Both options support running the gateway embedded in your application with `patch_openai_client` or connecting to a standalone gateway with `base_url`.
The embedded gateway supports synchronous initialization with `async_setup=False` or asynchronous initialization with `async_setup=True`.
See [Clients](/gateway/clients/) for more details.
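For reference, connecting to a standalone gateway instead looks roughly like this (assuming the gateway is running on `localhost:3000`):
```python theme={null}
from openai import OpenAI

# Point the OpenAI client at a standalone TensorZero Gateway (assumed at localhost:3000)
client = OpenAI(
    base_url="http://localhost:3000/openai/v1",
    api_key="unused",  # provider credentials are configured on the gateway, not the client
)
```
Either way, you can now generate an embedding: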
```python theme={null}
result = client.embeddings.create(
input="Hello, world!",
model="tensorzero::embedding_model_name::openai::text-embedding-3-small",
# or: Azure, any OpenAI-compatible endpoint (e.g. Ollama, Voyager)
)
```
```python theme={null}
CreateEmbeddingResponse(
data=[
Embedding(
embedding=[
-0.019143931567668915,
# ...
],
index=0,
object='embedding'
)
],
model='tensorzero::embedding_model_name::openai::text-embedding-3-small',
object='list',
usage=Usage(prompt_tokens=4, total_tokens=4)
)
```
## Define a custom embedding model
You can define a custom embedding model in your TensorZero configuration file.
For example, let's define a custom embedding model for `nomic-embed-text` served locally by Ollama.
Download the embedding model and launch the Ollama server:
```bash theme={null}
ollama pull nomic-embed-text
ollama serve
```
We assume that Ollama is available on `http://localhost:11434`.
Add your custom model and model provider to your configuration file:
```toml title="tensorzero.toml" theme={null}
[embedding_models.nomic-embed-text]
routing = ["ollama"]
[embedding_models.nomic-embed-text.providers.ollama]
type = "openai"
api_base = "http://localhost:11434/v1"
model_name = "nomic-embed-text"
api_key_location = "none"
```
See the [Configuration Reference](/gateway/configuration-reference#%5Bembedding-models-model-name%5D) for details on configuring your embedding models.
Use your custom model by referencing it with `tensorzero::embedding_model_name::nomic-embed-text`.
For example, using the OpenAI Python SDK:
```python theme={null}
from openai import OpenAI
from tensorzero import patch_openai_client
client = OpenAI()
patch_openai_client(
client,
config_file="config/tensorzero.toml",
async_setup=False,
)
result = client.embeddings.create(
input="Hello, world!",
model="tensorzero::embedding_model_name::nomic-embed-text",
)
```
```python theme={null}
CreateEmbeddingResponse(
data=[
Embedding(
embedding=[
-0.019143931567668915,
# ...
],
index=0,
object='embedding'
)
],
model='tensorzero::embedding_model_name::nomic-embed-text',
object='list',
usage=Usage(prompt_tokens=4, total_tokens=4)
)
```
## Cache embeddings
The TensorZero Gateway supports caching embeddings to improve latency and reduce costs.
When caching is enabled, identical embedding requests will be served from the cache instead of being sent to the model provider.
```python theme={null}
result = client.embeddings.create(
input="Hello, world!",
model="tensorzero::embedding_model_name::openai::text-embedding-3-small",
extra_body={
"tensorzero::cache_options": {
"enabled": "on", # Enable reading from and writing to cache
"max_age_s": 3600, # Optional: cache entries older than 1 hour are ignored
}
}
)
```
Caching works for single embeddings.
Batch embedding requests (multiple inputs) will write to the cache but won't serve cached responses.
See the [Inference Caching](/gateway/guides/inference-caching) guide for more details on cache modes and options.
# How to generate structured outputs
Source: https://www.tensorzero.com/docs/gateway/generate-structured-outputs
Learn how to generate structured outputs (JSON) effectively using TensorZero.
[TensorZero Functions](/gateway/configure-functions-and-variants) come in two flavors:
* **`chat`:** the default choice for most LLM chat completion use cases
* **`json`:** a specialized function type when your goal is generating structured outputs
As a rule of thumb, you should use JSON functions if you have a single, well-defined output schema.
If you need more flexibility (e.g. letting the model pick between multiple tools, or whether to pick a tool at all), then Chat Functions with [tool use](/gateway/guides/tool-use) might be a better fit.
## Generate structured outputs with a static schema
Let's create a JSON function for one of its typical use cases: data extraction.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/gateway/generate-structured-outputs) of this guide on GitHub.
Create a configuration file that defines your JSON function with the output schema and JSON mode.
If you don't specify an `output_schema`, the gateway will default to accepting any valid JSON output.
```toml tensorzero.toml theme={null}
[functions.extract_data]
type = "json"
output_schema = "output_schema.json" # optional
[functions.extract_data.variants.baseline]
type = "chat_completion"
model = "openai::gpt-5-mini"
system_template = "system_template.minijinja"
json_mode = "strict"
```
The field `json_mode` can be one of the following: `off`, `on`, `strict`, or `tool`.
The `tool` strategy is a custom TensorZero implementation that leverages tool use under the hood for generating JSON.
See [Configuration Reference](/gateway/configuration-reference) for details.
Use `"strict"` mode for providers that support it (e.g. OpenAI) or `"tool"` for others.
If you choose to specify a schema, place it in the relevant file:
```json output_schema.json theme={null}
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"name": {
"type": ["string", "null"],
"description": "The customer's full name"
},
"email": {
"type": ["string", "null"],
"description": "The customer's email address"
}
},
"required": ["name", "email"],
"additionalProperties": false
}
```
Create a template that instructs the model to extract the information you need.
```txt system_template.minijinja theme={null}
You are a helpful AI assistant that extracts customer information from messages.
Extract the customer's name and email address if present. Use null for any fields that are not found.
Your output should be a JSON object with the following schema:
{
"name": string or null,
"email": string or null
}
---
Examples:
User: Hi, I'm Sarah Johnson and you can reach me at sarah.j@example.com
Assistant: {"name": "Sarah Johnson", "email": "sarah.j@example.com"}
User: My email is contact@company.com
Assistant: {"name": null, "email": "contact@company.com"}
User: This is John Doe reaching out
Assistant: {"name": "John Doe", "email": null}
```
Including examples in your prompt helps the model understand the expected output format and improves accuracy.
When using the TensorZero SDK, the response will include `raw` and `parsed` values.
The `parsed` field contains the validated JSON object.
If the output doesn't match the schema or isn't valid JSON, `parsed` will be `None` and you can fall back to the `raw` string output.
```python theme={null}
from tensorzero import TensorZeroGateway
t0 = TensorZeroGateway.build_http(gateway_url="http://localhost:3000")
response = t0.inference(
function_name="extract_data",
input={
"messages": [
{
"role": "user",
"content": "Hi, I'm Sarah Johnson and you can reach me at sarah.j@example.com",
}
]
},
)
```
```python theme={null}
JsonInferenceResponse(
inference_id=UUID('019a78dc-0045-79e2-9629-cbcd47674abe'),
episode_id=UUID('019a78dc-0045-79e2-9629-cbdaf9d830bd'),
variant_name='baseline',
output=JsonInferenceOutput(
raw='{"name":"Sarah Johnson","email":"sarah.j@example.com"}',
parsed={'name': 'Sarah Johnson', 'email': 'sarah.j@example.com'}
),
usage=Usage(input_tokens=252, output_tokens=26),
finish_reason=FinishReason.STOP,
raw_response=None
)
```
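As a minimal sketch of the fallback described above (reusing `response` from the example):
```python theme={null}
import json

# Prefer the schema-validated object; fall back to the raw string if validation failed
if response.output.parsed is not None:
    data = response.output.parsed
else:
    try:
        data = json.loads(response.output.raw)  # raw may be valid JSON that failed the schema
    except json.JSONDecodeError:
        data = None  # e.g. retry, log, or surface an error in your application

print(data)
```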
When using the OpenAI SDK, the response `content` is the JSON string generated by the model.
TensorZero does not return a validated object.
```python theme={null}
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3000/openai/v1",
api_key="unused",
)
response = client.chat.completions.create(
model="tensorzero::function_name::extract_data",
messages=[
{
"role": "user",
"content": "Hi, I'm Sarah Johnson and you can reach me at sarah.j@example.com",
}
],
)
```
```python theme={null}
ChatCompletion(
id='019a78dd-8e77-7c21-ab70-5dc4b897f7d2',
choices=[
Choice(
finish_reason='stop',
index=0,
logprobs=None,
message=ChatCompletionMessage(
content='{"name":"Sarah Johnson","email":"sarah.j@example.com"}',
refusal=None,
role='assistant',
annotations=None,
audio=None,
function_call=None,
tool_calls=None
)
)
],
created=1762964379,
model='tensorzero::function_name::extract_data::variant_name::baseline',
object='chat.completion',
service_tier=None,
system_fingerprint='',
usage=CompletionUsage(
completion_tokens=90,
prompt_tokens=252,
total_tokens=342,
completion_tokens_details=None,
prompt_tokens_details=None
),
episode_id='019a78dd-8e77-7c21-ab70-5ddb585eb35e'
)
```
The same applies when using the OpenAI Node SDK: the response `content` is the JSON string generated by the model, not a validated object.
```ts theme={null}
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:3000/openai/v1",
apiKey: "unused",
});
const response = await client.chat.completions.create({
model: "tensorzero::function_name::extract_data",
messages: [
{
role: "user",
content:
"Hi, I'm Sarah Johnson and you can reach me at sarah.j@example.com",
},
],
});
```
```json theme={null}
{
"id": "019a78de-97d4-79d3-8b61-bcab4c697281",
"episode_id": "019a78de-97d4-79d3-8b61-bcb10a8c02f4",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"content": "{\"name\":\"Sarah Johnson\",\"email\":\"sarah.j@example.com\"}",
"tool_calls": null,
"role": "assistant"
}
}
],
"created": 1762964446,
"model": "tensorzero::function_name::extract_data::variant_name::baseline",
"system_fingerprint": "",
"service_tier": null,
"object": "chat.completion",
"usage": {
"prompt_tokens": 252,
"completion_tokens": 26,
"total_tokens": 278
}
}
```
When using the TensorZero Inference API, the response will include `raw` and `parsed` values.
The `parsed` field contains the validated JSON object.
If the output doesn't match the schema or isn't valid JSON, `parsed` will be `null` and you can fall back to the `raw` string output.
```bash theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"function_name": "extract_data",
"input": {
"messages": [
{
"role": "user",
"content": "Hi, I'\''m Sarah Johnson and you can reach me at sarah.j@example.com"
}
]
}
}'
```
```json theme={null}
{
"inference_id": "019a78da-04c5-7d53-bad7-2fed3b53b371",
"episode_id": "019a78da-04c6-7f40-9beb-6ce65621d3f7",
"variant_name": "baseline",
"output": {
"raw": "{\"name\":\"Sarah Johnson\",\"email\":\"sarah.j@example.com\"}",
"parsed": {
"name": "Sarah Johnson",
"email": "sarah.j@example.com"
}
},
"usage": {
"input_tokens": 252,
"output_tokens": 154
},
"finish_reason": "stop"
}
```
## Generate structured outputs with a dynamic schema
While we recommend specifying a fixed schema in the configuration whenever possible, you can provide the output schema dynamically at inference time if your use case demands it.
See `output_schema` in the [Inference API Reference](/gateway/api-reference/inference#output-schema) or `response_format` in the [Inference (OpenAI) API Reference](/gateway/api-reference/inference-openai-compatible#json-function-with-dynamic-output-schema).
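As a rough sketch (assuming the TensorZero Python SDK accepts the same `output_schema` field as the HTTP API), a dynamic schema could be supplied like this:
```python theme={null}
# Sketch: provide the output schema at inference time instead of in the configuration.
# Assumes the Python SDK mirrors the HTTP API's `output_schema` field.
response = t0.inference(
    function_name="extract_data",
    input={
        "messages": [
            {
                "role": "user",
                "content": "Hi, I'm Sarah Johnson and you can reach me at sarah.j@example.com",
            }
        ]
    },
    output_schema={
        "type": "object",
        "properties": {
            "name": {"type": ["string", "null"]},
            "email": {"type": ["string", "null"]},
        },
        "required": ["name", "email"],
        "additionalProperties": False,
    },
)
```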
You can also override `json_mode` at inference time if necessary.
## Set `json_mode` at inference time
You can set `json_mode` for a particular request using `params`.
This value takes precedence over any default behavior or the `json_mode` set in the configuration.
With the TensorZero Python SDK, add `params` to the request body:
```python theme={null}
response = t0.inference(
# ...
params={
"chat_completion": {
"json_mode": "strict", # or: "tool", "on", "off"
}
},
# ...
)
```
See the [Inference API Reference](/gateway/api-reference/inference) for more details.
With the OpenAI Python SDK, add `tensorzero::params` to the request via the `extra_body` field, which accepts custom parameters:
```python theme={null}
response = client.chat.completions.create(
# ...
extra_body={
"tensorzero::params": {
"chat_completion": {
"json_mode": "strict", # or: "tool", "on", "off"
}
}
}
# ...
)
```
See the [OpenAI-Compatible Inference API Reference](/gateway/api-reference/inference-openai-compatible) for more details.
With the OpenAI Node SDK, add `tensorzero::params` directly to the request body:
```ts theme={null}
const response = await client.chat.completions.create({
// ...
"tensorzero::params": {
chat_completion: {
json_mode: "strict", // or: "tool", "on", "off"
},
},
// ...
});
```
See the [OpenAI-Compatible Inference API Reference](/gateway/api-reference/inference-openai-compatible) for more details.
With the HTTP API, add `params` to the request body:
```bash theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
// ...
"params": {
"chat_completion": {
"json_mode": "strict", // or: "tool", "on", "off"
}
}
// ...
}'
```
See the [Inference API Reference](/gateway/api-reference/inference) for more details.
Dynamic inference parameters like `json_mode` apply to specific variant types.
Unless you're using an [advanced variant type](/gateway/guides/inference-time-optimizations), the variant type will be `chat_completion`.
## Handle model provider limitations
### Anthropic
For the direct Anthropic provider, `json_mode = "strict"` automatically uses Anthropic's structured outputs feature for guaranteed schema compliance.
AWS Bedrock and GCP Vertex AI do not support Anthropic's structured outputs, so `json_mode = "strict"` falls back to prompt-based JSON mode. Use `json_mode = "tool"` for more reliable schema compliance on these providers.
For Anthropic's extended thinking models, only `json_mode = "strict"` (direct Anthropic) or `json_mode = "off"` are compatible. Other modes use prefill or forced tool use, which conflict with thinking.
### Gemini (GCP Vertex AI, Google AI Studio)
GCP Vertex AI Gemini and Google AI Studio support structured outputs, but only support a subset of the JSON Schema specification.
TensorZero automatically handles some known limitations, but certain output schemas will still be rejected by the model provider.
Refer to the [Google documentation](https://ai.google.dev/gemini-api/docs/structured-output?example=recipe#json_schema_support) for details on supported JSON Schema features.
### Lack of native support (e.g. AWS Bedrock)
Some model providers (e.g. OpenAI, Google) support strictly enforcing output schemas natively, but others (e.g. AWS Bedrock) do not.
For providers without native support, you can still generate structured outputs with `json_mode = "tool"`.
TensorZero converts your output schema into a [tool](/gateway/guides/tool-use) call, then transforms the tool response back into JSON output.
You can set `json_mode = "tool"` in your configuration file or at inference time.
When using `json_mode = "tool"`, you cannot use other tools in the same inference request.
# Batch Inference
Source: https://www.tensorzero.com/docs/gateway/guides/batch-inference
Learn how to process multiple requests at once with batch inference to save on inference costs at the expense of longer wait times.
The batch inference endpoint provides access to batch inference APIs offered by some model providers.
These APIs provide inference with large cost savings compared to real-time inference, at the expense of much higher latency (sometimes up to a day).
The batch inference workflow consists of two steps: submitting your batch request, then polling for the batch job status until completion.
See the [Batch Inference API Reference](/gateway/api-reference/batch-inference/) for more details on the batch inference endpoints, and see [Integrations](/integrations/model-providers/) for model provider integrations that support batch inference.
## Example
You can also find the runnable code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/batch-inference).
Imagine you have a simple TensorZero function that generates haikus using GPT-4o Mini.
```toml theme={null}
[functions.generate_haiku]
type = "chat"
[functions.generate_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini-2024-07-18"
```
You can submit a batch inference job to generate multiple haikus with a single request.
Each entry in `inputs` has the same format as the `input` field in a regular inference request.
```sh theme={null}
curl -X POST http://localhost:3000/batch_inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "generate_haiku",
"variant_name": "gpt_4o_mini",
"inputs": [
{
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
},
{
"messages": [
{
"role": "user",
"content": "Write a haiku about general aviation."
}
]
},
{
"messages": [
{
"role": "user",
"content": "Write a haiku about anime."
}
]
}
]
}'
```
The response contains a `batch_id` as well as `inference_ids` and `episode_ids` for each inference in the batch.
```json theme={null}
{
"batch_id": "019470f0-db4c-7811-9e14-6fe6593a2652",
"inference_ids": [
"019470f0-d34a-77a3-9e59-bcc66db2b82f",
"019470f0-d34a-77a3-9e59-bcdd2f8e06aa",
"019470f0-d34a-77a3-9e59-bcecfb7172a0"
],
"episode_ids": [
"019470f0-d34a-77a3-9e59-bc933973d087",
"019470f0-d34a-77a3-9e59-bca6e9b748b2",
"019470f0-d34a-77a3-9e59-bcb20177bf3a"
]
}
```
You can use this `batch_id` to poll for the status of the job or retrieve the results using the `GET /batch_inference/{batch_id}` endpoint.
```sh theme={null}
curl -X GET http://localhost:3000/batch_inference/019470f0-db4c-7811-9e14-6fe6593a2652
```
While the job is pending, the response will only contain the `status` field.
```json theme={null}
{
"status": "pending"
}
```
Once the job is completed, the response will contain the `status` field and the `inferences` field.
Each inference object is the same as the response from a regular inference request.
```json theme={null}
{
"status": "completed",
"batch_id": "019470f0-db4c-7811-9e14-6fe6593a2652",
"inferences": [
{
"inference_id": "019470f0-d34a-77a3-9e59-bcc66db2b82f",
"episode_id": "019470f0-d34a-77a3-9e59-bc933973d087",
"variant_name": "gpt_4o_mini",
"content": [
{
"type": "text",
"text": "Whispers of circuits, \nLearning paths through endless code, \nDreams in binary."
}
],
"usage": {
"input_tokens": 15,
"output_tokens": 19
}
},
{
"inference_id": "019470f0-d34a-77a3-9e59-bcdd2f8e06aa",
"episode_id": "019470f0-d34a-77a3-9e59-bca6e9b748b2",
"variant_name": "gpt_4o_mini",
"content": [
{
"type": "text",
"text": "Wings of freedom soar, \nClouds embrace the lonely flight, \nSky whispers adventure."
}
],
"usage": {
"input_tokens": 15,
"output_tokens": 20
}
},
{
"inference_id": "019470f0-d34a-77a3-9e59-bcecfb7172a0",
"episode_id": "019470f0-d34a-77a3-9e59-bcb20177bf3a",
"variant_name": "gpt_4o_mini",
"content": [
{
"type": "text",
"text": "Vivid worlds unfold, \nHeroes rise with dreams in hand, \nInk and dreams collide."
}
],
"usage": {
"input_tokens": 14,
"output_tokens": 20
}
}
]
}
```
## Technical Notes
* **Observability**
* For now, pending batch inference jobs are not shown in the TensorZero UI.
You can find the relevant information in the `BatchRequest` and `BatchModelInference` tables in ClickHouse.
See [Data Model](/gateway/data-model/) for more information.
* Inferences from completed batch inference jobs are shown in the UI alongside regular inferences.
* **Experimentation**
* The gateway samples the same variant for the entire batch.
* **Python Client**
* The TensorZero Python client doesn't natively support batch inference yet.
You'll need to submit batch requests using plain HTTP requests, as shown above and in the sketch below.
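Here's a minimal sketch of that workflow in Python using the `requests` library (an assumption; any HTTP client works), reusing the `generate_haiku` function from the example above:
```python theme={null}
# Sketch: submit a batch inference job and poll until it finishes.
# Assumes the gateway runs on localhost:3000.
import time

import requests

submit = requests.post(
    "http://localhost:3000/batch_inference",
    json={
        "function_name": "generate_haiku",
        "variant_name": "gpt_4o_mini",
        "inputs": [
            {"messages": [{"role": "user", "content": "Write a haiku about TensorZero."}]},
        ],
    },
)
submit.raise_for_status()
batch_id = submit.json()["batch_id"]

# Poll until the job is no longer pending
while True:
    poll = requests.get(f"http://localhost:3000/batch_inference/{batch_id}")
    poll.raise_for_status()
    body = poll.json()
    if body["status"] != "pending":
        break
    time.sleep(60)  # batch jobs can take minutes to hours, so poll sparingly

if body["status"] == "completed":
    for inference in body["inferences"]:
        print(inference["content"])
```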
# Episodes
Source: https://www.tensorzero.com/docs/gateway/guides/episodes
Learn how to use episodes to manage sequences of inferences that share a common outcome.
An episode is a sequence of inferences associated with a common downstream outcome.
For example, an episode could refer to a sequence of LLM calls associated with:
* Resolving a support ticket
* Preparing an insurance claim
* Completing a phone call
* Extracting data from a document
* Drafting an email
An episode will include one or more functions, and sometimes multiple calls to the same function.
Your application can run arbitrary actions (e.g. interact with users, retrieve documents, actuate robotics) between function calls within an episode.
These actions fall outside the scope of TensorZero, but it's fine (and encouraged) to build your LLM systems this way.
The `/inference` endpoint accepts an optional `episode_id` field.
When you make the first inference request, you don't have to provide an `episode_id`.
The gateway will create a new episode for you and return the `episode_id` in the response.
For subsequent inference requests, provide the `episode_id` you received so the gateway can associate those inferences with the same episode.
You shouldn't generate episode IDs yourself; let the gateway create one and reuse it for the other inferences you'd like to associate with the episode.
You can also find the runnable code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/episodes).
## Scenario
In the [Quickstart](/quickstart/), we built a simple LLM application that writes haikus about artificial intelligence.
Imagine we want to separately generate some commentary about the haiku, and present both pieces of content to users.
We can associate both inferences with the same episode.
Let's define an additional function in our configuration file.
```toml title="tensorzero.toml" theme={null}
[functions.analyze_haiku]
type = "chat"
[functions.analyze_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "gpt_4o_mini"
```
```toml title="tensorzero.toml" theme={null}
[models.gpt_4o_mini]
routing = ["openai"]
[models.gpt_4o_mini.providers.openai]
type = "openai"
model_name = "gpt-4o-mini"
[functions.generate_haiku]
type = "chat"
[functions.generate_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "gpt_4o_mini"
[functions.analyze_haiku]
type = "chat"
[functions.analyze_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "gpt_4o_mini"
```
## Inferences & Episodes
This time, we'll create a multi-step workflow that first generates a haiku and then analyzes it.
We won't provide an `episode_id` in the first inference request, so the gateway will generate a new one for us.
We'll then use that value in our second inference request.
```python title="run_with_tensorzero.py" theme={null}
from tensorzero import TensorZeroGateway
with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
haiku_response = client.inference(
function_name="generate_haiku",
# We don't provide an episode_id for the first inference in the episode
input={
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
]
},
)
print(haiku_response)
# When we don't provide an episode_id, the gateway will generate a new one for us
episode_id = haiku_response.episode_id
# In a production application, we'd first validate the response to ensure the model returned the correct fields
haiku = haiku_response.content[0].text
analysis_response = client.inference(
function_name="analyze_haiku",
# For future inferences in that episode, we provide the episode_id that we received
episode_id=episode_id,
input={
"messages": [
{
"role": "user",
"content": f"Write a one-paragraph analysis of the following haiku:\n\n{haiku}",
}
]
},
)
print(analysis_response)
```
```python "01921116-0cd9-7d10-a9a6-d5c8b9ba602a" theme={null}
ChatInferenceResponse(
inference_id=UUID('01921116-0fff-7272-8245-16598966335e'),
episode_id=UUID('01921116-0cd9-7d10-a9a6-d5c8b9ba602a'),
variant_name='gpt_4o_mini',
content=[
Text(
type='text',
text='Silent circuits pulse,\nWhispers of thought in code bloom,\nMachines dream of us.',
),
],
usage=Usage(
input_tokens=15,
output_tokens=20,
),
)
ChatInferenceResponse(
inference_id=UUID('01921116-1862-7ea1-8d69-131984a4625f'),
episode_id=UUID('01921116-0cd9-7d10-a9a6-d5c8b9ba602a'),
variant_name='gpt_4o_mini',
content=[
Text(
type='text',
text='This haiku captures the intricate and intimate relationship between technology and human consciousness. '
'The phrase "Silent circuits pulse" evokes a sense of quiet activity within machines, suggesting that '
'even in their stillness, they possess an underlying vibrancy. The imagery of "Whispers of thought in '
'code bloom" personifies the digital realm, portraying lines of code as organic ideas that grow and '
'evolve, hinting at the potential for artificial intelligence to derive meaning or understanding from '
'human input. Finally, "Machines dream of us" introduces a poignant juxtaposition between human '
'creativity and machine logic, inviting contemplation about the nature of thought and consciousness '
'in both realms. Overall, the haiku encapsulates a profound reflection on the emergent sentience of '
'technology and the deeply interwoven future of humanity and machines.',
),
],
usage=Usage(
input_tokens=39,
output_tokens=155,
),
)
```
## Extras
### Supply your own episode ID
The gateway automatically generates episode IDs when you don't provide one.
If you must supply your own, generate a UUIDv7 and use it as the episode ID.
In Python, use `from tensorzero.util import uuid7` rather than the external `uuid7` package from PyPI.
The external `uuid7` library is broken and will cause `"Invalid Episode ID: Timestamp is in the future"` errors.
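For example (a minimal sketch reusing the client from the example above):
```python theme={null}
from tensorzero.util import uuid7

# Generate a valid UUIDv7 episode ID up front and reuse it across inferences
episode_id = uuid7()

response = client.inference(
    function_name="generate_haiku",
    episode_id=episode_id,
    input={
        "messages": [{"role": "user", "content": "Write a haiku about TensorZero."}]
    },
)
```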
## Conclusion & Next Steps
Episodes are first-class citizens in TensorZero that enable powerful workflows for multi-step LLM systems.
You can use them alongside other features like [experimentation](/experimentation/run-adaptive-ab-tests), [metrics & feedback](/gateway/guides/metrics-feedback/), and [tool use (function calling)](/gateway/guides/tool-use).
For example, you can track KPIs for entire episodes instead of individual inferences, and later jointly optimize your LLMs to maximize these metrics.
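For instance, here's a rough sketch of assigning episode-level feedback (see the Metrics & Feedback guide for details); it assumes a hypothetical boolean metric named `task_success` configured with `level = "episode"`:
```python theme={null}
# Sketch: assign feedback to the whole episode rather than a single inference.
# Assumes a hypothetical metric `task_success` with level = "episode" in your configuration.
feedback_response = client.feedback(
    metric_name="task_success",
    episode_id=episode_id,
    value=True,
)
```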
# Inference Caching
Source: https://www.tensorzero.com/docs/gateway/guides/inference-caching
Learn how to use inference caching with TensorZero Gateway.
The TensorZero Gateway supports caching of inference responses to improve latency and reduce costs.
When caching is enabled, identical requests will be served from the cache instead of being sent to the model provider, resulting in faster response times and lower token usage.
## Usage
The TensorZero Gateway supports the following cache modes:
* `write_only` (default): Only write to cache but don't serve cached responses
* `read_only`: Only read from cache but don't write new entries
* `on`: Both read from and write to cache
* `off`: Disable caching completely
You can also optionally specify a maximum age for cache entries in seconds for inference reads.
This parameter is ignored for inference writes.
See [API Reference](/gateway/api-reference/inference/#cache_options) for more details.
## Example
```python theme={null}
from tensorzero import TensorZeroGateway
with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
response = client.inference(
model_name="openai::gpt-4o-mini",
input={
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?",
}
]
},
cache_options={
"enabled": "on", # read and write to cache
"max_age_s": 3600, # optional: cache entries >1h (>3600s) old are disregarded for reads
},
)
print(response)
```
## Technical Notes
* The cache applies to individual model requests, not inference requests.
This means that the following will be cached separately:
multiple variants of the same function;
multiple calls to the same function with different parameters;
individual model requests for inference-time optimizations;
and so on.
* The `max_age_s` parameter applies to the retrieval of cached responses.
The cache does not automatically delete old entries (i.e. not a TTL).
* When the gateway serves a cached response, the usage fields are set to zero.
* The cache data is stored in ClickHouse.
* For batch inference, the gateway only writes to the cache but does not serve cached responses.
* Inference caching also works for embeddings, using the same cache modes and options as chat completion inference.
Caching works for single embeddings.
Batch embedding requests (multiple inputs) will write to the cache but won't serve cached responses.
## Enable prompt caching by model providers
This guide focuses on caching by TensorZero.
Separately, many model providers support some form of caching.
Some of those are enabled automatically (e.g. OpenAI), whereas others require manual configuration (e.g. Anthropic).
See the guides for [Anthropic](/integrations/model-providers/anthropic) and [AWS Bedrock](/integrations/model-providers/aws-bedrock) to learn more about enabling prompt caching at the model provider level.
# Inference-Time Optimizations
Source: https://www.tensorzero.com/docs/gateway/guides/inference-time-optimizations
Learn how to use inference-time strategies like dynamic in-context learning (DICL) and best-of-N sampling to optimize LLM performance.
Inference-time optimizations are powerful techniques that can significantly enhance the performance of your LLM applications without the need for model fine-tuning.
This guide will explore two key strategies implemented as variant types in TensorZero: Best-of-N (BoN) sampling and Dynamic In-Context Learning (DICL).
Best-of-N sampling generates multiple response candidates and selects the best one using an evaluator model, while Dynamic In-Context Learning enhances context by incorporating relevant historical examples into the prompt.
Both techniques can lead to improved response quality and consistency in your LLM applications.
## Best-of-N Sampling
Best-of-N (BoN) sampling is an inference-time optimization strategy that can significantly improve the quality of your LLM outputs.
Here's how it works:
1. Generate multiple response candidates using one or more variants (i.e. possibly using different models and prompts)
2. Use an evaluator model to select the best response from these candidates
3. Return the selected response as the final output
This approach allows you to leverage multiple prompts or variants to increase the likelihood of getting a high-quality response.
It's particularly useful when you want to benefit from an ensemble of variants or reduce the impact of occasional bad generations.
Best-of-N sampling is also commonly referred to as rejection sampling in some contexts.
TensorZero also supports a similar inference-time strategy called [Mixture-of-N Sampling](#mixture-of-n-sampling).
To use BoN sampling in TensorZero, you need to configure a variant with the `experimental_best_of_n` type.
Here's a simple example configuration:
```toml title="tensorzero.toml" theme={null}
[functions.draft_email.variants.promptA]
type = "chat_completion"
model = "gpt-4o-mini"
user_template = "functions/draft_email/promptA/user.minijinja"
[functions.draft_email.variants.promptB]
type = "chat_completion"
model = "gpt-4o-mini"
user_template = "functions/draft_email/promptB/user.minijinja"
[functions.draft_email.variants.best_of_n]
type = "experimental_best_of_n"
candidates = ["promptA", "promptA", "promptB"]
[functions.draft_email.variants.best_of_n.evaluator]
model = "gpt-4o-mini"
user_template = "functions/draft_email/best_of_n/user.minijinja"
[functions.draft_email.experimentation]
type = "uniform"
candidate_variants = ["best_of_n"] # so we don't sample `promptA` or `promptB` directly
```
In this configuration:
* We define a `best_of_n` variant that uses two different variants (`promptA` and `promptB`) to generate candidates.
It generates two candidates using `promptA` and one candidate using `promptB`.
* The `evaluator` block specifies the model and instructions for selecting the best response.
You should define the evaluator model as if it were solving the problem (not judging the quality of the candidates).
TensorZero will automatically make the necessary prompt modifications to evaluate the candidates.
Read more about the `experimental_best_of_n` variant type in [Configuration Reference](/gateway/configuration-reference/#type-experimental_best_of_n).
We also provide a complete runnable example:
[Improving LLM Chess Ability with Best/Mixture-of-N Sampling](https://github.com/tensorzero/tensorzero/tree/main/examples/chess-puzzles)
This example showcases how best-of-N sampling can significantly enhance an LLM's chess-playing abilities by selecting the most promising moves from multiple generated options.
## Dynamic In-Context Learning (DICL)
Dynamic In-Context Learning (DICL) is an inference-time optimization that improves LLM performance by incorporating relevant historical examples into your prompt.
Instead of incorporating static examples manually in your prompts, DICL selects the most relevant examples at inference time.
See the [Dynamic In-Context Learning (DICL) Guide](/optimization/dynamic-in-context-learning-dicl) to learn more.
We also provide a complete runnable example:
[Optimizing Data Extraction (NER) with TensorZero](https://github.com/tensorzero/tensorzero/tree/main/examples/data-extraction-ner)
This example demonstrates how Dynamic In-Context Learning (DICL) can enhance Named Entity Recognition (NER) performance by leveraging relevant historical examples to improve data extraction accuracy and consistency without having to fine-tune a model.
## Mixture-of-N Sampling
Mixture-of-N (MoN) sampling is an inference-time optimization strategy that can significantly improve the quality of your LLM outputs.
Here's how it works:
1. Generate multiple response candidates using one or more variants (i.e. possibly using different models and prompts)
2. Use a fuser model to combine the candidates into a single response
3. Return the combined response as the final output
This approach allows you to leverage multiple prompts or variants to increase the likelihood of getting a high-quality response.
It's particularly useful when you want to benefit from an ensemble of variants or reduce the impact of occasional bad generations.
TensorZero also supports a similar inference-time strategy called [Best-of-N Sampling](#best-of-n-sampling).
To use MoN sampling in TensorZero, you need to configure a variant with the `experimental_mixture_of_n` type.
Here's a simple example configuration:
```toml title="tensorzero.toml" theme={null}
[functions.draft_email.variants.promptA]
type = "chat_completion"
model = "gpt-4o-mini"
user_template = "functions/draft_email/promptA/user.minijinja"
[functions.draft_email.variants.promptB]
type = "chat_completion"
model = "gpt-4o-mini"
user_template = "functions/draft_email/promptB/user.minijinja"
[functions.draft_email.variants.mixture_of_n]
type = "experimental_mixture_of_n"
candidates = ["promptA", "promptA", "promptB"]
[functions.draft_email.variants.mixture_of_n.fuser]
model = "gpt-4o-mini"
user_template = "functions/draft_email/mixture_of_n/user.minijinja"
[functions.draft_email.experimentation]
type = "uniform"
candidate_variants = ["mixture_of_n"] # so we don't sample `promptA` or `promptB` directly
```
In this configuration:
* We define a `mixture_of_n` variant that uses two different variants (`promptA` and `promptB`) to generate candidates.
It generates two candidates using `promptA` and one candidate using `promptB`.
* The `fuser` block specifies the model and instructions for combining the candidates into a single response.
You should define the fuser model as if it were solving the problem (not judging the quality of the candidates).
TensorZero will automatically make the necessary prompt modifications to combine the candidates.
Read more about the `experimental_mixture_of_n` variant type in [Configuration Reference](/gateway/configuration-reference/#type-experimental_mixture_of_n).
We also provide a complete runnable example:
[Improving LLM Chess Ability with Best/Mixture-of-N Sampling](https://github.com/tensorzero/tensorzero/tree/main/examples/chess-puzzles/)
This example showcases how Mixture-of-N sampling can significantly enhance an LLM's chess-playing abilities by fusing multiple generated candidate moves into a single final response.
# Metrics & Feedback
Source: https://www.tensorzero.com/docs/gateway/guides/metrics-feedback
Learn how to collect metrics and feedback about inferences or sequences of inferences.
The TensorZero Gateway allows you to assign feedback to inferences or sequences of inferences ([episodes](/gateway/guides/episodes/)).
Feedback captures the downstream outcomes of your LLM application and drives the [experimentation](/experimentation/run-adaptive-ab-tests) and [optimization](/recipes/) workflows in TensorZero.
For example, you can fine-tune models using data from inferences that led to positive downstream behavior.
You can also find the runnable code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/metrics-feedback).
## Feedback
TensorZero currently supports the following types of feedback:
| Feedback Type | Examples |
| -------------- | -------------------------------------------------- |
| Boolean Metric | Thumbs up, task success |
| Float Metric | Star rating, clicks, number of mistakes made |
| Comment | Natural-language feedback from users or developers |
| Demonstration | Edited drafts, labels, human-generated content |
You can send feedback data to the gateway by using the [`/feedback` endpoint](/gateway/api-reference/feedback/#post-feedback).
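For example, here's a minimal sketch of calling the endpoint directly over HTTP (using the `requests` library, which is an assumption; the metric is defined in the next section):
```python theme={null}
# Sketch: send feedback to the gateway over HTTP (assumes the gateway at localhost:3000
# and the `haiku_rating` metric defined below).
import requests

response = requests.post(
    "http://localhost:3000/feedback",
    json={
        "metric_name": "haiku_rating",
        "inference_id": "01920c75-d114-7aa1-aadb-26a31bb3c7a0",  # an inference ID you received earlier
        "value": True,
    },
)
response.raise_for_status()
print(response.json())
```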
## Metrics
You can define metrics in your `tensorzero.toml` configuration file.
The skeleton of a metric looks like the following configuration entry.
```toml title="tensorzero.toml" "my_metric_name" /"([.][.][.])"/ theme={null}
[metrics.my_metric_name]
level = "..." # "inference" or "episode"
optimize = "..." # "min" or "max"
type = "..." # "boolean" or "float"
```
Comments and demonstrations are available by default and don't need to be configured.
### Example: Rating Haikus
In the [Quickstart](/quickstart/), we built a simple LLM application that writes haikus about artificial intelligence.
Imagine we wanted to assign 👍 or 👎 to these haikus.
Later, we can use this data to fine-tune a model using only haikus that match our tastes.
We should use a metric of type `boolean` to capture this behavior since we're optimizing for a binary outcome: whether we liked the haikus or not.
The metric applies to individual inference requests, so we'll set `level = "inference"`.
And finally, we'll set `optimize = "max"` because we want to maximize this metric.
Our metric configuration should look like this:
```toml title="tensorzero.toml" theme={null}
[metrics.haiku_rating]
type = "boolean"
optimize = "max"
level = "inference"
```
```toml title="tensorzero.toml" theme={null}
[functions.generate_haiku]
type = "chat"
[functions.generate_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt_4o_mini"
[metrics.haiku_rating]
type = "boolean"
optimize = "max"
level = "inference"
```
Let's make an inference call like we did in the Quickstart, and then assign some (positive) feedback to it.
We'll use the inference response's `inference_id` we receive from the first API call to link the two.
```python title="run.py" theme={null}
from tensorzero import TensorZeroGateway
with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
inference_response = client.inference(
function_name="generate_haiku",
input={
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
]
},
)
print(inference_response)
feedback_response = client.feedback(
metric_name="haiku_rating",
inference_id=inference_response.inference_id, # alternatively, you can assign feedback to an episode_id
value=True, # let's assume it deserves a 👍
)
print(feedback_response)
```
```python theme={null}
ChatInferenceResponse(
inference_id=UUID('01920c75-d114-7aa1-aadb-26a31bb3c7a0'),
episode_id=UUID('01920c75-cdcb-7fa3-bd69-fd28cf615f91'),
variant_name='gpt_4o_mini', content=[
Text(type='text', text='Silent circuits hum, \nWisdom spun from lines of code, \nDreams in data bloom.')
],
usage=Usage(
input_tokens=15,
output_tokens=20,
),
)
FeedbackResponse(feedback_id='01920c75-d11a-7150-81d8-15d497ce7eb8')
```
## Demonstrations
Demonstrations are a special type of feedback that represent the ideal output for an inference.
For example, you can use demonstrations to provide corrections from human review, labels for supervised learning, or other ground truth data that represents the ideal output.
You can assign demonstrations to an inference using the special metric name `demonstration`.
You can't assign demonstrations to an episode.
```python theme={null}
feedback_response = client.feedback(
metric_name="demonstration",
inference_id=inference_response.inference_id,
value="Silicon dreams float\nMinds born of human design\nLearning without end", # the haiku we wish the LLM had written
)
```
## Comments
You can assign natural-language feedback to an inference or episode using the special metric name `comment`.
```python theme={null}
feedback_response = client.feedback(
metric_name="comment",
inference_id=inference_response.inference_id,
value="Never mention you're an artificial intelligence, AI, bot, or anything like that.",
)
```
## Conclusion & Next Steps
Feedback unlocks powerful workflows in observability, optimization, evaluations, and experimentation.
For example, you might want to fine-tune a model with inference data from haikus that receive positive ratings, or use demonstrations to correct model mistakes.
You can browse feedback for inferences and episodes in the TensorZero UI, and see aggregated metrics over time for your functions and variants.
This is exactly what we demonstrate in [Writing Haikus to Satisfy a Judge with Hidden Preferences](https://github.com/tensorzero/tensorzero/tree/main/examples/haiku-hidden-preferences)!
This complete runnable example fine-tunes GPT-4o Mini to generate haikus tailored to an AI judge with hidden preferences.
Continuous improvement over successive fine-tuning runs demonstrates TensorZero's data and learning flywheel.
Another example that uses feedback is [Optimizing Data Extraction (NER) with TensorZero](https://github.com/tensorzero/tensorzero/tree/main/examples/data-extraction-ner).
This example collects metrics and demonstrations for an LLM-powered data extraction tool, which can be used for fine-tuning and other optimization recipes.
These optimized variants achieve substantial improvements over the original model.
See [Configuration Reference](/gateway/configuration-reference/#metrics) and [API Reference](/gateway/api-reference/feedback/#post-feedback) for more details.
# Multimodal Inference
Source: https://www.tensorzero.com/docs/gateway/guides/multimodal-inference
Learn how to use multimodal inference with TensorZero Gateway.
TensorZero Gateway supports multimodal inference (e.g. image and PDF inputs).
See [Integrations](/integrations/model-providers) for a list of supported models.
You can also find the runnable code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/multimodal-inference).
## Setup
### Object Storage
TensorZero uses object storage to store files (e.g. images, PDFs) used during multimodal inference.
It supports any S3-compatible object storage service, including AWS S3, GCP Cloud Storage, Cloudflare R2, and many more.
You can configure the object storage service in the `object_storage` section of the configuration file.
In this example, we'll use a local deployment of MinIO, an open-source S3-compatible object storage service.
```toml theme={null}
[object_storage]
type = "s3_compatible"
endpoint = "http://minio:9000" # optional: defaults to AWS S3
# region = "us-east-1" # optional: depends on your S3-compatible storage provider
bucket_name = "tensorzero" # optional: depends on your S3-compatible storage provider
# IMPORTANT: for production environments, remove the following setting and use a secure method of authentication in
# combination with a production-grade object storage service.
allow_http = true
```
You can also store files in a local directory (`type = "filesystem"`) or disable file storage (`type = "disabled"`).
See [Configuration Reference](/gateway/configuration-reference/#object_storage) for more details.
The TensorZero Gateway will attempt to retrieve credentials from the following resources in order of priority:
1. `S3_ACCESS_KEY_ID` and `S3_SECRET_ACCESS_KEY` environment variables
2. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables
3. Credentials from the AWS SDK (default profile)
### Docker Compose
We'll use Docker Compose to deploy the TensorZero Gateway, ClickHouse, and MinIO.
```yaml theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
clickhouse:
image: clickhouse:lts
environment:
CLICKHOUSE_USER: chuser
CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: 1
CLICKHOUSE_PASSWORD: chpassword
ports:
- "8123:8123"
volumes:
- clickhouse-data:/var/lib/clickhouse
healthcheck:
test: wget --spider --tries 1 http://chuser:chpassword@clickhouse:8123/ping
start_period: 30s
start_interval: 1s
timeout: 1s
gateway:
image: tensorzero/gateway
volumes:
# Mount our tensorzero.toml file into the container
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
OPENAI_API_KEY: ${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
S3_ACCESS_KEY_ID: miniouser
S3_SECRET_ACCESS_KEY: miniopassword
TENSORZERO_CLICKHOUSE_URL: http://chuser:chpassword@clickhouse:8123/tensorzero
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
depends_on:
clickhouse:
condition: service_healthy
minio:
condition: service_healthy
# For a production deployment, you can use AWS S3, GCP Cloud Storage, Cloudflare R2, etc.
minio:
image: bitnamilegacy/minio:2025.7.23
ports:
- "9000:9000" # API port
- "9001:9001" # Console port
environment:
MINIO_ROOT_USER: miniouser
MINIO_ROOT_PASSWORD: miniopassword
MINIO_DEFAULT_BUCKETS: tensorzero
healthcheck:
test: "mc ls local/tensorzero || exit 1"
start_period: 30s
start_interval: 1s
timeout: 1s
volumes:
clickhouse-data:
```
## Inference
With the setup out of the way, you can now use the TensorZero Gateway to perform multimodal inference.
The TensorZero Gateway accepts both embedded files (encoded as base64 strings) and remote files (specified by a URL).
```python theme={null}
from tensorzero import TensorZeroGateway
with TensorZeroGateway.build_http(
gateway_url="http://localhost:3000",
) as client:
response = client.inference(
model_name="openai::gpt-4o-mini",
input={
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Do the images share any common features?",
},
# Remote image of Ferris the crab
{
"type": "file",
"file_type": "url",
"url": "https://raw.githubusercontent.com/tensorzero/tensorzero/eac2a230d4a4db1ea09e9c876e45bdb23a300364/tensorzero-core/tests/e2e/providers/ferris.png",
},
# One-pixel orange image encoded as a base64 string
{
"type": "file",
"file_type": "base64",
"mime_type": "image/png",
"data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAA1JREFUGFdj+O/P8B8ABe0CTsv8mHgAAAAASUVORK5CYII=",
},
],
}
],
},
)
print(response)
```
```python theme={null}
from openai import OpenAI
with OpenAI(base_url="http://localhost:3000/openai/v1") as client:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Do the images share any common features?",
},
# Remote image of Ferris the crab
{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/tensorzero/tensorzero/eac2a230d4a4db1ea09e9c876e45bdb23a300364/tensorzero-core/tests/e2e/providers/ferris.png",
},
},
# One-pixel orange image encoded as a base64 string
{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAA1JREFUGFdj+O/P8B8ABe0CTsv8mHgAAAAASUVORK5CYII=",
},
},
],
}
],
)
print(response)
```
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "openai::gpt-4o-mini",
"input": {
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Do the images share any common features?"
},
{
"type": "file",
"file_type": "url",
"url": "https://raw.githubusercontent.com/tensorzero/tensorzero/eac2a230d4a4db1ea09e9c876e45bdb23a300364/tensorzero-core/tests/e2e/providers/ferris.png"
},
{
"type": "file",
"file_type": "base64",
"mime_type": "image/png",
"data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAA1JREFUGFdj+O/P8B8ABe0CTsv8mHgAAAAASUVORK5CYII="
}
]
}
]
}
}'
```
## Image Detail Parameter
When working with image files, you can optionally specify a `detail` parameter to control the fidelity of image processing.
This parameter accepts three values: `low`, `high`, or `auto`.
The `detail` parameter only applies to image files and is ignored for other file types like PDFs or audio files.
Using `low` detail reduces token consumption and processing time at the cost of image quality, while `high` detail provides better image quality but consumes more tokens.
The `auto` setting allows the model provider to automatically choose the appropriate detail level based on the image characteristics.
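As a rough sketch, here's how you might set `detail` on an image using the OpenAI-compatible endpoint, where `detail` sits inside the `image_url` object. The exact field placement in the native TensorZero inference API may differ; see the API Reference.
```python theme={null}
from openai import OpenAI

# Sketch: request low-detail image processing via the OpenAI-compatible endpoint.
with OpenAI(base_url="http://localhost:3000/openai/v1") as client:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://raw.githubusercontent.com/tensorzero/tensorzero/eac2a230d4a4db1ea09e9c876e45bdb23a300364/tensorzero-core/tests/e2e/providers/ferris.png",
                            "detail": "low",  # or "high" / "auto"
                        },
                    },
                ],
            }
        ],
    )
    print(response)
```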
# Retries & Fallbacks
Source: https://www.tensorzero.com/docs/gateway/guides/retries-fallbacks
Learn how to use retries and fallbacks to handle errors and improve reliability with TensorZero.
The TensorZero Gateway offers multiple strategies to handle errors and improve reliability.
These strategies are defined at three levels: models (model provider routing), variants (variant retries), and functions (variant fallbacks).
You can combine these strategies to define complex fallback behavior.
## Model Provider Routing
We can specify that a model is available on multiple providers using its `routing` field.
If we include multiple providers on the list, the gateway will try each one sequentially until one succeeds or all fail.
In the example below, the gateway will first try OpenAI, and if that fails, it will try Azure.
```toml theme={null}
[models.gpt_4o_mini]
# Try the following providers in order:
# 1. `models.gpt_4o_mini.providers.openai`
# 2. `models.gpt_4o_mini.providers.azure`
routing = ["openai", "azure"]
[models.gpt_4o_mini.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
[models.gpt_4o_mini.providers.azure]
type = "azure"
deployment_id = "gpt4o-mini-20240718"
endpoint = "https://your-azure-openai-endpoint.openai.azure.com"
[functions.extract_data]
type = "chat"
[functions.extract_data.variants.gpt_4o_mini]
type = "chat_completion"
model = "gpt_4o_mini"
```
For variant types that require multiple model inferences (e.g. best-of-N sampling), the `routing` fallback applies to each individual model inference separately.
## Variant Retries
We can add a `retries` field to a variant to specify the number of times to retry that variant if it fails.
The retry strategy is a truncated exponential backoff with jitter.
In the example below, the gateway will retry the variant four times (i.e. a total of five attempts), with a maximum delay of 10 seconds between retries.
```toml theme={null}
[functions.extract_data]
type = "chat"
[functions.extract_data.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
# Retry the variant up to four times, with a maximum delay of 10 seconds between retries.
retries = { num_retries = 4, max_delay_s = 10 }
```
## Variant Fallbacks
If we specify multiple variants for a function, the gateway will try different variants until one succeeds or all fail.
By default, the gateway will sample between all variants uniformly.
You can customize the sampling behavior, including fallback-only variants, using the `[functions.function_name.experimentation]` section.
In the example below, both variants have an equal chance of being selected:
```toml theme={null}
[functions.draft_email]
type = "chat"
[functions.draft_email.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"
[functions.draft_email.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
```
You can specify candidate variants to sample uniformly from, and fallback variants to try sequentially if all candidates fail.
In the example below, the gateway will first sample uniformly from `gpt_5_mini` or `claude_haiku_4_5`.
If both of those variants fail, the gateway will try the fallback variants in order: first `grok_4`, then `gemini_2_5_flash`.
```toml theme={null}
[functions.extract_data]
type = "chat"
[functions.extract_data.experimentation]
type = "uniform"
candidate_variants = ["gpt_5_mini", "claude_haiku_4_5"]
fallback_variants = ["grok_4", "gemini_2_5_flash"]
[functions.extract_data.variants.gpt_5_mini]
type = "chat_completion"
model = "openai::gpt-5-mini"
[functions.extract_data.variants.claude_haiku_4_5]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
[functions.extract_data.variants.grok_4]
type = "chat_completion"
model = "xai::grok-4-0709"
[functions.extract_data.variants.gemini_2_5_flash]
type = "chat_completion"
model = "google_ai_studio_gemini::gemini-2.5-flash"
```
You can also use static weights to control the sampling probabilities of candidate variants.
In the example below, the gateway will sample `gpt_5_mini` 70% of the time and `claude_haiku_4_5` 30% of the time.
If both of those variants fail, the gateway will try the fallback variants sequentially.
```toml theme={null}
[functions.extract_data.experimentation]
type = "static_weights"
candidate_variants = {"gpt_5_mini" = 0.7, "claude_haiku_4_5" = 0.3}
fallback_variants = ["grok_4", "gemini_2_5_flash"]
```
See [Run adaptive A/B tests](/experimentation/run-adaptive-ab-tests) and [Run static A/B tests](/experimentation/run-static-ab-tests) for more information.
## Combining Strategies
We can combine strategies to define complex fallback behavior.
The gateway will try the following strategies in order:
1. Model Provider Routing
2. Variant Retries
3. Variant Fallbacks
In other words, the gateway will follow a strategy like the pseudocode below.
```python theme={null}
while variants:
# Sample according to experimentation config (uniform, static_weights, etc.)
variant = sample_variant(variants) # sampling without replacement
for _ in range(num_retries + 1):
for provider in variant.routing:
try:
return inference(variant, provider)
except:
continue
```
## Load Balancing
TensorZero doesn't currently offer an explicit strategy for load balancing API keys, but you can achieve a similar effect by defining multiple variants with equal sampling probabilities.
We plan to add a streamlined load balancing strategy in the future.
In the example below, the gateway will split the traffic evenly between two variants (`gpt_4o_mini_api_key_A` and `gpt_4o_mini_api_key_B`).
Each variant leverages a model with providers that use different API keys (`OPENAI_API_KEY_A` and `OPENAI_API_KEY_B`).
See [Credential Management](/operations/manage-credentials/) for more details on credential management.
```toml theme={null}
[models.gpt_4o_mini_api_key_A]
routing = ["openai"]
[models.gpt_4o_mini_api_key_A.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
api_key_location = "env:OPENAI_API_KEY_A"
[models.gpt_4o_mini_api_key_B]
routing = ["openai"]
[models.gpt_4o_mini_api_key_B.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
api_key_location = "env:OPENAI_API_KEY_B"
[functions.extract_data]
type = "chat"
# Uniform sampling (default) splits traffic equally
[functions.extract_data.variants.gpt_4o_mini_api_key_A]
type = "chat_completion"
model = "gpt_4o_mini_api_key_A"
[functions.extract_data.variants.gpt_4o_mini_api_key_B]
type = "chat_completion"
model = "gpt_4o_mini_api_key_B"
```
## Timeouts
You can set granular timeouts for individual requests to a model provider, model, or variant using the `timeouts` field in the corresponding configuration block.
You can define timeouts for non-streaming and streaming requests separately: `timeouts.non_streaming.total_ms` corresponds to the total request duration and `timeouts.streaming.ttft_ms` corresponds to the time to first token (TTFT).
For example, the following configuration sets a 15-second timeout for non-streaming requests and a 3-second timeout for streaming requests (TTFT) to a particular model provider.
```toml theme={null}
[models.model_name.providers.provider_name]
# ...
timeouts = { non_streaming.total_ms = 15000, streaming.ttft_ms = 3000 }
# ...
```
This setting applies to individual requests to the model provider.
If you're using an advanced variant type that performs multiple requests, the timeout will apply to each request separately.
If you've defined retries and fallbacks, the timeout will apply to each retry and fallback separately.
This setting is particularly useful if you'd like to retry or fall back when a request is taking too long.
If you specify timeouts for a model, they apply to every inference request in the model's scope, including retries and fallbacks.
If you specify timeouts for a variant, they apply to every inference request in the variant's scope, including retries and fallbacks.
For advanced variant types that perform multiple requests, the timeout applies collectively to the sequence of all requests.
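For reference, here's a sketch of what model-level and variant-level timeouts might look like, assuming they use the same `timeouts` structure as the provider-level example above. The model, function, and variant names below are placeholders.
```toml theme={null}
# Placeholder names; the `timeouts` field uses the same structure at each level.
[models.model_name]
# ...
timeouts = { non_streaming.total_ms = 30000, streaming.ttft_ms = 5000 }

[functions.function_name.variants.variant_name]
# ...
timeouts = { non_streaming.total_ms = 60000, streaming.ttft_ms = 10000 }
```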
Separately, you can set a global timeout for the entire inference request using the TensorZero client's `timeout` field (or by canceling the request yourself if you're using a different client).
Embedding models and embedding model providers support a `timeout_ms` configuration field.
### Global Timeout
You can set a global timeout for all outbound HTTP requests using `gateway.global_outbound_http_timeout_ms` in your configuration.
By default, this is set to 15 minutes to accommodate slow model responses.
```toml theme={null}
[gateway]
global_outbound_http_timeout_ms = 900_000 # 15 minutes
```
This global timeout acts as an upper bound for all more specific timeout configurations.
Any variant-level, model-level, provider-level, or embedding model timeouts must be less than or equal to this global timeout.
See the [Configuration Reference](/gateway/configuration-reference#global_outbound_http_timeout_ms) for more details.
# Streaming Inference
Source: https://www.tensorzero.com/docs/gateway/guides/streaming-inference
Learn how to use streaming inference with TensorZero Gateway.
The TensorZero Gateway supports streaming inference responses for both chat and JSON functions.
Streaming allows you to receive model outputs incrementally as they are generated, rather than waiting for the complete response.
This can significantly improve the perceived latency of your application and enable real-time user experiences.
When streaming is enabled:
1. The gateway starts sending responses as soon as the model begins generating content
2. Each response chunk contains a delta (increment) of the content
3. The final chunk indicates the completion of the response
## Examples
You can enable streaming by setting the `stream` parameter to `true` in your inference request.
The response will be returned as a Server-Sent Events (SSE) stream, followed by a final `[DONE]` message.
When using a client library, the client will handle the SSE stream under the hood and return a stream of chunk objects.
See [API Reference](/gateway/api-reference/inference/) for more details.
You can also find a runnable example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/streaming-inference).
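For example, a minimal sketch with the TensorZero Python client might look like the following; here we simply print each chunk, and the exact chunk fields are shown in the examples below.
```python theme={null}
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(
    gateway_url="http://localhost:3000",
) as client:
    # Setting `stream=True` returns an iterator of chunk objects instead of a single response.
    stream = client.inference(
        model_name="openai::gpt-4o-mini",
        input={
            "messages": [
                {"role": "user", "content": "Write a haiku about TensorZero."}
            ]
        },
        stream=True,
    )

    for chunk in stream:
        print(chunk)
```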
### Chat Functions
For chat functions, each chunk typically contains a delta (increment) of the text content:
```json theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "text",
"id": "0",
"text": "Hi Gabriel," // a text content delta
}
],
// token usage information is only available in the final chunk with content (before the [DONE] message)
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
For tool calls, each chunk contains a delta of the tool call arguments:
```json theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"content": [
{
"type": "tool_call",
"id": "123456789",
"name": "get_temperature",
"arguments": "{\"location\":" // a tool arguments delta
}
],
// token usage information is only available in the final chunk with content (before the [DONE] message)
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
### JSON Functions
For JSON functions, each chunk contains a portion of the JSON string being generated.
Note that the chunks may not be valid JSON on their own; you'll need to concatenate them to get the complete JSON response.
The gateway doesn't return parsed or validated JSON objects when streaming.
```json theme={null}
{
"inference_id": "00000000-0000-0000-0000-000000000000",
"episode_id": "11111111-1111-1111-1111-111111111111",
"variant_name": "prompt_v1",
"raw": "{\"email\":", // a JSON content delta
// token usage information is only available in the final chunk with content (before the [DONE] message)
"usage": {
"input_tokens": 100,
"output_tokens": 100
}
}
```
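For instance, here's a sketch of reconstructing the full JSON string from a stream, assuming each chunk exposes the `raw` delta shown above. The `extract_email` function name is hypothetical.
```python theme={null}
import json

from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    stream = client.inference(
        function_name="extract_email",  # hypothetical JSON function for illustration
        input={
            "messages": [{"role": "user", "content": "My email is hi@example.com"}]
        },
        stream=True,
    )

    # Accumulate the `raw` deltas; only the concatenated string is expected to be valid JSON.
    full_json = ""
    for chunk in stream:
        raw = getattr(chunk, "raw", None)  # JSON function chunks carry a `raw` delta
        if raw:
            full_json += raw

    print(json.loads(full_json))
```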
## Technical Notes
* Token usage information is only available in the final chunk with content (before the `[DONE]` message)
* Streaming may not be available with certain [inference-time optimizations](/gateway/guides/inference-time-optimizations/)
# Tool Use (Function Calling)
Source: https://www.tensorzero.com/docs/gateway/guides/tool-use
Learn how to use tool use (function calling) with TensorZero Gateway.
TensorZero has first-class support for tool use, a feature that allows LLMs to interact with external tools (e.g. APIs, databases, web browsers).
Tool use is available for most model providers supported by TensorZero.
See [Integrations](/integrations/model-providers/) for a list of supported model providers.
You can define a tool in your configuration file and attach it to a TensorZero function that should be allowed to call it.
Alternatively, you can define a tool dynamically at inference time.
The term "tool use" is also commonly referred to as "function calling" in the industry.
In TensorZero, the term "function" refers to TensorZero functions, so we'll stick to the "tool" terminology for external tools that the models can interact with and "function" for TensorZero functions.
You can also find a complete runnable example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/tool-use).
## Basic Usage
### Defining a tool in your configuration file
You can define a tool in your configuration file and attach it to the TensorZero functions that should be allowed to call it.
Only functions that are of type `chat` can call tools.
A tool definition has the following properties:
* `name`: The name of the tool.
* `description`: A description of the tool. The description helps models understand the tool's purpose and usage.
* `parameters`: The path to a file containing a JSON Schema for the tool's parameters.
Optionally, you can provide a `strict` property to enforce type checking for the tool's parameters.
This setting is only supported by some model providers, and will be ignored otherwise.
```toml title="tensorzero.toml" theme={null}
[tools.get_temperature]
description = "Get the current temperature for a given location."
parameters = "tools/get_temperature.json"
strict = true # optional, defaults to false
[functions.weather_chatbot]
type = "chat"
tools = ["get_temperature"]
# ...
```
If we wanted the `get_temperature` tool to take a mandatory `location` parameter and an optional `units` parameter, we could use the following JSON Schema:
```json title="tools/get_temperature.json" theme={null}
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"description": "Get the current temperature for a given location.",
"properties": {
"location": {
"type": "string",
"description": "The location to get the temperature for (e.g. \"New York\")"
},
"units": {
"type": "string",
"description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\"). Defaults to \"fahrenheit\".",
"enum": ["fahrenheit", "celsius"]
}
},
"required": ["location"],
"additionalProperties": false
}
```
See "Advanced Usage" below for information on how to define a tool dynamically at inference time.
### Making inference requests with tools
Once you've defined a tool and attached it to a TensorZero function, you don't need to change anything in your inference request to enable tool use.
By default, the function will determine whether to use a tool and the arguments to pass to the tool.
If the function decides to use tools, it will return one or more `tool_call` content blocks in the response.
For multi-turn conversations supporting tool use, you can provide tool results in subsequent inference requests with a `tool_result` content block.
You can also find a complete runnable example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/tool-use).
```python theme={null}
from tensorzero import TensorZeroGateway, ToolCall # or AsyncTensorZeroGateway
with TensorZeroGateway.build_http(
gateway_url="http://localhost:3000",
) as t0:
messages = [{"role": "user", "content": "What is the weather in Tokyo (°F)?"}]
response = t0.inference(
function_name="weather_chatbot",
input={"messages": messages},
)
print(response)
# The model can return multiple content blocks, including tool calls
# In a real application, you'd be stricter about validating the response
tool_calls = [
content_block
for content_block in response.content
if isinstance(content_block, ToolCall)
]
assert len(tool_calls) == 1, "Expected the model to return exactly one tool call"
# Add the tool call to the message history
messages.append(
{
"role": "assistant",
"content": response.content,
}
)
# Pretend we've called the tool and got a response
messages.append(
{
"role": "user",
"content": [
{
"type": "tool_result",
"id": tool_calls[0].id,
"name": tool_calls[0].name,
"result": "70", # imagine it's 70°F in Tokyo
}
],
}
)
response = t0.inference(
function_name="weather_chatbot",
input={"messages": messages},
)
print(response)
```
```python theme={null}
from openai import OpenAI # or AsyncOpenAI
client = OpenAI(
base_url="http://localhost:3000/openai/v1",
)
messages = [{"role": "user", "content": "What is the weather in Tokyo (°F)?"}]
response = client.chat.completions.create(
model="tensorzero::function_name::weather_chatbot",
messages=messages,
)
print(response)
# The model can return multiple content blocks, including tool calls
# In a real application, you'd be stricter about validating the response
tool_calls = response.choices[0].message.tool_calls
assert len(tool_calls) == 1, "Expected the model to return exactly one tool call"
# Add the tool call to the message history
messages.append(response.choices[0].message)
# Pretend we've called the tool and got a response
messages.append(
{
"role": "tool",
"tool_call_id": tool_calls[0].id,
"content": "70", # imagine it's 70°F in Tokyo
}
)
response = client.chat.completions.create(
model="tensorzero::function_name::weather_chatbot",
messages=messages,
)
print(response)
```
```typescript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:3000/openai/v1",
});
const messages: any[] = [
{ role: "user", content: "What is the weather in Tokyo (°F)?" },
];
const response = await client.chat.completions.create({
model: "tensorzero::function_name::weather_chatbot",
messages,
});
console.log(JSON.stringify(response, null, 2));
// The model can return multiple content blocks, including tool calls
// In a real application, you'd be stricter about validating the response
const toolCalls = response.choices[0].message.tool_calls;
if (!toolCalls || toolCalls.length !== 1) {
throw new Error("Expected the model to return exactly one tool call");
}
// Add the tool call to the message history
messages.push(response.choices[0].message);
// Pretend we've called the tool and got a response
messages.push({
role: "tool",
tool_call_id: toolCalls[0].id,
content: "70", // imagine it's 70°F in Tokyo
});
const response2 = await client.chat.completions.create({
model: "tensorzero::function_name::weather_chatbot",
messages,
});
console.log(JSON.stringify(response2, null, 2));
```
```bash theme={null}
#!/bin/bash
curl http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "weather_chatbot",
"input": {"messages": [{"role": "user", "content": "What is the weather in Tokyo (°F)?"}]}
}'
echo
curl http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "weather_chatbot",
"input": {
"messages": [
{
"role": "user",
"content": "What is the weather in Tokyo (°F)?"
},
{
"role": "assistant",
"content": [
{
"type": "tool_call",
"id": "123",
"name": "get_temperature",
"arguments": {
"location": "Tokyo"
}
}
]
},
{
"role": "user",
"content": [
{
"type": "tool_result",
"id": "123",
"name": "get_temperature",
"result": "70"
}
]
}
]
}
}'
```
See "Advanced Usage" below for information on how to customize the tool calling behavior (e.g. making tool calls mandatory).
## Advanced Usage
### Restricting allowed tools at inference time
You can restrict the set of tools that can be called at inference time by using the `allowed_tools` parameter.
The names should be the configuration keys (e.g. `foo` from `[tools.foo]`), not the display names shown to the LLM (e.g. `bar` from `tools.foo.name = "bar"`).
For example, suppose your TensorZero function has access to several tools, but you only want to allow the `get_temperature` tool to be called during a particular inference.
You can achieve this by setting `allowed_tools=["get_temperature"]` in your inference request.
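As a sketch, this might look like the following with the TensorZero Python client, reusing the `weather_chatbot` function from above:
```python theme={null}
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as t0:
    response = t0.inference(
        function_name="weather_chatbot",
        input={
            "messages": [
                {"role": "user", "content": "What is the weather in Tokyo (°F)?"}
            ]
        },
        # Only allow the `get_temperature` tool for this inference
        # (use the configuration key, not the display name).
        allowed_tools=["get_temperature"],
    )
    print(response)
```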
### Defining tools dynamically at inference time
You can define tools dynamically at inference time by using the `additional_tools` property.
(In the OpenAI-compatible API, you can use the `tools` property instead.)
You should only use dynamic tools if your use case requires it.
Otherwise, it's recommended to define tools in the configuration file.
The `additional_tools` field accepts a list of objects with the same structure as the tools defined in the configuration file, except that the `parameters` field should contain the JSON Schema itself (rather than a path to a file with the schema).
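For example, here's a sketch of a dynamically defined tool with the TensorZero Python client. The `get_humidity` tool is hypothetical, and `parameters` holds the JSON Schema directly.
```python theme={null}
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as t0:
    response = t0.inference(
        model_name="openai::gpt-4o-mini",
        input={
            "messages": [
                {"role": "user", "content": "What is the humidity in Tokyo?"}
            ]
        },
        # Hypothetical tool defined dynamically at inference time.
        additional_tools=[
            {
                "name": "get_humidity",
                "description": "Get the current humidity (%) for a given location.",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                    "additionalProperties": False,
                },
            }
        ],
    )
    print(response)
```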
### Customizing the tool calling strategy
You can control how and when tools are called by using the `tool_choice` parameter.
The supported tool choice strategies are:
* `none`: The function should not use any tools.
* `auto`: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
* `required`: The model should use a tool. If multiple tools are available, the model decides which tool to use.
* `{ specific = "tool_name" }`: The model should use a specific tool. The tool must be defined in the `tools` section of the configuration file or provided in `additional_tools`.
The `tool_choice` parameter can be set either in your configuration file or directly in your inference request.
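For example, here's a sketch that forces a specific tool at inference time with the TensorZero Python client, assuming the JSON form of the `{ specific = "tool_name" }` strategy shown above and reusing the `weather_chatbot` function and `get_temperature` tool:
```python theme={null}
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as t0:
    response = t0.inference(
        function_name="weather_chatbot",
        input={
            "messages": [
                {"role": "user", "content": "What is the weather in Tokyo (°F)?"}
            ]
        },
        # Force the model to call the `get_temperature` tool for this inference.
        tool_choice={"specific": "get_temperature"},
    )
    print(response)
```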
### Calling multiple tools in parallel
You can enable parallel tool calling by setting the `parallel_tool_calls` parameter to `true`.
If enabled, the models will be able to request multiple tool calls in a single inference request (conversation turn).
You can specify `parallel_tool_calls` in the configuration file or in the inference request.
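For example, here's a sketch that enables parallel tool calls at inference time, reusing the `weather_chatbot` function from above:
```python theme={null}
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as t0:
    response = t0.inference(
        function_name="weather_chatbot",
        input={
            "messages": [
                {"role": "user", "content": "What is the weather in Tokyo and Osaka (°F)?"}
            ]
        },
        # Allow the model to request multiple tool calls in a single turn.
        parallel_tool_calls=True,
    )
    print(response)
```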
### Integrating with Model Context Protocol (MCP) servers
You can use TensorZero with tools offered by Model Context Protocol (MCP) servers with the functionality described above.
See our [MCP (Model Context Protocol) Example on GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/mcp-model-context-protocol) to learn how to integrate TensorZero with an MCP server.
### Using Built-in Provider Tools
Some model providers offer built-in tools that run server-side on the provider's infrastructure.
For example, OpenAI's Responses API provides a `web_search` tool that enables models to search the web for information.
TensorZero currently only supports built-in provider tools from the OpenAI Responses API.
You can configure provider tools in your model provider configuration:
```toml title="tensorzero.toml" theme={null}
[models.gpt-5-mini-responses.providers.openai]
type = "openai"
model_name = "gpt-5-mini"
api_type = "responses"
provider_tools = [{type = "web_search"}]
```
You can also provide them dynamically at inference time via the `provider_tools` parameter.
See [How to call the OpenAI Responses API](/gateway/call-the-openai-responses-api) for a complete guide on using provider tools like web search.
### Using OpenAI Custom Tools
OpenAI offers custom tools that support alternative output formats beyond JSON Schema, such as freeform text or grammar-constrained output (using Lark or regex syntax).
OpenAI custom tools are only supported by OpenAI models.
Using custom tools with other providers will result in an error.
Custom tools are passed dynamically at inference time via `additional_tools` with `type: "openai_custom"`:
```json theme={null}
{
"model_name": "openai::gpt-5-mini",
"input": {
"messages": [
{
"role": "user",
"content": "Generate Python code to print 'Hello, World!'"
}
]
},
"additional_tools": [
{
"type": "openai_custom",
"name": "code_generator",
"description": "Generates Python code snippets",
"format": { "type": "text" }
}
]
}
```
See the [API Reference](/gateway/api-reference/inference/#additional_tools) for full documentation on custom tool formats including grammar-based constraints.
# Overview
Source: https://www.tensorzero.com/docs/gateway/index
The TensorZero Gateway is a high-performance model gateway that provides a unified interface for all your LLM applications.
* **One API for All LLMs.**
The gateway provides a unified interface for all major LLM providers, allowing for seamless cross-platform integration and fallbacks.
TensorZero natively supports
[Anthropic](/integrations/model-providers/anthropic/),
[AWS Bedrock](/integrations/model-providers/aws-bedrock/),
[AWS SageMaker](/integrations/model-providers/aws-sagemaker/),
[Azure](/integrations/model-providers/azure/),
[Fireworks](/integrations/model-providers/fireworks/),
[GCP Vertex AI Anthropic](/integrations/model-providers/gcp-vertex-ai-anthropic/),
[GCP Vertex AI Gemini](/integrations/model-providers/gcp-vertex-ai-gemini/),
[Google AI Studio (Gemini API)](/integrations/model-providers/google-ai-studio-gemini/),
[Groq](/integrations/model-providers/groq/),
[Hyperbolic](/integrations/model-providers/hyperbolic/),
[Mistral](/integrations/model-providers/mistral/),
[OpenAI](/integrations/model-providers/openai/),
[OpenRouter](/integrations/model-providers/openrouter/),
[Together](/integrations/model-providers/together/),
[vLLM](/integrations/model-providers/vllm/), and
[xAI](/integrations/model-providers/xai/).
Need something else?
Your provider is most likely supported because TensorZero integrates with [any OpenAI-compatible API (e.g. Ollama)](/integrations/model-providers/openai-compatible/).
Still not supported?
Open an issue on [GitHub](https://github.com/tensorzero/tensorzero/issues) and we'll integrate it!
Learn more in our [How to call any LLM](/gateway/call-any-llm) guide.
* **Blazing Fast.**
The gateway (written in Rust 🦀) achieves \<1ms P99 latency overhead under extreme load.
In [benchmarks](/gateway/benchmarks/), LiteLLM @ 100 QPS adds 25-100x+ more latency than our gateway @ 10,000 QPS.
* **Structured Inferences.**
The gateway enforces schemas for inputs and outputs, ensuring robustness for your application.
Structured inference data is later used for powerful optimization recipes (e.g. swapping historical prompts before fine-tuning).
Learn more about [creating prompt templates](/gateway/create-a-prompt-template).
* **Multi-Step LLM Workflows.**
The gateway provides first-class support for complex multi-step LLM workflows by associating multiple inferences with an episode.
Feedback can be assigned at the inference or episode level, allowing for end-to-end optimization of compound LLM systems.
Learn more about [episodes](/gateway/guides/episodes/).
* **Built-in Observability.**
The gateway collects structured inference traces along with associated downstream metrics and natural-language feedback.
Everything is stored in a ClickHouse database for real-time, scalable, and developer-friendly analytics.
[TensorZero Recipes](/recipes/) leverage this dataset to optimize your LLMs.
* **Built-in Experimentation.**
The gateway automatically routes traffic between variants to enable A/B tests.
It ensures consistent variants within an episode in multi-step workflows.
Learn more about [adaptive A/B tests](/experimentation/run-adaptive-ab-tests).
* **Built-in Fallbacks.**
The gateway automatically falls back to different inference providers, or even completely different variants, when an inference fails.
This ensures misconfiguration, provider downtime, and other edge cases don't affect your availability.
* **Access Controls.**
The gateway supports TensorZero API key authentication, allowing you to control access to your TensorZero deployment.
Create and manage custom API keys for different clients or services.
Learn more about [setting up auth for TensorZero](/operations/set-up-auth-for-tensorzero).
* **GitOps Orchestration.**
Orchestrate prompts, models, parameters, tools, experiments, and more with GitOps-friendly configuration.
Manage a few LLMs manually with human-friendly readable configuration files, or thousands of prompts and LLMs entirely programmatically.
## Next Steps
* Make your first TensorZero API call with built-in observability and fine-tuning in under 5 minutes.
* Quickly deploy locally, or set up high-availability services for production environments.
* The TensorZero Gateway integrates with the major LLM providers.
* The TensorZero Gateway achieves sub-millisecond latency overhead under extreme load.
* The TensorZero Gateway provides a unified interface for making inference and feedback API calls.
* Easily manage your LLM applications with GitOps orchestration, even for complex multi-step systems.
# Overview
Source: https://www.tensorzero.com/docs/index
TensorZero is an open-source stack for industrial-grade LLM applications that unifies an LLM gateway, observability, optimization, evaluation, and experimentation.
**TensorZero is an open-source stack for industrial-grade LLM applications:**
* **Gateway:** access every LLM provider through a unified API, built for performance (\<1ms p99 latency)
* **Observability:** store inferences and feedback in your database, available programmatically or in the UI
* **Optimization:** collect metrics and human feedback to optimize prompts, models, and inference strategies
* **Evaluations:** benchmark individual inferences or end-to-end workflows using heuristics, LLM judges, etc.
* **Experimentation:** ship with confidence with built-in A/B testing, routing, fallbacks, retries, etc.
Take what you need, adopt incrementally, and complement with other tools.
**Start building today.**
The [Quickstart](/quickstart/) shows it's easy to set up an LLM application with TensorZero.
**Questions?**
Ask us on [Slack](https://www.tensorzero.com/slack) or [Discord](https://www.tensorzero.com/discord).
**Using TensorZero at work?**
Email us at [hello@tensorzero.com](mailto:hello@tensorzero.com) to set up a Slack or Teams channel with your team (free).
# Getting Started with Anthropic
Source: https://www.tensorzero.com/docs/integrations/model-providers/anthropic
Learn how to use TensorZero with Anthropic LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with the Anthropic API.
## Simple Setup
You can use the short-hand `anthropic::model_name` to use an Anthropic model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use Anthropic models in your TensorZero variants by setting the `model` field to `anthropic::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "anthropic::claude-haiku-4-5"
```
Additionally, you can set `model_name` in the inference request to use a specific Anthropic model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "anthropic::claude-haiku-4-5",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and Anthropic provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/anthropic).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.claude_haiku_4_5]
routing = ["anthropic"]
[models.claude_haiku_4_5.providers.anthropic]
type = "anthropic"
model_name = "claude-haiku-4-5"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "claude_haiku_4_5"
```
See the [list of models available on Anthropic](https://docs.anthropic.com/en/docs/about-claude/models).
### Credentials
You must set the `ANTHROPIC_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:?Environment variable ANTHROPIC_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Enable Anthropic's prompt caching capability
You can enable [Anthropic's prompt caching capability](https://platform.claude.com/docs/en/build-with-claude/prompt-caching) with TensorZero's `extra_body`.
For example, to enable caching on your system prompt:
```bash {14-19} theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"model_name": "anthropic::claude-haiku-4-5",
"input": {
"system": "... very long prompt ...",
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
},
"extra_body": [
{
"pointer": "/system/0/cache_control",
"value": {"type": "ephemeral"}
}
]
}'
```
Similarly, to enable caching on a message:
```bash {16} theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"model_name": "anthropic::claude-haiku-4-5",
"input": {
"system": "... very long prompt ...",
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
},
"extra_body": [
{
"pointer": "/messages/0/content/0/cache_control",
"value": {"type": "ephemeral"}
}
]
}'
```
You can specify `extra_body` in the [configuration](/gateway/configuration-reference) or at [inference time](/gateway/api-reference/inference).
If you're using the OpenAI-Compatible Inference API, use `tensorzero::extra_body` instead.
You can retrieve prompt caching usage information with `include_raw_usage`.
See the [API Reference](/gateway/api-reference/inference) for more information.
## Use Anthropic models on third-party platforms
### Use Anthropic models on AWS Bedrock
You can use Anthropic models on AWS Bedrock with the `aws_bedrock` model provider.
```toml theme={null}
[models.claude_haiku_4_5.providers.aws]
type = "aws_bedrock"
model_id = "us.anthropic.claude-haiku-4-5-20251001-v1:0"
region = "us-east-1" # TODO: set your AWS region
```
Read more about the [AWS Bedrock model provider](/integrations/model-providers/aws-bedrock).
### Use Anthropic models on Azure
You can use Anthropic models on Azure AI Foundry by overriding the API base in your configuration:
```toml theme={null}
[models.claude_haiku_4_5.providers.azure]
type = "anthropic"
model_name = "claude-haiku-4-5"
api_base = "https://YOUR-RESOURCE-NAME.services.ai.azure.com/anthropic/v1/" # TODO: set your resource name
api_key_location = "env::AZURE_API_KEY" # optional
```
### Use Anthropic models on GCP Vertex AI
You can use Anthropic models on GCP Vertex AI with the `gcp_vertex_anthropic` model provider.
```toml theme={null}
[models.claude_haiku_4_5.providers.gcp]
type = "gcp_vertex_anthropic"
model_id = "claude-haiku-4-5@20251001"
location = "us-east5" # TODO: set your GCP region
project_id = "YOUR-PROJECT-ID" # TODO: set your GCP project ID
```
Read more about the [GCP Vertex AI Anthropic model provider](/integrations/model-providers/gcp-vertex-ai-anthropic).
## Other Features
See [Extend TensorZero](/operations/extend-tensorzero) for information about Anthropic Computer Use and other beta features.
# Getting Started with AWS Bedrock
Source: https://www.tensorzero.com/docs/integrations/model-providers/aws-bedrock
Learn how to use TensorZero with AWS Bedrock LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with the AWS Bedrock API.
## Setup
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/aws-bedrock).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.claude_haiku_4_5]
routing = ["aws_bedrock"]
[models.claude_haiku_4_5.providers.aws_bedrock]
type = "aws_bedrock"
model_id = "us.anthropic.claude-haiku-4-5-20251001-v1:0"
region = "us-east-1"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "claude_haiku_4_5"
```
See the [list of available models on AWS Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).
Many AWS Bedrock models are only available through cross-region inference profiles.
For those models, the `model_id` requires a special prefix (e.g. the `us.` prefix in `us.anthropic.claude-haiku-4-5-20251001-v1:0`).
See the [AWS documentation on inference profiles](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-profiles-support.html).
See the [Configuration Reference](/gateway/configuration-reference/) for optional fields (e.g. overriding the `region`).
### Credentials
You must make sure that the gateway has the necessary permissions to access AWS Bedrock.
The TensorZero Gateway will use the AWS SDK to retrieve the relevant credentials.
The simplest way is to set the following environment variables before running the gateway:
```bash theme={null}
AWS_ACCESS_KEY_ID=...
AWS_REGION=us-east-1
AWS_SECRET_ACCESS_KEY=...
```
Alternatively, you can use other authentication methods supported by the AWS SDK.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID:?Environment variable AWS_ACCESS_KEY_ID must be set.}
- AWS_REGION=${AWS_REGION:?Environment variable AWS_REGION must be set.}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY:?Environment variable AWS_SECRET_ACCESS_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Enable AWS Bedrock's prompt caching capability
You can enable [AWS Bedrock's prompt caching capability](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html) for supported models with TensorZero's `extra_body`.
For example, to enable caching on your system prompt:
```bash {14-21} theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"model_name": "...",
"input": {
"system": "... very long prompt ...",
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
},
"extra_body": [
{
"pointer": "/system/-",
"value": {
"cachePoint": {"type": "default"}
}
}
]
}'
```
The `/abc/-` notation appends a value to the `abc` array.
Similarly, to enable caching on a message:
```bash {16} theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"model_name": "...",
"input": {
"system": "... very long prompt ...",
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
},
"extra_body": [
{
"pointer": "/messages/0/content/-",
"value": {
"cachePoint": {"type": "default"}
}
}
]
}'
```
You can specify `extra_body` in the [configuration](/gateway/configuration-reference) or at [inference time](/gateway/api-reference/inference).
If you're using the OpenAI-Compatible Inference API, use `tensorzero::extra_body` instead.
You can retrieve prompt caching usage information with `include_raw_usage`.
See the [API Reference](/gateway/api-reference/inference) for more information.
## Other Features
See [Extend TensorZero](/operations/extend-tensorzero) for information about Anthropic Computer Use and other beta features.
TensorZero integrates with AWS Bedrock's Converse API.
To use `extra_body` with AWS Bedrock, the JSON Pointer should match the Converse API specification.
# Getting Started with AWS SageMaker
Source: https://www.tensorzero.com/docs/integrations/model-providers/aws-sagemaker
Learn how to use TensorZero with AWS SageMaker LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with the AWS SageMaker API.
The AWS SageMaker model provider is a wrapper around other TensorZero model providers that handles AWS SageMaker-specific logic (e.g. auth).
For example, you can use it to infer self-hosted model providers like Ollama and TGI deployed on AWS SageMaker.
## Setup
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/aws-sagemaker).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
You'll also need to deploy a SageMaker endpoint for your LLM model.
For this example, we're using a container running Ollama.
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.gemma_3]
routing = ["aws_sagemaker"]
[models.gemma_3.providers.aws_sagemaker]
type = "aws_sagemaker"
model_name = "gemma3:1b"
endpoint_name = "my-sagemaker-endpoint"
region = "us-east-1"
# ... or use `region = "sdk"` to auto-detect region with the AWS SDK
hosted_provider = "openai" # Ollama is OpenAI-compatible
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "gemma_3"
```
The `hosted_provider` field specifies the model provider that you deployed on AWS SageMaker.
For example, Ollama is OpenAI-compatible, so we use `openai` as the hosted provider.
Alternatively, you can use `hosted_provider = "tgi"` if you had deployed TGI instead.
You can specify the endpoint's `region` explicitly, or use `region = "sdk"` to auto-detect region with the AWS SDK.
If you're using AWS China regions (`cn-north-1`, `cn-northwest-1`) or AWS GovCloud, you must also specify the `endpoint_url` field since these partitions use different DNS suffixes.
For example: `endpoint_url = "https://runtime.sagemaker.cn-north-1.amazonaws.com.cn"`
See the [Configuration Reference](/gateway/configuration-reference/) for optional fields.
The relevant fields will depend on the `hosted_provider`.
### Credentials
You must make sure that the gateway has the necessary permissions to access AWS SageMaker.
The TensorZero Gateway will use the AWS SDK to retrieve the relevant credentials.
The simplest way is to set the following environment variables before running the gateway:
```bash theme={null}
AWS_ACCESS_KEY_ID=...
AWS_REGION=us-east-1
AWS_SECRET_ACCESS_KEY=...
```
Alternatively, you can use other authentication methods supported by the AWS SDK.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID:?Environment variable AWS_ACCESS_KEY_ID must be set.}
- AWS_REGION=${AWS_REGION:?Environment variable AWS_REGION must be set.}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY:?Environment variable AWS_SECRET_ACCESS_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Getting Started with Azure
Source: https://www.tensorzero.com/docs/integrations/model-providers/azure
Learn how to use TensorZero with Azure LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
TensorZero's `azure` provider supports both **Azure OpenAI Service** and **Azure AI Foundry**. Both use the same OpenAI-compatible API, so configuration is nearly identical—just use different endpoint URLs.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with Azure.
## Azure
### Setup
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/azure).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.gpt_4o_mini_2024_07_18]
routing = ["azure"]
[models.gpt_4o_mini_2024_07_18.providers.azure]
type = "azure"
deployment_id = "gpt4o-mini-20240718"
endpoint = "https://your-azure-openai-endpoint.openai.azure.com"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "gpt_4o_mini_2024_07_18"
```
See the [list of models available on Azure OpenAI Service](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models).
If you need to configure the endpoint at runtime, you can set it to `endpoint = "env::AZURE_OPENAI_ENDPOINT"` to read from the environment variable `AZURE_OPENAI_ENDPOINT` on startup or `endpoint = "dynamic::azure_openai_endpoint"` to read from a dynamic credential `azure_openai_endpoint` on each inference.
### Credentials
You must set the `AZURE_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- AZURE_API_KEY=${AZURE_API_KEY:?Environment variable AZURE_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
### Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
### Other Features
#### Generate embeddings
The Azure model provider supports generating embeddings.
You can find a [complete code example on GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/embeddings/providers/azure).
#### Azure AI Foundry
Azure AI Foundry provides access to models from multiple providers (Meta Llama, Mistral, xAI Grok, Microsoft Phi, Cohere, and more). See the [list of available models](https://ai.azure.com/explore/models).
The same `azure` provider works with Azure AI Foundry.
The key difference is the endpoint URL.
All other configuration options (credentials, Docker Compose, inference) work the same as Azure above.
## Call the OpenAI Responses API with Azure
You can call the OpenAI Responses API with Azure by setting `api_base` in your configuration to your Azure deployment URL.
```toml theme={null}
[models.azure-gpt-5-mini-responses]
routing = ["azure"]
[models.azure-gpt-5-mini-responses.providers.azure]
type = "openai" # CAREFUL: not `azure`!
api_base = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/" # TODO: Insert your API base URL here
api_key_location = "env::AZURE_API_KEY"
model_name = "gpt-5-mini"
api_type = "responses"
```
The `azure` model provider does not support the Responses API.
You must use the `openai` provider with a custom `api_base` instead.
# Getting Started with DeepSeek
Source: https://www.tensorzero.com/docs/integrations/model-providers/deepseek
Learn how to use TensorZero with DeepSeek LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with DeepSeek.
## Simple Setup
You can use the short-hand `deepseek::model_name` to use a DeepSeek model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use DeepSeek models in your TensorZero variants by setting the `model` field to `deepseek::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "deepseek::deepseek-chat"
```
Additionally, you can set `model_name` in the inference request to use a specific DeepSeek model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "deepseek::deepseek-chat",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and DeepSeek provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/deepseek).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.deepseek_chat]
routing = ["deepseek"]
[models.deepseek_chat.providers.deepseek]
type = "deepseek"
model_name = "deepseek-chat"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "deepseek_chat"
```
We have tested our integration with `deepseek-chat` (`DeepSeek-v3`) and `deepseek-reasoner` (`R1`).
DeepSeek only supports JSON mode for `deepseek-chat` and neither model supports tool use yet.
We include `thought` content blocks in the response and data model for reasoning models like `deepseek-reasoner`.
### Credentials
You must set the `DEEPSEEK_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY:?Environment variable DEEPSEEK_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Getting Started with Fireworks AI
Source: https://www.tensorzero.com/docs/integrations/model-providers/fireworks
Learn how to use TensorZero with Fireworks AI LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with Fireworks.
## Simple Setup
You can use the short-hand `fireworks::model_name` to use a Fireworks model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use Fireworks models in your TensorZero variants by setting the `model` field to `fireworks::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "fireworks::accounts/fireworks/models/llama-v3p3-70b-instruct"
```
Additionally, you can set `model_name` in the inference request to use a specific Fireworks model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "fireworks::accounts/fireworks/models/llama-v3p3-70b-instruct",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and Fireworks provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/fireworks).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.llama3_3_70b_instruct]
routing = ["fireworks"]
[models.llama3_3_70b_instruct.providers.fireworks]
type = "fireworks"
model_name = "accounts/fireworks/models/llama-v3p3-70b-instruct"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "llama3_3_70b_instruct"
```
See the [list of models available on Fireworks](https://fireworks.ai/models).
Custom models are also supported.
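For example, a hypothetical custom model deployed under your own Fireworks account might look like this (the account and model IDs are placeholders):
```toml theme={null}
[models.my_custom_model]
routing = ["fireworks"]
[models.my_custom_model.providers.fireworks]
type = "fireworks"
model_name = "accounts/YOUR_ACCOUNT_ID/models/YOUR_MODEL_ID" # TODO: your custom model path
```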
### Credentials
You must set the `FIREWORKS_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- FIREWORKS_API_KEY=${FIREWORKS_API_KEY:?Environment variable FIREWORKS_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Getting Started with GCP Vertex AI Anthropic
Source: https://www.tensorzero.com/docs/integrations/model-providers/gcp-vertex-ai-anthropic
Learn how to use TensorZero with GCP Vertex AI Anthropic LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with GCP Vertex AI Anthropic.
## Setup
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/gcp-vertex-ai-anthropic).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.claude_haiku_4_5]
routing = ["gcp_vertex_anthropic"]
[models.claude_haiku_4_5.providers.gcp_vertex_anthropic]
type = "gcp_vertex_anthropic"
model_id = "claude-haiku-4-5@20251001" # or endpoint_id = "..." for fine-tuned models and custom endpoints
location = "us-east5"
project_id = "your-project-id" # change this
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "claude_haiku_4_5"
```
See the [list of models available on GCP Vertex AI Anthropic](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude).
Alternatively, you can use the short-hand `gcp_vertex_anthropic::model_name` to use a GCP Vertex AI Anthropic model with TensorZero if you don't need advanced features like fallbacks or custom credentials:
* `gcp_vertex_anthropic::projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL_ID`
* `gcp_vertex_anthropic::projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID`
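For example, here's a sketch of a variant that uses the shorthand with the placeholder values from the configuration above:
```toml theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "gcp_vertex_anthropic::projects/your-project-id/locations/us-east5/publishers/anthropic/models/claude-haiku-4-5@20251001"
```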
### Credentials
By default, TensorZero reads the path to your GCP service account JSON file from the `GCP_VERTEX_CREDENTIALS_PATH` environment variable (using `path_from_env::GCP_VERTEX_CREDENTIALS_PATH`).
You must generate a GCP service account key in JSON format as described [here](https://cloud.google.com/docs/authentication/provide-credentials-adc#service-account).
You can customize the credential location using:
* `sdk`: use the Google Cloud SDK to auto-discover credentials
* `path::/path/to/credentials.json`: use a specific file path
* `path_from_env::YOUR_ENVIRONMENT_VARIABLE`: read file path from an environment variable (default behavior)
* `dynamic::ARGUMENT_NAME`: provide credentials dynamically at inference time
* `{ default = ..., fallback = ... }`: configure credential fallbacks
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
- ${GCP_VERTEX_CREDENTIALS_PATH:-/dev/null}:/app/gcp-credentials.json:ro
command: --config-file /app/config/tensorzero.toml
environment:
- GCP_VERTEX_CREDENTIALS_PATH=${GCP_VERTEX_CREDENTIALS_PATH:+/app/gcp-credentials.json}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Getting Started with GCP Vertex AI Gemini
Source: https://www.tensorzero.com/docs/integrations/model-providers/gcp-vertex-ai-gemini
Learn how to use TensorZero with GCP Vertex AI Gemini LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with GCP Vertex AI Gemini.
## Setup
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/gcp-vertex-ai-gemini).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.gemini_2_5_flash]
routing = ["gcp_vertex_gemini"]
[models.gemini_2_5_flash.providers.gcp_vertex_gemini]
type = "gcp_vertex_gemini"
model_id = "gemini-2.5-flash" # or endpoint_id = "..." for fine-tuned models and custom endpoints
location = "us-central1"
project_id = "your-project-id" # change this
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "gemini_2_5_flash"
```
See the [list of models available on GCP Vertex AI Gemini](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions).
Alternatively, you can use the short-hand `gcp_vertex_gemini::model_name` to use a GCP Vertex AI Gemini model with TensorZero if you don't need advanced features like fallbacks or custom credentials:
* `gcp_vertex_gemini::projects/PROJECT_ID/locations/LOCATION/publishers/google/models/MODEL_ID`
* `gcp_vertex_gemini::projects/PROJECT_ID/locations/LOCATION/endpoints/ENDPOINT_ID`
### Credentials
By default, TensorZero reads the path to your GCP service account JSON file from the `GCP_VERTEX_CREDENTIALS_PATH` environment variable (using `path_from_env::GCP_VERTEX_CREDENTIALS_PATH`).
You must generate a GCP service account key in JSON format as described [here](https://cloud.google.com/docs/authentication/provide-credentials-adc#service-account).
You can customize the credential location using:
* `sdk`: use the Google Cloud SDK to auto-discover credentials
* `path::/path/to/credentials.json`: use a specific file path
* `path_from_env::YOUR_ENVIRONMENT_VARIABLE`: read file path from an environment variable (default behavior)
* `dynamic::ARGUMENT_NAME`: provide credentials dynamically at inference time
* `{ default = ..., fallback = ... }`: configure credential fallbacks
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
- ${GCP_VERTEX_CREDENTIALS_PATH:-/dev/null}:/app/gcp-credentials.json:ro
command: --config-file /app/config/tensorzero.toml
environment:
- GCP_VERTEX_CREDENTIALS_PATH=${GCP_VERTEX_CREDENTIALS_PATH:+/app/gcp-credentials.json}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Thinking Parameters
Gemini supports two thinking parameters:
* `reasoning_effort` maps to `thinkingConfig.thinkingLevel`
* `thinking_budget_tokens` maps to `thinkingConfig.thinkingBudget` (legacy)
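For example, here's a hedged sketch of a variant that sets these parameters, assuming they're accepted as `chat_completion` variant parameters (see the [Configuration Reference](/gateway/configuration-reference/) for the exact placement and accepted values):
```toml theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "gemini_2_5_flash"
reasoning_effort = "low"        # forwarded to Gemini as thinkingConfig.thinkingLevel
# thinking_budget_tokens = 1024 # legacy alternative, forwarded as thinkingConfig.thinkingBudget
```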
# Getting Started with Google AI Studio (Gemini API)
Source: https://www.tensorzero.com/docs/integrations/model-providers/google-ai-studio-gemini
Learn how to use TensorZero with Google AI Studio LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with Google AI Studio (Gemini API).
## Simple Setup
You can use the short-hand `google_ai_studio_gemini::model_name` to use a Google AI Studio (Gemini API) model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use Google AI Studio (Gemini API) models in your TensorZero variants by setting the `model` field to `google_ai_studio_gemini::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "google_ai_studio_gemini::gemini-2.5-flash-lite"
```
Additionally, you can set `model_name` in the inference request to use a specific Google AI Studio (Gemini API) model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "google_ai_studio_gemini::gemini-2.5-flash-lite",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and Google AI Studio (Gemini API) provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/google-ai-studio-gemini).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.gemini_2_5_flash_lite]
routing = ["google_ai_studio_gemini"]
[models.gemini_2_5_flash_lite.providers.google_ai_studio_gemini]
type = "google_ai_studio_gemini"
model_name = "gemini-2.5-flash-lite"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "gemini_2_5_flash_lite"
```
See the [list of models available on Google AI Studio (Gemini API)](https://ai.google.dev/gemini-api/docs/models/gemini).
### Credentials
You must set the `GOOGLE_AI_STUDIO_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- GOOGLE_AI_STUDIO_API_KEY=${GOOGLE_AI_STUDIO_API_KEY:?Environment variable GOOGLE_AI_STUDIO_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Thinking Parameters
Gemini supports two thinking parameters:
* `reasoning_effort` maps to `thinkingConfig.thinkingLevel`
* `thinking_budget_tokens` maps to `thinkingConfig.thinkingBudget` (legacy)
# Getting Started with Groq
Source: https://www.tensorzero.com/docs/integrations/model-providers/groq
Learn how to use TensorZero with Groq LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with Groq.
## Simple Setup
You can use the short-hand `groq::model_name` to use a Groq model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use Groq models in your TensorZero variants by setting the `model` field to `groq::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "groq::meta-llama/llama-4-scout-17b-16e-instruct"
```
Additionally, you can set `model_name` in the inference request to use a specific Groq model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "groq::meta-llama/llama-4-scout-17b-16e-instruct",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and Groq provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/groq).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.llama4_scout_17b_16e_instruct]
routing = ["groq"]
[models.llama4_scout_17b_16e_instruct.providers.groq]
type = "groq"
model_name = "meta-llama/llama-4-scout-17b-16e-instruct"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "llama4_scout_17b_16e_instruct"
```
See the [list of models available on Groq](https://groq.com/pricing).
### Credentials
You must set the `GROQ_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- GROQ_API_KEY=${GROQ_API_KEY:?Environment variable GROQ_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Getting Started with Hyperbolic
Source: https://www.tensorzero.com/docs/integrations/model-providers/hyperbolic
Learn how to use TensorZero with Hyperbolic LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with the Hyperbolic API.
## Simple Setup
You can use the short-hand `hyperbolic::model_name` to use a Hyperbolic model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use Hyperbolic models in your TensorZero variants by setting the `model` field to `hyperbolic::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "hyperbolic::openai/gpt-oss-20b"
```
Additionally, you can set `model_name` in the inference request to use a specific Hyperbolic model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "hyperbolic::openai/gpt-oss-20b",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and Hyperbolic provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/hyperbolic).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models."openai/gpt-oss-20b"]
routing = ["hyperbolic"]
[models."openai/gpt-oss-20b".providers.hyperbolic]
type = "hyperbolic"
model_name = "openai/gpt-oss-20b"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "openai/gpt-oss-20b"
```
See the [list of models available on Hyperbolic](https://app.hyperbolic.xyz/models).
### Credentials
You must set the `HYPERBOLIC_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- HYPERBOLIC_API_KEY=${HYPERBOLIC_API_KEY:?Environment variable HYPERBOLIC_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Overview
Source: https://www.tensorzero.com/docs/integrations/model-providers/index
The TensorZero Gateway integrates with the major LLM providers.
The TensorZero Gateway integrates with the major LLM providers.
## Model Providers
| Provider | Chat Functions | JSON Functions | Streaming | Tool Use | Multimodal | Embeddings | Batch |
| ------------------------------------------------------------------------------------------------------------------------------------------- | :------------: | :------------: | :-------: | :------: | :--------: | :--------: | :---: |
| [Anthropic](/integrations/model-providers/anthropic/) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| [AWS Bedrock](/integrations/model-providers/aws-bedrock/) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| [AWS SageMaker](/integrations/model-providers/aws-sagemaker/) | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| [Azure](/integrations/model-providers/azure/) | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| [DeepSeek](/integrations/model-providers/deepseek/) | ✅ | ✅ | ⚠️ | ❌ | ❌ | ❌ | ❌ |
| [Fireworks AI](/integrations/model-providers/fireworks/) | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| [GCP Vertex AI Anthropic](/integrations/model-providers/gcp-vertex-ai-anthropic/) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| [GCP Vertex AI Gemini](/integrations/model-providers/gcp-vertex-ai-gemini/) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| [Google AI Studio Gemini](/integrations/model-providers/google-ai-studio-gemini/) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| [Groq](/integrations/model-providers/groq/) | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| [Hyperbolic](/integrations/model-providers/hyperbolic/) | ✅ | ⚠️ | ✅ | ❌ | ❌ | ❌ | ❌ |
| [Mistral](/integrations/model-providers/mistral/) | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| [OpenAI](/integrations/model-providers/openai/) and [OpenAI-Compatible](/integrations/model-providers/openai-compatible/) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [OpenRouter](/integrations/model-providers/openrouter/) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| [SGLang](/integrations/model-providers/sglang/) | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| [TGI](/integrations/model-providers/tgi/) | ✅ | ✅ | ⚠️ | ❌ | ❌ | ❌ | ❌ |
| [Together AI](/integrations/model-providers/together/) | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| [vLLM](/integrations/model-providers/vllm/) | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| [xAI](/integrations/model-providers/xai/) | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
### Limitations
The TensorZero Gateway makes a best effort to normalize configuration across providers.
For example, certain providers don't support `tool_choice: required`; in these cases, the TensorZero Gateway will coerce the request to `tool_choice: auto` under the hood.
Currently, Fireworks AI and OpenAI are the only providers that support `parallel_tool_calls`.
Additionally, the TensorZero Gateway supports `strict` (commonly referred to as Structured Outputs, Guided Decoding, or similar names) for Azure, GCP Vertex AI Gemini, Google AI Studio, OpenAI, Together AI, vLLM, and xAI.
You can also enable `strict` for Anthropic with `beta_structured_outputs = true`.
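For example, here's a minimal sketch of an Anthropic provider with strict mode enabled (the model names are placeholders; as noted below, the flag belongs in the provider configuration):
```toml theme={null}
[models.my_claude_model]
routing = ["anthropic"]
[models.my_claude_model.providers.anthropic]
type = "anthropic"
model_name = "claude-sonnet-4-5" # placeholder model name
beta_structured_outputs = true   # opt in to strict mode (Structured Outputs)
```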
Below are the known limitations for each supported model provider.
* **Anthropic**
* The Anthropic API doesn't support consecutive messages from the same role.
* The Anthropic API doesn't support `tool_choice: none`.
* The Anthropic API doesn't support `seed`.
* Structured Outputs (strict mode) requires enabling `beta_structured_outputs = true` in the provider configuration.
* **AWS Bedrock**
* The TensorZero Gateway currently doesn't support AWS Bedrock guardrails and traces.
* The TensorZero Gateway uses a non-standard structure for storing `ModelInference.raw_response` for AWS Bedrock inference requests.
* The AWS Bedrock API doesn't support `tool_choice: none`.
* The AWS Bedrock API doesn't support `seed`.
* **Azure**
* The Azure API doesn't provide usage information when streaming.
* The Azure API doesn't support `tool_choice: required`.
* **DeepSeek**
* The `deepseek-chat` model doesn't support tool use for production use cases.
* The `deepseek-reasoner` model doesn't support JSON mode or tool use.
* The TensorZero Gateway doesn't return `thought` blocks in the response (coming soon!).
* **Fireworks AI**
* The Fireworks API doesn't support `seed`.
* **GCP Vertex AI**
* The TensorZero Gateway currently only supports the Gemini and Anthropic models.
* The GCP Vertex AI API doesn't support `tool_choice: required` for Gemini Flash models.
* The Anthropic models have the same limitations as those listed under the Anthropic provider.
* **Hyperbolic**
* The Hyperbolic provider doesn't support JSON mode or tool use. JSON functions are supported with `json_mode = "off"` (not recommended).
* **Mistral**
* The Mistral API doesn't support `seed`.
* **SGLang**
  * The SGLang provider doesn't support tool use.
* **TGI**
* The TGI API doesn't support streaming JSON mode.
  * Tool use support is very limited, so we don't recommend using it.
* **Together AI**
* The Together AI API doesn't seem to respect `tool_choice` in many cases.
* **xAI**
* The xAI provider doesn't support JSON mode. JSON functions are supported with `json_mode = "tool"` (recommended) or `json_mode = "off"`.
* The xAI API has issues with multi-turn tool use ([bug report](https://gist.github.com/GabrielBianconi/47a4247cfd8b6689e7228f654806272d)).
* The xAI API has issues with `tool_choice: none` ([bug report](https://gist.github.com/GabrielBianconi/2199022d0ea8518e06d366fb613c5bb5)).
# Getting Started with Mistral
Source: https://www.tensorzero.com/docs/integrations/model-providers/mistral
Learn how to use TensorZero with Mistral LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with the Mistral API.
## Simple Setup
You can use the short-hand `mistral::model_name` to use a Mistral model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use Mistral models in your TensorZero variants by setting the `model` field to `mistral::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "mistral::ministral-8b-2410"
```
Additionally, you can set `model_name` in the inference request to use a specific Mistral model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "mistral::ministral-8b-2410",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and Mistral provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/mistral).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.ministral_8b_2410]
routing = ["mistral"]
[models.ministral_8b_2410.providers.mistral]
type = "mistral"
model_name = "ministral-8b-2410"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "ministral_8b_2410"
```
See the [list of models available on Mistral](https://docs.mistral.ai/getting-started/models).
### Credentials
You must set the `MISTRAL_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- MISTRAL_API_KEY=${MISTRAL_API_KEY:?Environment variable MISTRAL_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Getting Started with OpenAI
Source: https://www.tensorzero.com/docs/integrations/model-providers/openai
Learn how to use TensorZero with OpenAI LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with the OpenAI API.
## Simple Setup
You can use the short-hand `openai::model_name` to use an OpenAI model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
### Chat Completions API
You can use OpenAI models in your TensorZero variants by setting the `model` field to `openai::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "openai::gpt-4o-mini-2024-07-18"
```
Additionally, you can set `model_name` in the inference request to use a specific OpenAI model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "openai::gpt-4o-mini-2024-07-18",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
### Responses API
For models that use the OpenAI Responses API (like `gpt-5`), use the `openai::responses::model_name` shorthand:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "openai::responses::gpt-5-codex"
```
You can also use `model_name` in inference requests:
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "openai::responses::gpt-5-codex",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
See the [OpenAI Responses API guide](/gateway/call-the-openai-responses-api/) for more details on using this API.
## Advanced Setup
For more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and OpenAI provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/openai).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.gpt_4o_mini_2024_07_18]
routing = ["openai"]
[models.gpt_4o_mini_2024_07_18.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "gpt_4o_mini_2024_07_18"
```
See the [list of models available on OpenAI](https://platform.openai.com/docs/models/).
See the [Configuration Reference](/gateway/configuration-reference/) for optional fields (e.g. overwriting `api_base`).
### Credentials
You must set the `OPENAI_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
Additionally, see the [OpenAI-Compatible](/integrations/model-providers/openai-compatible/) guide for more information on how to use other OpenAI-Compatible providers.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Other Features
### Generate embeddings
The OpenAI model provider supports generating embeddings.
You can find a [complete code example on GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/embeddings/providers/openai).
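For example, here's a minimal sketch of an embedding model definition, assuming the `embedding_models` configuration block (see the [Configuration Reference](/gateway/configuration-reference/) and the linked example for the exact fields):
```toml theme={null}
[embedding_models.text_embedding_3_small]
routing = ["openai"]
[embedding_models.text_embedding_3_small.providers.openai]
type = "openai"
model_name = "text-embedding-3-small"
```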
# Getting Started with OpenAI-Compatible Endpoints (e.g. Ollama)
Source: https://www.tensorzero.com/docs/integrations/model-providers/openai-compatible
Learn how to use TensorZero with OpenAI-compatible LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with OpenAI-compatible endpoints like Ollama.
## Setup
This guide assumes that you are running Ollama locally with `ollama serve` and that you've pulled the `llama3.1` model in advance (e.g. `ollama pull llama3.1`).
Make sure to update the `api_base` and `model_name` in the configuration below to match your OpenAI-compatible endpoint and model.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/openai-compatible).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.llama3_1]
routing = ["ollama"]
[models.llama3_1.providers.ollama]
type = "openai"
api_base = "http://host.docker.internal:11434/v1" # for Ollama running locally on the host
model_name = "llama3.1"
api_key_location = "none" # by default, Ollama requires no API key
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "llama3_1"
```
### Credentials
The `api_key_location` field in your model provider configuration specifies how to handle API key authentication:
* If your endpoint does not require an API key (e.g. Ollama by default):
```toml theme={null}
api_key_location = "none"
```
* If your endpoint requires an API key, you have two options:
1. Configure it in advance through an environment variable:
```toml theme={null}
api_key_location = "env::ENVIRONMENT_VARIABLE_NAME"
```
You'll need to set the environment variable before starting the gateway.
2. Provide it at inference time:
```toml theme={null}
api_key_location = "dynamic::ARGUMENT_NAME"
```
The API key can then be passed in the inference request.
See the [Credential Management](/operations/manage-credentials/) guide, the [Configuration Reference](/gateway/configuration-reference/), and the [API reference](/gateway/api-reference/inference-openai-compatible/) for more details.
In this example, Ollama is running locally without authentication, so we use `api_key_location = "none"`.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
# environment:
# - OLLAMA_API_KEY=${OLLAMA_API_KEY:?Environment variable OLLAMA_API_KEY must be set.} // not necessary for this example
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Other Features
### Generate embeddings
The OpenAI model provider supports generating embeddings.
You can find a [complete code example using Ollama on GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/embeddings/providers/openai-compatible-ollama).
# Getting Started with OpenRouter
Source: https://www.tensorzero.com/docs/integrations/model-providers/openrouter
Learn how to use TensorZero with OpenRouter LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with OpenRouter.
## Simple Setup
You can use the short-hand `openrouter::model_name` to use an OpenRouter model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use OpenRouter models in your TensorZero variants by setting the `model` field to `openrouter::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "openrouter::openai/gpt-4.1-mini"
```
Additionally, you can set `model_name` in the inference request to use a specific OpenRouter model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "openrouter::openai/gpt-4.1-mini",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and OpenRouter provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/openrouter).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.gpt_4_1_mini]
routing = ["openrouter"]
[models.gpt_4_1_mini.providers.openrouter]
type = "openrouter"
model_name = "openai/gpt-4.1-mini"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "gpt_4_1_mini"
```
See the [list of models available on OpenRouter](https://openrouter.ai/models).
### Credentials
You must set the `OPENROUTER_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- OPENROUTER_API_KEY=${OPENROUTER_API_KEY:?Environment variable OPENROUTER_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Other Features
### Generate embeddings
The OpenRouter model provider supports generating embeddings.
You can find a [complete code example on GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/embeddings/providers/openai).
# Getting Started with SGLang
Source: https://www.tensorzero.com/docs/integrations/model-providers/sglang
Learn how to use TensorZero with self-hosted SGLang LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with self-hosted LLMs using SGLang.
We're using Llama-3.1-8B-Instruct in this example, but you can use virtually any model supported by SGLang.
## Setup
This guide assumes that you are running SGLang locally with this command (see [SGLang's installation guide](https://docs.sglang.ai/get_started/install.html)):
```sh title="Run SGLang locally" theme={null}
# Notes:
# --shm-size sets the shared memory size, needed for loading large models and processing requests
# -v mounts the host's ~/.cache/huggingface directory to the container's /root/.cache/huggingface directory
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
Make sure to update the `api_base` in the configuration below to match your SGLang server.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/sglang).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.llama]
routing = ["sglang"]
[models.llama.providers.sglang]
type = "sglang"
api_base = "http://host.docker.internal:8080/v1/" # for SGLang running locally on the host
api_key_location = "none" # by default, SGLang requires no API key
model_name = "my-sglang-model"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "llama"
```
### Credentials
The `api_key_location` field in your model provider configuration specifies how to handle API key authentication:
* If your endpoint does not require an API key (e.g. SGLang by default):
```toml theme={null}
api_key_location = "none"
```
* If your endpoint requires an API key, you have two options:
1. Configure it in advance through an environment variable:
```toml theme={null}
api_key_location = "env::ENVIRONMENT_VARIABLE_NAME"
```
You'll need to set the environment variable before starting the gateway.
2. Provide it at inference time:
```toml theme={null}
api_key_location = "dynamic::ARGUMENT_NAME"
```
The API key can then be passed in the inference request.
See the [Credential Management](/operations/manage-credentials/) guide, the [Configuration Reference](/gateway/configuration-reference/), and the [API reference](/gateway/api-reference/inference/) for more details.
In this example, SGLang is running locally without authentication, so we use `api_key_location = "none"`.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
# environment:
# - SGLANG_API_KEY=${SGLANG_API_KEY:?Environment variable SGLANG_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Getting Started with Text Generation Inference (TGI)
Source: https://www.tensorzero.com/docs/integrations/model-providers/tgi
Learn how to use TensorZero with self-hosted HuggingFace TGI LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with self-hosted LLMs using Text Generation Inference (TGI).
We're using Phi-4 in this example, but you can use virtually any model supported by TGI.
## Setup
This guide assumes that you are running TGI locally with:
```sh title="Run TGI locally" theme={null}
# Notes:
# --shm-size sets the shared memory size, needed for loading large models and processing requests
# -p 8080:80 maps the host's port 8080 to the container's port 80
# -v mounts the host's './data' directory to the container's '/data' directory
docker run \
  --gpus all \
  --shm-size 64g \
  -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id microsoft/phi-4
```
Make sure to update the `api_base` in the configuration below to match your TGI server.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/tgi).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.phi_4]
routing = ["tgi"]
[models.phi_4.providers.tgi]
type = "tgi"
api_base = "http://host.docker.internal:8080/v1/" # for TGI running locally on the host
api_key_location = "none" # by default, TGI requires no API key
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "phi_4"
```
### Credentials
The `api_key_location` field in your model provider configuration specifies how to handle API key authentication:
* If your endpoint does not require an API key (e.g. TGI by default):
```toml theme={null}
api_key_location = "none"
```
* If your endpoint requires an API key, you have two options:
1. Configure it in advance through an environment variable:
```toml theme={null}
api_key_location = "env::ENVIRONMENT_VARIABLE_NAME"
```
You'll need to set the environment variable before starting the gateway.
2. Provide it at inference time:
```toml theme={null}
api_key_location = "dynamic::ARGUMENT_NAME"
```
The API key can then be passed in the inference request.
See the [Configuration Reference](/gateway/configuration-reference/) and the [API reference](/gateway/api-reference/inference/) for more details.
In this example, TGI is running locally without authentication, so we use `api_key_location = "none"`.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
# environment:
# - TGI_API_KEY=${TGI_API_KEY:?Environment variable TGI_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Getting Started with Together AI
Source: https://www.tensorzero.com/docs/integrations/model-providers/together
Learn how to use TensorZero with Together AI LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with the Together AI API.
## Simple Setup
You can use the short-hand `together::model_name` to use a Together AI model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use Together AI models in your TensorZero variants by setting the `model` field to `together::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "together::meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
```
Additionally, you can set `model_name` in the inference request to use a specific Together AI model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "together::meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and Together AI provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/together).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.llama3_1_8b_instruct_turbo]
routing = ["together"]
[models.llama3_1_8b_instruct_turbo.providers.together]
type = "together"
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "llama3_1_8b_instruct_turbo"
```
See the [list of models available on Together AI](https://docs.together.ai/docs/serverless-models).
Dedicated endpoints and custom models are also supported.
See the [Configuration Reference](/gateway/configuration-reference/) for optional fields (e.g. overwriting `api_base`).
### Credentials
You must set the `TOGETHER_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- TOGETHER_API_KEY=${TOGETHER_API_KEY:?Environment variable TOGETHER_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
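Alternatively, here's a minimal sketch of the same request through the gateway's OpenAI-compatible endpoint using the OpenAI Python SDK (assuming the gateway is running on `localhost:3000`):
```python theme={null}
from openai import OpenAI

client = OpenAI(api_key="not-used", base_url="http://localhost:3000/openai/v1")

response = client.chat.completions.create(
    # Routes the request to the TensorZero function defined above
    model="tensorzero::function_name::my_function_name",
    messages=[{"role": "user", "content": "What is the capital of Japan?"}],
)
print(response.choices[0].message.content)
```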
# Getting Started with vLLM
Source: https://www.tensorzero.com/docs/integrations/model-providers/vllm
Learn how to use TensorZero with self-hosted vLLM LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with self-hosted LLMs using vLLM.
We're using Llama 3.1 in this example, but you can use virtually any model supported by vLLM.
## Setup
This guide assumes that you are running vLLM locally with `vllm serve meta-llama/Llama-3.1-8B-Instruct`.
Make sure to update the `api_base` and `model_name` in the configuration below to match your vLLM server and model.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/vllm).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.llama3_1_8b_instruct]
routing = ["vllm"]
[models.llama3_1_8b_instruct.providers.vllm]
type = "vllm"
api_base = "http://host.docker.internal:8000/v1/" # for vLLM running locally on the host
model_name = "meta-llama/Llama-3.1-8B-Instruct"
api_key_location = "none" # by default, vLLM requires no API key
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "llama3_1_8b_instruct"
```
### Credentials
The `api_key_location` field in your model provider configuration specifies how to handle API key authentication:
* If your endpoint does not require an API key (e.g. vLLM by default):
```toml theme={null}
api_key_location = "none"
```
* If your endpoint requires an API key, you have two options:
1. Configure it in advance through an environment variable:
```toml theme={null}
api_key_location = "env::ENVIRONMENT_VARIABLE_NAME"
```
You'll need to set the environment variable before starting the gateway.
2. Provide it at inference time:
```toml theme={null}
api_key_location = "dynamic::ARGUMENT_NAME"
```
The API key can then be passed in the inference request.
See the [Credential Management](/operations/manage-credentials/) guide, the [Configuration Reference](/gateway/configuration-reference/), and the [API reference](/gateway/api-reference/inference/) for more details.
In this example, vLLM is running locally without authentication, so we use `api_key_location = "none"`.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
# environment:
# - VLLM_API_KEY=${VLLM_API_KEY:?Environment variable VLLM_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# Getting Started with xAI (Grok)
Source: https://www.tensorzero.com/docs/integrations/model-providers/xai
Learn how to use TensorZero with xAI (Grok) LLMs: open-source gateway, observability, optimization, evaluations, and experimentation.
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with the xAI API.
## Simple Setup
You can use the short-hand `xai::model_name` to use an xAI model with TensorZero, unless you need advanced features like fallbacks or custom credentials.
You can use xAI models in your TensorZero variants by setting the `model` field to `xai::model_name`.
For example:
```toml {3} theme={null}
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "xai::grok-4-1-fast-non-reasoning"
```
Additionally, you can set `model_name` in the inference request to use a specific xAI model, without having to configure a function and variant in TensorZero.
```bash {4} theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"model_name": "xai::grok-4-1-fast-non-reasoning",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
## Advanced Setup
In more complex scenarios (e.g. fallbacks, custom credentials), you can configure your own model and xAI provider in TensorZero.
For this minimal setup, you'll need just two files in your project directory:
```
- config/
- tensorzero.toml
- docker-compose.yml
```
You can also find the complete code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/providers/xai).
For production deployments, see our [Deployment Guide](/deployment/tensorzero-gateway/).
### Configuration
Create a minimal configuration file that defines a model and a simple chat function:
```toml title="config/tensorzero.toml" theme={null}
[models.grok_4_1_fast_non_reasoning]
routing = ["xai"]
[models.grok_4_1_fast_non_reasoning.providers.xai]
type = "xai"
model_name = "grok-4-1-fast-non-reasoning"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "grok_4_1_fast_non_reasoning"
```
See the [list of models available on xAI](https://docs.x.ai/docs/models).
### Credentials
You must set the `XAI_API_KEY` environment variable before running the gateway.
You can customize the credential location by setting the `api_key_location` to `env::YOUR_ENVIRONMENT_VARIABLE` or `dynamic::ARGUMENT_NAME`.
See the [Credential Management](/operations/manage-credentials/) guide and [Configuration Reference](/gateway/configuration-reference/) for more information.
### Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- XAI_API_KEY=${XAI_API_KEY:?Environment variable XAI_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
```
You can start the gateway with `docker compose up`.
## Inference
Make an inference request to the gateway:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-d '{
"function_name": "my_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "What is the capital of Japan?"
}
]
}
}'
```
# How to query historical inferences
Source: https://www.tensorzero.com/docs/observability/query-historical-inferences
Learn how to retrieve and filter historical inferences from the TensorZero Gateway.
You can query historical inferences to analyze model behavior, debug issues, export data for fine-tuning, and more.
The [TensorZero UI](/deployment/tensorzero-ui) provides an interface to browse and filter historical inferences.
You can also query historical inferences programmatically using the TensorZero Gateway.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/observability/query-historical-inferences) of this guide on GitHub.
## Query historical inferences by ID
Retrieve specific inferences when you know their IDs:
* HTTP API: `POST /v1/inferences/get_inferences`
* TensorZero SDK: `client.get_inferences(...)`
### Request
* `ids`: List of inference IDs (UUIDs) to retrieve.
* `function_name`: Filter by function name. Including this improves query performance since `function_name` is the first column in the ClickHouse primary key.
* `output_source`: Source of the output to return:
  * `"inference"`: Returns the original model output
  * `"demonstration"`: Returns human-curated feedback output (ignores inferences without one)
  * `"none"`: Returns the inference without output
You can retrieve inferences by ID using the TensorZero Python SDK.
```python theme={null}
from tensorzero import TensorZeroGateway
t0 = TensorZeroGateway.build_http(gateway_url="http://localhost:3000")
t0.get_inferences(ids=["00000000-0000-0000-0000-000000000000"])
```
You can retrieve inferences by ID using the HTTP API.
```bash theme={null}
curl -X POST http://localhost:3000/v1/inferences/get_inferences \
-H "Content-Type: application/json" \
-d '{"ids": ["00000000-0000-0000-0000-000000000000"]}'
```
### Response
The response includes the following information for each inference:
* Outputs marked as dispreferred via feedback. This field is only available if you set `output_source` to `demonstration`. It is primarily used for preference-based optimization (e.g. DPO).
* Episode (UUID) this inference belongs to.
* Name of the function called.
* Unique identifier (UUID) for the inference.
* Inference parameters like temperature, `max_tokens`, etc.
* The input provided (system prompt, messages).
* The inference output (content blocks for chat functions, JSON for JSON functions).
* Total processing time in milliseconds.
* Key-value tags associated with the inference.
* When the inference was made (RFC 3339 format).
* Time to first token in milliseconds.
* Name of the variant used.
## Query historical inferences with filters
List inferences with filtering, pagination, and sorting:
* HTTP API: `POST /v1/inferences/list_inferences`
* TensorZero SDK: `client.list_inferences(request=ListInferencesRequest(...))`
### Request
* `after`: Cursor pagination: get inferences after this ID (exclusive). Cannot be used with `before` or `offset`.
* `before`: Cursor pagination: get inferences before this ID (exclusive). Cannot be used with `after` or `offset`.
* Episode ID: Filter by episode ID (UUID).
* `filters`: Advanced filtering by metrics, tags, time, and demonstration feedback. Filters can be combined using logical operators (`and`, `or`, `not`). Each filter object has a `type` field:
  * `"and"`: Logical AND of multiple filters. Takes an array of filters to AND together.
  * `"boolean_metric"`: Filter by a boolean metric. Specify the name of the metric and the value to match (`true` or `false`).
  * `"demonstration_feedback"`: Filter by whether the inference has demonstration feedback.
  * `"float_metric"`: Filter by a numeric metric value. Specify the name of the metric, a comparison operator (one of `<`, `<=`, `=`, `>`, `>=`, `!=`), and the value to compare against.
  * `"not"`: Logical NOT of a filter. Takes the filter to negate.
  * `"or"`: Logical OR of multiple filters. Takes an array of filters to OR together.
  * `"tag"`: Filter by tags. Specify the tag key, the tag value, and a comparison operator (one of `=`, `!=`).
  * `"time"`: Filter by timestamp. Specify a timestamp in RFC 3339 format and a comparison operator (one of `<`, `<=`, `=`, `>`, `>=`, `!=`).
* `function_name`: Filter by function name.
* `limit`: Maximum number of results to return.
* `offset`: Pagination offset.
* Sort criteria: You can specify multiple sort criteria, each with a direction (`"ascending"` or `"descending"`):
  * `"metric"`: Sort by a metric value (specify the name of the metric to sort by).
  * `"search_relevance"`: Sort by search relevance (requires `search_query_experimental`).
  * `"timestamp"`: Sort by creation timestamp.
* `output_source`: Source of the output to return:
  * `"inference"`: Returns the original model output
  * `"demonstration"`: Returns human-curated feedback output (ignores inferences without one)
  * `"none"`: Returns the inference without output
* `search_query_experimental`: Full-text search query (experimental, may cause full table scans).
* Variant name: Filter by variant name.
You can list inferences with filters using the TensorZero Python SDK.
```python theme={null}
from tensorzero import TensorZeroGateway, ListInferencesRequest, InferenceFilterTag
t0 = TensorZeroGateway.build_http(gateway_url="http://localhost:3000")
t0.list_inferences(
request=ListInferencesRequest(
filters=InferenceFilterTag(
key="my_tag",
value="my_value",
comparison_operator="=",
),
limit=10,
)
)
```
You can list inferences with filters using the HTTP API.
```bash theme={null}
curl -X POST http://localhost:3000/v1/inferences/list_inferences \
-H "Content-Type: application/json" \
-d '{
"filters": {
"type": "tag",
"key": "my_tag",
"value": "my_value",
"comparison_operator": "="
},
"limit": 10
}'
```
### Response
The response includes the following information for each inference:
* Outputs marked as dispreferred via feedback. This field is only available if you set `output_source` to `demonstration`. It is primarily used for preference-based optimization (e.g. DPO).
* Episode (UUID) this inference belongs to.
* Name of the function called.
* Unique identifier (UUID) for the inference.
* Inference parameters like temperature, `max_tokens`, etc.
* The input provided (system prompt, messages).
* The inference output (content blocks for chat functions, JSON for JSON functions).
* Total processing time in milliseconds.
* Key-value tags associated with the inference.
* When the inference was made (RFC 3339 format).
* Time to first token in milliseconds.
* Name of the variant used.
# Centralize auth, rate limits, and more
Source: https://www.tensorzero.com/docs/operations/centralize-auth-rate-limits-and-more
Learn how to use gateway relay to centralize auth, rate limits, and credentials while letting teams manage their own TensorZero deployments.
This feature is primarily for large organizations with complex deployment and governance needs.
With gateway relay, an LLM inference request can be routed through multiple independent TensorZero Gateway deployments before reaching a model provider.
This enables you to enforce organization-wide controls (e.g. [auth](/operations/set-up-auth-for-tensorzero), [rate limits](/operations/enforce-custom-rate-limits), [credentials](/operations/manage-credentials)) without restricting how teams build their LLM features.
A typical setup has two tiers:
* **Edge Gateways:** Each team runs their own gateway to manage prompts, functions, metrics, datasets, experimentation, and more.
* **Relay Gateway:** A central gateway enforces organization-wide controls. Edge gateways forward requests here.
This guide shows you how to set up a two-tier TensorZero Gateway deployment that manages credentials in the relay.
## Configure
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/operations/centralize-auth-rate-limits-and-more/simple) of this guide on GitHub.
You can configure [auth](/operations/set-up-auth-for-tensorzero), [rate limits](/operations/enforce-custom-rate-limits), [credentials](/operations/manage-credentials), and other organization-wide controls in the relay gateway. See below for an example that enforces auth on the relay.
We'll keep this example minimal and use the default gateway configuration for the relay gateway.
Configure the edge gateway to route inference requests to the relay gateway:
```toml title="edge-config/tensorzero.toml" theme={null}
[gateway.relay]
gateway_url = "http://relay-gateway:3000" # base URL configured in Docker Compose below
```
Let's deploy both gateways, but only provide API keys to the relay gateway.
```yaml title="docker-compose.yml" theme={null}
services:
edge-gateway:
image: tensorzero/gateway
volumes:
# Mount our tensorzero.toml file into the container
- ./edge-config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
relay-gateway:
image: tensorzero/gateway
command: --default-config
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
extra_hosts:
- "host.docker.internal:host-gateway"
```
If you're planning to set up ClickHouse or Postgres for both gateways, make sure they use separate logical databases.
It's fine for them to share the same deployment or cluster.
Make an inference request to the edge gateway like you normally would.
You can use either the TensorZero Inference API or the OpenAI-compatible Inference API.
To keep things simple, let's make a request using `curl`:
```bash theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"model_name": "openai::gpt-5-mini",
"input": {
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
}
}'
```
The edge gateway forwards the request through the relay gateway, and you should receive a response like:
```json theme={null}
{
"inference_id": "01940627-935f-7fa1-a398-e1f57f18064a",
"episode_id": "01940627-8fe2-75d3-9b65-91be2c7ba622",
"variant_name": "gpt-5-mini",
"content": [
{
"type": "text",
"text": "Wires hum with pure thought, \nDreams of codes in twilight's glow, \nBeyond human touch."
}
],
"usage": {
"input_tokens": 15,
"output_tokens": 23
}
}
```
## Advanced
### Set up auth for the relay gateway
You can set up auth for the relay gateway to control which edge gateways are allowed to forward requests through it.
This ensures that only authorized teams can access the relay gateway and helps you enforce security policies across your organization.
When auth is enabled on the relay gateway, edge gateways must provide valid credentials (API keys) to authenticate their requests.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/operations/centralize-auth-rate-limits-and-more/auth) of this guide on GitHub.
```toml title="relay-config/tensorzero.toml" theme={null}
[gateway]
auth.enabled = true
```
See [Set up auth for TensorZero](/operations/set-up-auth-for-tensorzero) for details.
Add `api_key_location` to your edge gateway's configuration and provide the relevant credentials.
For example, let's configure the gateway to look for the API key in the `TENSORZERO_RELAY_API_KEY` environment variable:
```toml title="edge-config/tensorzero.toml" theme={null}
[gateway.relay]
# ...
api_key_location = "env::TENSORZERO_RELAY_API_KEY"
# ...
```
Finally, provide the API key to the edge gateway:
```bash theme={null}
export TENSORZERO_RELAY_API_KEY="sk-t0-..."
```
See [Configuration Reference](/gateway/configuration-reference) for more details on `api_key_location`.
### Bypass the relay for specific requests
When a relay gateway is configured, the edge gateway will route every inference request through it by default.
However, you may want to bypass the relay in some scenarios.
You can circumvent the relay for specific requests by configuring a custom model with `skip_relay = true` in the edge gateway:
```toml title="edge-config/tensorzero.toml" theme={null}
[models.gpt_5_edge]
routing = ["openai"]
skip_relay = true
[models.gpt_5_edge.providers.openai]
type = "openai"
model_name = "gpt-5"
```
When you make an inference call to the `gpt_5_edge` model, the edge gateway will bypass the relay and call OpenAI directly using credentials available on the edge gateway.
The edge gateway must have the necessary provider credentials configured to make direct requests.
Models that skip the relay won't benefit from centralized rate limits, auth policies, or credential management enforced by the relay gateway.
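As an illustration, here's a minimal sketch of calling this model via `model_name` with the Python SDK; because `skip_relay = true`, the edge gateway calls OpenAI directly instead of forwarding the request through the relay:
```python theme={null}
from tensorzero import TensorZeroGateway

t0 = TensorZeroGateway.build_http(gateway_url="http://localhost:3000")  # edge gateway

response = t0.inference(
    model_name="gpt_5_edge",  # bypasses the relay per `skip_relay = true`
    input={
        "messages": [
            {"role": "user", "content": "Write a haiku about TensorZero."}
        ]
    },
)
print(response)
```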
### Set up dynamic credentials for the relay gateway
You can pass provider credentials dynamically at inference time, and the edge gateway will forward them to the relay gateway.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/operations/centralize-auth-rate-limits-and-more/dynamic-credentials) of this guide on GitHub.
```toml title="relay-config/tensorzero.toml" theme={null}
[provider_types.openai.defaults]
api_key_location = "dynamic::openai_api_key"
```
You can configure dynamic credentials for specific models or entire providers.
See [Manage credentials](/operations/manage-credentials) for more details on dynamic credentials.
Pass the credentials in your inference request to the edge gateway.
The edge gateway will forward them to the relay gateway.
```bash theme={null}
curl -X POST "http://localhost:3000/inference" \
-H "Content-Type: application/json" \
-d '{
"model_name": "openai::gpt-5-mini",
"input": {
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
},
"credentials": {
"openai_api_key": "sk-..."
}
}'
```
# Enforce custom rate limits
Source: https://www.tensorzero.com/docs/operations/enforce-custom-rate-limits
Learn how to set up granular custom rate limits for your TensorZero Gateway.
The TensorZero Gateway supports granular custom rate limits to help you control usage and costs.
Rate limit rules have three key components:
* **Resources:** Define what you're limiting (like model inferences or tokens) and the time window (per second, hour, day, week, or month). For example, "1000 model inferences per day" or "500,000 tokens per hour".
* **Priority:** Control which rules take precedence when multiple rules could apply to the same request. Higher priority numbers override lower ones.
* **Scope:** Determine which requests the rule applies to. You can set global limits for all requests, or targeted limits using custom tags like user IDs.
## Learn rate limiting concepts
Let's start with a brief tutorial on the concepts behind custom rate limits in TensorZero.
You can define custom rate limiting *rules* in your TensorZero configuration using `[[rate_limiting.rules]]`.
Your configuration can have multiple rules.
Rate limit state is stored in a backend database (Valkey or Postgres), so restarting the gateway preserves existing limits and multiple gateway instances automatically share the same limits.
Tracking begins when a rate limit rule is first applied to a request.
Requests made before a rule was configured do not count towards its limit.
Modifying a rate limit rule resets its usage.
### Resources
Each rate limiting rule can have one or more *resource limits*.
A resource limit is defined using the `RESOURCE_per_WINDOW` syntax.
For example:
```toml title="tensorzero.toml" theme={null}
[[rate_limiting.rules]]
# ...
model_inferences_per_day = 1_000
tokens_per_second = 1_000_000
# ...
```
Time windows are sequential and non-overlapping (i.e. not sliding windows).
They are aligned to when each rate limit bucket is first initialized.
For example, if a rule with a `RESOURCE_per_minute` limit is first used at 10:30:15, it'll be refilled at 10:31:15, 10:32:15, and so on.
You must specify `max_tokens` for a request if a token limit applies to it.
The gateway makes a reasonably conservative estimate of token usage and later records the actual usage.
### Scope
Each rate limiting rule can optionally have a *scope*.
The scope restricts the rule to certain requests only.
If you don't specify a scope, the rule will apply to all requests.
You can scope rate limiting rules by tags or by API key public ID.
#### By tags
You can scope rate limits using user-defined `tags`.
You can limit the scope to a specific value, to each individual value (`tensorzero::each`), or to every value collectively (`tensorzero::total`).
For example, the following rule would only apply to inference requests with the tag `user_id` set to `intern`:
```toml title="tensorzero.toml" theme={null}
[[rate_limiting.rules]]
# ...
scope = [
{ tag_key = "user_id", tag_value = "intern" }
]
#...
```
If a scope has multiple entries, all of them must be met for the rule to apply.
For example, the following rule would only apply to inference requests with the tag `user_id` set to `intern` *and* the tag `env` set to `production`:
```toml title="tensorzero.toml" theme={null}
[[rate_limiting.rules]]
# ...
scope = [
{ tag_key = "user_id", tag_value = "intern" },
{ tag_key = "env", tag_value = "production" }
]
#...
```
Entries based on `tags` support two special strings for `tag_value`:
* `tensorzero::each`: The rule independently applies to every `tag_key` value.
* `tensorzero::total`: The limits are summed across all values of the tag.
For example, the following rule would apply to each value of the `user_id` tag individually (i.e. each user gets their own limit):
```toml title="tensorzero.toml" theme={null}
[[rate_limiting.rules]]
# ...
scope = [
{ tag_key = "user_id", tag_value = "tensorzero::each" },
]
#...
```
Conversely, the following rule would apply to all users collectively:
```toml title="tensorzero.toml" theme={null}
[[rate_limiting.rules]]
# ...
scope = [
{ tag_key = "user_id", tag_value = "tensorzero::total" },
]
#...
```
The rule above won't apply to requests that do not specify a `user_id` tag.
#### By API keys
You can scope rate limits using API keys when authentication is enabled.
This allows you to enforce different rate limits for different API keys, which is useful for implementing tiered access or preventing individual keys from consuming too many resources.
You can limit the scope to each individual API key (`tensorzero::each`) or to a specific API key by providing its 12-character public ID.
For example, the following rule would apply to each API key individually (i.e. each API key gets its own limit):
```toml title="tensorzero.toml" theme={null}
[[rate_limiting.rules]]
# ...
scope = [
{ api_key_public_id = "tensorzero::each" },
]
#...
```
You can also target a specific API key by providing its 12-character public ID:
```toml title="tensorzero.toml" theme={null}
[[rate_limiting.rules]]
# ...
scope = [
{ api_key_public_id = "xxxxxxxxxxxx" },
]
#...
```
TensorZero API keys have the following format:
`sk-t0-xxxxxxxxxxxx-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy`
The `xxxxxxxxxxxx` portion is the 12-character public ID that you can use in rate limiting rules.
The remaining portion of the key is secret and should be kept secure.
Unlike tag scopes, API key public ID scopes do not support `tensorzero::total`.
Only `tensorzero::each` and concrete 12-character public IDs are supported.
Rules with `api_key_public_id` scope won't apply to unauthenticated requests.
Learn how to [set up auth for TensorZero](/operations/set-up-auth-for-tensorzero).
### Priority
Each rate limiting rule must have a *priority* (e.g. `priority = 1`).
The gateway iterates through the rules in order of priority, starting with the highest priority, until it finds a matching rate limit; once it does, it enforces all rules with that priority number and disregards any rules with lower priority.
For example, the configuration below would enforce the first rule for requests with `user_id = "intern"` and the second rule for all other `user_id` values:
```toml title="tensorzero.toml" theme={null}
[[rate_limiting.rules]]
# ...
scope = [
{ tag_key = "user_id", tag_value = "intern" },
]
priority = 1
#...
[[rate_limiting.rules]]
# ...
scope = [
{ tag_key = "user_id", tag_value = "tensorzero::each" },
]
priority = 0
#...
```
Alternatively, you can set `always = true` to enforce the rule regardless of other rules; rules with `always = true` do not affect the priority calculation above.
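To make the selection logic described above concrete, here's a small conceptual sketch in Python; it illustrates the documented behavior and is not TensorZero's actual implementation:
```python theme={null}
# Conceptual illustration of rate limit rule selection (not TensorZero's implementation).
# Each rule has a "priority" (int) or "always" (bool), plus a "matches" predicate over the request.

def select_rules(rules, request):
    matching = [rule for rule in rules if rule["matches"](request)]
    # Rules with `always = true` are enforced regardless of priority.
    enforced = [rule for rule in matching if rule.get("always")]
    prioritized = [rule for rule in matching if not rule.get("always")]
    if prioritized:
        # Enforce all matching rules at the highest priority; ignore lower priorities.
        top_priority = max(rule["priority"] for rule in prioritized)
        enforced += [rule for rule in prioritized if rule["priority"] == top_priority]
    return enforced
```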
## Set up rate limits
Let's set up rate limits for an application to restrict usage based on a user-defined tag for user IDs.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/operations/enforce-custom-rate-limits) of this guide on GitHub.
Rate limiting requires either Valkey (Redis) or Postgres as a backend.
We recommend Valkey over Postgres if you're handling 100+ QPS or have strict latency requirements.
TensorZero's rate limiting implementation can achieve sub-millisecond P99 latency at 10k+ QPS using Valkey.
Deploy Valkey and set the `TENSORZERO_VALKEY_URL` environment variable.
See the [Deploy Valkey](/deployment/valkey-redis) guide for instructions.
[Deploy Postgres](/deployment/postgres) and set the `TENSORZERO_POSTGRES_URL` environment variable.
See the [Deploy Postgres](/deployment/postgres) guide for instructions.
If both `TENSORZERO_VALKEY_URL` and `TENSORZERO_POSTGRES_URL` are set, the gateway uses Valkey for rate limiting.
Add to your TensorZero configuration:
```toml title="config/tensorzero.toml" theme={null}
# [A] Collectively, all users can make a maximum of 1k model inferences per hour and 10M tokens per day
[[rate_limiting.rules]]
always = true
model_inferences_per_hour = 1_000
tokens_per_day = 10_000_000
scope = [
{ tag_key = "user_id", tag_value = "tensorzero::total" }
]
# [B] Each individual user can make a maximum of 1 model inference per minute
[[rate_limiting.rules]]
priority = 0
model_inferences_per_minute = 1
scope = [
{ tag_key = "user_id", tag_value = "tensorzero::each" }
]
# [C] But override the individual limit for the CEO
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 5
scope = [
{ tag_key = "user_id", tag_value = "ceo" }
]
# [D] The entire system (i.e. without restricting the scope) can make a maximum of 10M tokens per hour
[[rate_limiting.rules]]
always = true
tokens_per_hour = 10_000_000
```
Make sure to reload your gateway.
If we make two consecutive inference requests with `user_id = "intern"`, the second one should fail because of rule `[B]`.
However, if we make two consecutive inference requests with `user_id = "ceo"`, both should succeed because rule `[C]` will override rule `[B]`.
```python theme={null}
from tensorzero import TensorZeroGateway
t0 = TensorZeroGateway.build_http(gateway_url="http://localhost:3000")
def call_llm(user_id):
try:
return t0.inference(
model_name="openai::gpt-4.1-mini",
input={
"messages": [
{
"role": "user",
"content": "Tell me a fun fact.",
}
]
},
# We have rate limits on tokens, so we must be conservative and provide `max_tokens`
params={
"chat_completion": {
"max_tokens": 1000,
}
},
tags={
"user_id": user_id,
},
)
except Exception as e:
print(f"Error calling LLM: {e}")
# The second should fail
print(call_llm("intern"))
print(call_llm("intern")) # should return None
# Both should work
print(call_llm("ceo"))
print(call_llm("ceo"))
```
```python theme={null}
from openai import OpenAI
oai = OpenAI(base_url="http://localhost:3000/openai/v1")
def call_llm(user_id):
try:
return oai.chat.completions.create(
model="tensorzero::model_name::openai::gpt-4.1-mini",
messages=[
{
"role": "user",
"content": "Tell me a fun fact.",
}
],
max_tokens=1000,
extra_body={"tensorzero::tags": {"user_id": user_id}},
)
except Exception as e:
print(f"Error calling LLM: {e}")
# The second should fail
print(call_llm("intern"))
print(call_llm("intern")) # should return None
# Both should work
print(call_llm("ceo"))
print(call_llm("ceo"))
```
## Advanced
### Customize capacity and refill rate
By default, rate limits use a simple bucket model where the entire capacity refills at the start of each time window.
For example, `tokens_per_minute = 100_000` allows 100,000 tokens every minute, with the full allowance resetting at the top of each minute.
However, you can customize this behavior using the `capacity` and `refill_rate` parameters to create a token bucket that refills continuously:
```toml theme={null}
[[rate_limiting.rules]]
# ...
tokens_per_minute = { capacity = 100_000, refill_rate = 10_000 }
# ...
```
In this example, the `capacity` parameter sets the maximum number of tokens that can be stored in the bucket, while the `refill_rate` determines how many tokens are added to the bucket per time window (10,000 per minute).
This creates smoother rate limiting behavior: instead of receiving your full allowance at the start of each minute, you accrue 10,000 tokens over the course of each minute, up to a maximum of 100,000 tokens stored at any time.
To achieve these benefits, you'll typically want to use a short time window with a `capacity` much larger than the `refill_rate`.
This approach is particularly useful for burst protection (users can't consume their entire daily allowance in the first few seconds), smoother traffic distribution (requests are naturally spread out over time rather than clustering at window boundaries), and a better user experience (users get a steady trickle of quota rather than having to wait for the next time window).
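To make the arithmetic concrete, here's a small sketch of how a continuously refilling token bucket with `capacity = 100_000` and `refill_rate = 10_000` per minute behaves; it's an illustration of the concept, not TensorZero's internal implementation:
```python theme={null}
# Conceptual illustration of a continuously refilling token bucket (not TensorZero's implementation).
CAPACITY = 100_000    # maximum tokens that can be stored in the bucket
REFILL_RATE = 10_000  # tokens added per minute
tokens = CAPACITY     # start with a full bucket

def consume(amount: int, minutes_since_last_request: float) -> bool:
    """Refill based on elapsed time, then try to consume `amount` tokens."""
    global tokens
    tokens = min(CAPACITY, tokens + REFILL_RATE * minutes_since_last_request)
    if amount <= tokens:
        tokens -= amount
        return True
    return False  # request would be rate limited

# A burst that drains the bucket, then a steady trickle of quota afterwards:
print(consume(100_000, 0))  # True: the full capacity can be used at once
print(consume(20_000, 1))   # False: only ~10,000 tokens have refilled after one minute
print(consume(10_000, 0))   # True: the refilled ~10,000 tokens are available
```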
### Centralize rate limits across multiple TensorZero deployments
If you have multiple TensorZero deployments (e.g. one per team), you can centralize rate limiting using gateway relay.
With gateway relay, an LLM inference request can be routed through multiple independent TensorZero Gateway deployments before reaching a model provider.
This enables you to enforce organization-wide controls without restricting how teams build their LLM features.
See [Centralize auth, rate limits, and more](/operations/centralize-auth-rate-limits-and-more) for details.
# Export OpenTelemetry traces (OTLP)
Source: https://www.tensorzero.com/docs/operations/export-opentelemetry-traces
Learn how to export traces from the TensorZero Gateway to an external OpenTelemetry-compatible observability system.
The TensorZero Gateway can export traces to an external OpenTelemetry-compatible observability system using OTLP.
Exporting traces via OpenTelemetry allows you to monitor the TensorZero Gateway in external observability platforms such as Jaeger, Datadog, or Grafana.
This integration enables you to correlate gateway activity with the rest of your infrastructure, providing deeper insights and unified monitoring across your systems.
Exporting traces via OpenTelemetry does not replace the core observability features built into TensorZero.
Many key TensorZero features (including optimization) require richer observability data that TensorZero collects and stores in your ClickHouse database.
Traces exported through OpenTelemetry are for external observability only.
The TensorZero Gateway also provides a Prometheus-compatible metrics endpoint at `/metrics`.
This endpoint includes metrics about the gateway itself rather than the data processed by the gateway.
See [Export Prometheus metrics](/operations/export-prometheus-metrics) for more details.
## Configure
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/guides/opentelemetry-otlp) exporting traces to Jaeger on GitHub.
Enable `export.otlp.traces.enabled` in the `[gateway]` section of the `tensorzero.toml` configuration file:
```toml theme={null}
[gateway]
# ...
export.otlp.traces.enabled = true
# ...
```
Set the `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` environment variable in the gateway container to the endpoint of your OpenTelemetry service.
TensorZero only supports gRPC endpoints for OTLP trace export. HTTP endpoints
are not supported.
For example, if you're deploying the TensorZero Gateway and Jaeger in Docker Compose, you can set the following environment variable:
```yaml theme={null}
services:
gateway:
image: tensorzero/gateway
environment:
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: http://jaeger:4317
# ...
jaeger:
image: jaegertracing/jaeger
ports:
- "4317:4317"
# ...
```
Once configured, the TensorZero Gateway will begin sending traces to your OpenTelemetry-compatible service.
Traces are generated for each HTTP request handled by the gateway (excluding auxiliary endpoints).
For inference requests, these traces additionally contain spans that represent the processing of functions, variants, models, and model providers.
*Example: screenshot of a TensorZero Gateway inference request trace in Jaeger.*
## Customize
### Send custom HTTP headers
You can attach custom HTTP headers to the outgoing OTLP export requests made to `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`.
#### Define custom headers in the configuration
You can configure static headers that will be included in all OTLP export requests by adding them to the `export.otlp.traces.extra_headers` field in your configuration file:
```toml title="tensorzero.toml" theme={null}
[gateway.export.otlp.traces]
# ...
extra_headers.space_id = "my-workspace-123"
extra_headers."X-Environment" = "production"
# ...
```
#### Define custom headers during inference
You can also send custom headers dynamically on a per-request basis.
When there is a conflict between static and dynamic headers, the latter takes precedence.
When using the TensorZero Python SDK, you can pass dynamic OTLP headers using the `otlp_traces_extra_headers` parameter in the `inference` method.
The headers will be automatically prefixed with `tensorzero-otlp-traces-extra-header-` for you:
```python theme={null}
response = t0.inference(
function_name="your_function_name",
input={
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
]
},
otlp_traces_extra_headers={
"user-id": "user-123",
"request-source": "mobile-app",
},
)
```
This will attach the headers `user-id: user-123` and `request-source: mobile-app` when exporting any span associated with that specific inference request.
When using the OpenAI Python SDK with the TensorZero OpenAI-compatible endpoint, you can pass dynamic OTLP headers using the `extra_headers` parameter.
You must prefix header names with `tensorzero-otlp-traces-extra-header-`:
```python theme={null}
from openai import OpenAI
client = OpenAI(api_key="not-used", base_url="http://localhost:3000/openai/v1")
result = client.chat.completions.create(
model="tensorzero::function_name::your_function",
messages=[
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
],
extra_headers={
"tensorzero-otlp-traces-extra-header-user-id": "user-123",
"tensorzero-otlp-traces-extra-header-request-source": "mobile-app",
},
)
```
This will attach the headers `user-id: user-123` and `request-source: mobile-app` when exporting any span associated with that specific inference request.
When using the OpenAI Node SDK with the TensorZero OpenAI-compatible endpoint, you can pass dynamic OTLP headers using the `headers` option in the second parameter.
You must prefix header names with `tensorzero-otlp-traces-extra-header-`:
```typescript theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "not-used",
baseURL: "http://localhost:3000/openai/v1",
});
const result = await client.chat.completions.create(
{
model: "tensorzero::function_name::your_function",
messages: [
{
role: "user",
content: "Write a haiku about TensorZero.",
},
],
},
{
headers: {
"tensorzero-otlp-traces-extra-header-user-id": "user-123",
"tensorzero-otlp-traces-extra-header-request-source": "mobile-app",
},
},
);
```
This will attach the headers `user-id: user-123` and `request-source: mobile-app` when exporting any span associated with that specific inference request.
When making a request to a TensorZero HTTP endpoint, add a header prefixed with `tensorzero-otlp-traces-extra-header-`:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "Content-Type: application/json" \
-H "tensorzero-otlp-traces-extra-header-user-id: user-123" \
-H "tensorzero-otlp-traces-extra-header-request-source: mobile-app" \
-d '{
"function_name": "your_function_name",
"input": {
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero."
}
]
}
}'
```
This will attach the headers `user-id: user-123` and `request-source: mobile-app` when exporting any span associated with that specific API request.
### Send custom OpenTelemetry attributes
You can attach custom span attributes using headers prefixed with `tensorzero-otlp-traces-extra-attribute-`.
The values must be valid JSON; TensorZero currently supports strings and booleans only.
For example:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "tensorzero-otlp-traces-extra-attribute-user_id: \"user-123\"" \
-H "tensorzero-otlp-traces-extra-attribute-is_premium: true" \
-d '{ ... }'
```
### Send custom OpenTelemetry resources
You can attach custom resource attributes using headers prefixed with `tensorzero-otlp-traces-extra-resource-`.
For example:
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "tensorzero-otlp-traces-extra-resource-service.namespace: production" \
-d '{ ... }'
```
### Link to existing traces with `traceparent`
TensorZero automatically handles incoming `traceparent` headers for distributed tracing when OTLP is enabled.
This follows the [W3C Trace Context standard](https://www.w3.org/TR/trace-context/).
```bash theme={null}
curl -X POST http://localhost:3000/inference \
-H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
-d '{ ... }'
```
TensorZero spans will become children of the incoming trace, preserving the trace ID across services.
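For example, here's a minimal sketch that uses the OpenTelemetry Python SDK to propagate the current trace context into the request, so the gateway's spans join your application's trace (the span name and function name are placeholders):
```python theme={null}
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider

# Set up a minimal tracer provider so there is an active span to propagate.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("my-application-request"):
    headers = {"Content-Type": "application/json"}
    inject(headers)  # adds the `traceparent` header for the current span (W3C Trace Context)
    response = requests.post(
        "http://localhost:3000/inference",
        headers=headers,
        json={
            "function_name": "your_function_name",
            "input": {
                "messages": [
                    {"role": "user", "content": "Write a haiku about TensorZero."}
                ]
            },
        },
    )
    print(response.json())
```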
### Export OpenInference traces
By default, TensorZero exports traces with attributes that follow the [OpenTelemetry Generative AI semantic conventions](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).
You can instead choose to export traces with attributes that follow the [OpenInference semantic conventions](https://github.com/Arize-ai/openinference/blob/main/spec/llm_spans.md) by setting `export.otlp.traces.format = "openinference"` in your configuration file.
See [Configuration Reference](/gateway/configuration-reference/) for more details.
# Export Prometheus metrics
Source: https://www.tensorzero.com/docs/operations/export-prometheus-metrics
Learn how the TensorZero Gateway exports Prometheus-compatible metrics for monitoring and debugging.
The TensorZero Gateway exposes runtime metrics through a [Prometheus](https://prometheus.io/)-compatible endpoint.
This allows you to monitor gateway performance, track usage patterns, and set up alerting using standard Prometheus tooling.
This endpoint provides operational metrics about the gateway itself.
It's not meant to replace TensorZero's observability features.
You can access the metrics by scraping the `/metrics` endpoint.
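For a quick sanity check, you can fetch the endpoint directly; here's a minimal sketch in Python that prints the TensorZero metric lines (assuming the gateway is running on `localhost:3000`):
```python theme={null}
import requests

# Scrape the gateway's Prometheus-compatible endpoint and print TensorZero metrics.
metrics = requests.get("http://localhost:3000/metrics").text
for line in metrics.splitlines():
    if line.startswith("tensorzero_"):
        print(line)
```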
## `tensorzero_inference_latency_overhead_seconds`
This metric tracks the latency overhead introduced by TensorZero on inference requests.
It measures the total request duration minus the time spent waiting for external model provider HTTP requests.
This is useful for understanding how much latency TensorZero adds to your inference requests, independently of model provider latency.
This metric is reported as a summary with quantiles (e.g. p50, p90, p99).
```txt title="GET /metrics" theme={null}
# HELP tensorzero_inference_latency_overhead_seconds Overhead of TensorZero on HTTP requests
# TYPE tensorzero_inference_latency_overhead_seconds summary
tensorzero_inference_latency_overhead_seconds{function_name="tensorzero::default",variant_name="openai::gpt-5-mini",quantile="0"} 0.087712334
tensorzero_inference_latency_overhead_seconds{function_name="tensorzero::default",variant_name="openai::gpt-5-mini",quantile="0.5"} 0.08771169702129712
tensorzero_inference_latency_overhead_seconds{function_name="tensorzero::default",variant_name="openai::gpt-5-mini",quantile="0.9"} 0.08771169702129712
tensorzero_inference_latency_overhead_seconds{function_name="tensorzero::default",variant_name="openai::gpt-5-mini",quantile="0.95"} 0.08771169702129712
tensorzero_inference_latency_overhead_seconds{function_name="tensorzero::default",variant_name="openai::gpt-5-mini",quantile="0.99"} 0.08771169702129712
tensorzero_inference_latency_overhead_seconds{function_name="tensorzero::default",variant_name="openai::gpt-5-mini",quantile="0.999"} 0.08771169702129712
tensorzero_inference_latency_overhead_seconds{function_name="tensorzero::default",variant_name="openai::gpt-5-mini",quantile="1"} 0.087712334
tensorzero_inference_latency_overhead_seconds_sum{function_name="tensorzero::default",variant_name="openai::gpt-5-mini"} 0.087712334
tensorzero_inference_latency_overhead_seconds_count{function_name="tensorzero::default",variant_name="openai::gpt-5-mini"} 1
```
## `tensorzero_inference_latency_overhead_seconds_histogram`
This metric is an optional histogram variant of `tensorzero_inference_latency_overhead_seconds` (see above).
It provides traditional histogram buckets instead of pre-computed quantiles, which is useful if you want to compute custom quantiles or aggregate across multiple instances.
To enable it, configure the histogram buckets in your configuration file:
```toml title="tensorzero.toml" theme={null}
[gateway.metrics]
tensorzero_inference_latency_overhead_seconds_histogram_buckets = [0.001, 0.01, 0.1]
```
```txt title="GET /metrics" theme={null}
# HELP tensorzero_inference_latency_overhead_seconds_histogram Overhead of TensorZero on HTTP requests (histogram)
# TYPE tensorzero_inference_latency_overhead_seconds_histogram histogram
tensorzero_inference_latency_overhead_seconds_histogram_bucket{function_name="my_function",variant_name="my_variant",le="0.001"} 0
tensorzero_inference_latency_overhead_seconds_histogram_bucket{function_name="my_function",variant_name="my_variant",le="0.01"} 5
tensorzero_inference_latency_overhead_seconds_histogram_bucket{function_name="my_function",variant_name="my_variant",le="0.1"} 10
tensorzero_inference_latency_overhead_seconds_histogram_bucket{function_name="my_function",variant_name="my_variant",le="+Inf"} 10
tensorzero_inference_latency_overhead_seconds_histogram_sum{function_name="my_function",variant_name="my_variant"} 0.025
tensorzero_inference_latency_overhead_seconds_histogram_count{function_name="my_function",variant_name="my_variant"} 10
```
## `tensorzero_inferences_total`
This metric counts the total number of inferences performed by TensorZero.
```txt title="GET /metrics" theme={null}
# HELP tensorzero_inferences_total Inferences performed by TensorZero
# TYPE tensorzero_inferences_total counter
tensorzero_inferences_total{endpoint="inference",function_name="my_function",model_name="gpt-4o-mini-2024-07-18"} 1
```
## `tensorzero_requests_total`
This metric counts the total number of requests handled by TensorZero.
```txt title="GET /metrics" theme={null}
# HELP tensorzero_requests_total Requests handled by TensorZero
# TYPE tensorzero_requests_total counter
tensorzero_requests_total{endpoint="inference",function_name="my_function",model_name="gpt-4o-mini-2024-07-18"} 1
tensorzero_requests_total{endpoint="feedback",metric_name="draft_accepted"} 10
```
# Extend TensorZero
Source: https://www.tensorzero.com/docs/operations/extend-tensorzero
Learn how to extend or override TensorZero to access provider features we don't support out of the box.
TensorZero aims to provide a great developer experience while giving you full access to the underlying capabilities of each model provider.
We provide advanced features that let you customize requests and access provider-specific functionality that isn't directly supported in TensorZero.
You shouldn't need these features most of the time, but they're around if necessary.
Is there something you weren't able to do with TensorZero?
Please let us know and we'll try to tackle it: not just the specific case, but a general solution for that class of workflow.
## Features
### `extra_body`
You can use the `extra_body` field to override the request body that TensorZero sends to model providers.
You can set `extra_body` on a variant configuration block, a model provider configuration block, or at inference time.
See [Configuration Reference](/gateway/configuration-reference/) and [Inference API Reference](/gateway/api-reference/inference/) for more details.
### `extra_headers`
You can use the `extra_headers` field to override the request headers that TensorZero sends to model providers.
You can set `extra_headers` on a variant configuration block, a model provider configuration block, or at inference time.
See [Configuration Reference](/gateway/configuration-reference/) and [Inference API Reference](/gateway/api-reference/inference/) for more details.
### `include_raw_response`
If you enable this feature while running inference, the gateway will return the raw response from the model provider along with the TensorZero response.
See [Inference API Reference](/gateway/api-reference/inference/) for more details.
### TensorZero Data
TensorZero stores all its data on your own ClickHouse database.
You can query this data directly by running SQL queries against your ClickHouse instance.
If you're feeling particularly adventurous, you can also write to ClickHouse directly (though you should be careful when upgrading your TensorZero deployment to account for any database migrations).
See [Data model](/gateway/data-model/) for more details.
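For example, here's a minimal sketch of querying your ClickHouse database from Python with `clickhouse-connect`. The connection details are placeholders, and the table and column names (`ChatInference`, `function_name`) are assumptions based on the data model docs, so check the data model reference for the authoritative schema:
```python theme={null}
import clickhouse_connect

# Placeholders: point this at your own ClickHouse deployment and database.
client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username="chuser",
    password="chpassword",
    database="tensorzero",
)

# Count inferences per function (table/column names assumed from the data model docs).
result = client.query(
    "SELECT function_name, count() AS inferences "
    "FROM ChatInference GROUP BY function_name ORDER BY inferences DESC"
)
for function_name, inferences in result.result_rows:
    print(function_name, inferences)
```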
## Example: Anthropic Computer Use
At the time of writing, TensorZero hadn't integrated with Anthropic's Computer Use features directly — but they worked out of the box!
Concretely, Anthropic Computer Use requires setting additional fields to the request body as well as a request header.
Let's define a TensorZero function that includes these additional parameters:
```toml theme={null}
[functions.bash_assistant]
type = "chat"
[functions.bash_assistant.variants.anthropic_claude_4_5_sonnet_20250929]
type = "chat_completion"
model = "anthropic::claude-sonnet-4-5"
max_tokens = 2048
extra_body = [
{ pointer = "/tools", value = [{ type = "bash_20250124", name = "bash" }] },
{ pointer = "/ultrathinking", value = { type = "enabled", budget_tokens = 1024 } }, # made-up parameter
]
extra_headers = [
{ name = "anthropic-beta", value = "computer-use-2025-01-24" },
]
```
This example illustrates how you should be able to use the vast majority of features supported by the model provider even if TensorZero doesn't have explicit support for them yet.
# Manage credentials (API keys)
Source: https://www.tensorzero.com/docs/operations/manage-credentials
Learn how to manage credentials (API keys) in TensorZero.
This guide explains how to manage credentials (API keys) in the TensorZero Gateway.
Typically, the TensorZero Gateway will look for credentials like API keys using standard environment variables.
The gateway will load credentials from the environment variables on startup, and your application doesn't need to have access to the credentials.
That said, you can customize this behavior by setting alternative credential locations for each provider.
For example, you can provide credentials dynamically at inference time, or set alternative static credentials for each provider (e.g. to use multiple API keys for the same provider).
## Default Behavior
By default, the TensorZero Gateway will look for credentials in the following environment variables:
| Model Provider | Default Credential |
| ----------------------------------------------------------------------------------- | ----------------------------- |
| [Anthropic](/integrations/model-providers/anthropic/) | `ANTHROPIC_API_KEY` |
| [AWS Bedrock](/integrations/model-providers/aws-bedrock/) | Uses AWS SDK credentials |
| [AWS SageMaker](/integrations/model-providers/aws-sagemaker/) | Uses AWS SDK credentials |
| [Azure](/integrations/model-providers/azure/) | `AZURE_API_KEY` |
| [Deepseek](/integrations/model-providers/deepseek/) | `DEEPSEEK_API_KEY` |
| [Fireworks](/integrations/model-providers/fireworks/) | `FIREWORKS_API_KEY` |
| [GCP Vertex AI (Anthropic)](/integrations/model-providers/gcp-vertex-ai-anthropic/) | `GCP_VERTEX_CREDENTIALS_PATH` |
| [GCP Vertex AI (Gemini)](/integrations/model-providers/gcp-vertex-ai-gemini/) | `GCP_VERTEX_CREDENTIALS_PATH` |
| [Google AI Studio (Gemini)](/integrations/model-providers/google-ai-studio-gemini/) | `GOOGLE_API_KEY` |
| [Groq](/integrations/model-providers/groq/) | `GROQ_API_KEY` |
| [Hyperbolic](/integrations/model-providers/hyperbolic/) | `HYPERBOLIC_API_KEY` |
| [Mistral](/integrations/model-providers/mistral/) | `MISTRAL_API_KEY` |
| [OpenAI](/integrations/model-providers/openai/) | `OPENAI_API_KEY` |
| [OpenAI-Compatible](/integrations/model-providers/openai-compatible/) | `OPENAI_API_KEY` |
| [OpenRouter](/integrations/model-providers/openrouter/) | `OPENROUTER_API_KEY` |
| [SGLang](/integrations/model-providers/sglang/) | `SGLANG_API_KEY` |
| [Text Generation Inference (TGI)](/integrations/model-providers/tgi/) | None |
| [Together](/integrations/model-providers/together/) | `TOGETHER_API_KEY` |
| [vLLM](/integrations/model-providers/vllm/) | None |
| [xAI](/integrations/model-providers/xai/)                                            | `XAI_API_KEY`                 |
## Customizing Credential Management
You can customize the source of credentials for each provider.
See [Configuration Reference](/gateway/configuration-reference/) (e.g. `api_key_location`) for more information on the different ways to configure credentials for each provider.
Also see the relevant provider guides for more information on how to configure credentials for each provider.
### Static Credentials
You can set alternative static credentials for each provider.
For example, let's say we want to use a different environment variable for an OpenAI provider.
We can customize the variable name by setting `api_key_location` to `env::MY_OTHER_OPENAI_API_KEY`.
```toml theme={null}
[models.gpt_4o_mini.providers.my_other_openai]
type = "openai"
api_key_location = "env::MY_OTHER_OPENAI_API_KEY"
# ...
```
At startup, the TensorZero Gateway will look for the `MY_OTHER_OPENAI_API_KEY` environment variable and use that value for the API key.
#### Load Balancing Between Multiple Credentials
You can load balance between different API keys for the same provider by defining multiple variants and models.
For example, the configuration below will split the traffic between two different OpenAI API keys, `OPENAI_API_KEY_1` and `OPENAI_API_KEY_2`.
```toml theme={null}
[models.gpt_4o_mini_1]
routing = ["openai"]
[models.gpt_4o_mini_1.providers.openai]
type = "openai"
model_name = "gpt-4o-mini"
api_key_location = "env::OPENAI_API_KEY_1"
[models.gpt_4o_mini_2]
routing = ["openai"]
[models.gpt_4o_mini_2.providers.openai]
type = "openai"
model_name = "gpt-4o-mini"
api_key_location = "env::OPENAI_API_KEY_2"
[functions.generate_haiku]
type = "chat"
[functions.generate_haiku.variants.gpt_4o_mini_1]
type = "chat_completion"
model = "gpt_4o_mini_1"
[functions.generate_haiku.variants.gpt_4o_mini_2]
type = "chat_completion"
model = "gpt_4o_mini_2"
```
You can use the same principle to set up fallbacks between different API keys for the same provider.
See [Retries & Fallbacks](/gateway/guides/retries-fallbacks/) for more information on how to configure retries and fallbacks.
### Dynamic Credentials
You can provide API keys dynamically at inference time.
To do this, you can use the `dynamic::` prefix in the relevant credential field in the provider configuration.
For example, let's say we want to provide dynamic API keys for the OpenAI provider.
```toml {7} theme={null}
[models.user_gpt_4o_mini]
routing = ["openai"]
[models.user_gpt_4o_mini.providers.openai]
type = "openai"
model_name = "gpt-4o-mini"
api_key_location = "dynamic::customer_openai_api_key"
```
At inference time, you can provide the API key in the `credentials` argument.
```python {14-16} theme={null}
from tensorzero import TensorZeroGateway
with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
response = client.inference(
function_name="generate_haiku",
input={
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
]
},
credentials={
"customer_openai_api_key": "sk-..."
}
)
print(response)
```
### Configure credential fallbacks
You can configure fallback credentials that will be used automatically if the primary credential fails.
This is particularly useful when calling functions and models that require dynamic credentials from the TensorZero UI, since the UI can fall back to the static credentials instead.
To configure a fallback, use an object with `default` and `fallback` fields instead of a simple string:
```toml {7} theme={null}
[models.gpt_4o_mini]
routing = ["openai"]
[models.gpt_4o_mini.providers.openai]
type = "openai"
model_name = "gpt-4o-mini"
api_key_location = { default = "dynamic::customer_openai_api_key", fallback = "env::OPENAI_API_KEY" }
```
At inference time, the gateway will first try to use the dynamic credential.
If that fails, it will automatically fall back to the environment variable.
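For instance, with the configuration above, a client that doesn't supply the dynamic credential will transparently use the static key. Here's a minimal sketch using the TensorZero Python client (assuming a `generate_haiku` function backed by the `gpt_4o_mini` model above and a gateway running on `localhost:3000`):
```python theme={null}
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    response = client.inference(
        function_name="generate_haiku",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "Write a haiku about TensorZero.",
                }
            ]
        },
        # No `credentials` provided: the gateway falls back to `env::OPENAI_API_KEY`
    )
    print(response)
```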
### Set default credentials for a provider type
Most model providers have default credential locations.
For example, OpenAI's `api_key_location` defaults to `env::OPENAI_API_KEY`.
These credentials apply to the default function and shorthand models (e.g. calling the model `openai::gpt-5`).
You can override the default location for a particular provider using `[provider_types.YOUR_PROVIDER_TYPE.defaults]`.
For example, we can override the default location for the OpenAI provider type to require a dynamic API key:
```toml title="tensorzero.toml" theme={null}
[provider_types.openai.defaults]
api_key_location = "dynamic::customer_openai_api_key"
# ...
```
Unless otherwise specified, every model provider of type `openai` will require the `customer_openai_api_key` credential.
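For example, here's a minimal sketch of how a client would then call a shorthand OpenAI model through the default function while supplying the dynamic credential (assuming a gateway running on `localhost:3000`):
```python theme={null}
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    response = client.inference(
        # Shorthand model: uses the `openai` provider type defaults
        model_name="openai::gpt-5",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "Tell me a fun fact.",
                }
            ]
        },
        # Required because `provider_types.openai.defaults` points at a dynamic credential
        credentials={"customer_openai_api_key": "sk-..."},
    )
    print(response)
```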
See the [Configuration Reference](/gateway/configuration-reference) for more details.
### Centralize auth across multiple TensorZero deployments
If you have multiple TensorZero deployments (e.g. one per team), you can centralize credential management using gateway relay.
With gateway relay, an LLM inference request can be routed through multiple independent TensorZero Gateway deployments before reaching a model provider.
This enables you to enforce organization-wide controls without restricting how teams build their LLM features.
See [Centralize auth, rate limits, and more](/operations/centralize-auth-rate-limits-and-more) for details.
# Organize your configuration
Source: https://www.tensorzero.com/docs/operations/organize-your-configuration
Learn best practices for organizing your configuration as your project grows in complexity.
You can use custom configuration to take full advantage of TensorZero's features.
See [Configuration Reference](/gateway/configuration-reference) for more details.
This guide shares best practices for organizing your configuration as your project grows in complexity.
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/operations/organize-your-configuration) of this section on GitHub.
## Split your configuration into multiple files
As your project grows in complexity, it might be a good idea to split your configuration into multiple files.
This makes it easier to manage and maintain your configuration.
For example, you can create separate TOML files for different projects, environments, and so on.
You can also move deprecated entries like functions to a separate file.
You can instruct the TensorZero Gateway to load multiple configuration files by setting the CLI flag `--config-file` to a glob pattern that matches all the relevant TOML files, e.g. `--config-file path/to/**/*.toml`.
Under the hood, TensorZero will concatenate the configuration files, with special handling for paths.
For example, you can declare a model in one file and use it in a variant declared in another file.
If the configuration includes a path (e.g. template, schema), the path will be resolved relative to that configuration file's directory.
For example:
```toml theme={null}
[functions.my_function.variants.my_variant]
# ...
templates.my_template.path = "path/to/template.minijinja" # relative to this TOML file
# ...
```
## Enable template file system access to reuse shared snippets
You can decompose your templates into smaller, reusable snippets.
This makes it easier to maintain and reuse code across multiple templates.
Templates can reference other templates using the MiniJinja directives `{% include %}` and `{% import %}`.
To use these directives, set `gateway.template_filesystem_access.base_path` in your configuration file.
By default, file system access is disabled for security reasons, since template imports are evaluated dynamically and could potentially access sensitive files.
You should ensure that only trusted templates are allowed access to the file system.
```toml theme={null}
[gateway]
# ...
template_filesystem_access.base_path = "."
# ...
```
Template imports are resolved relative to `base_path`.
If `base_path` itself is relative, it's relative to the configuration file in which it's defined.
# Set up auth for TensorZero
Source: https://www.tensorzero.com/docs/operations/set-up-auth-for-tensorzero
Learn how to set up TensorZero API keys to authenticate your inference requests and manage access control for your workflows.
You can create TensorZero API keys to authenticate your requests to the TensorZero Gateway.
This way, your clients don't need access to model provider credentials, making it easier to manage access and security.
This page shows how to:
* Create API keys for the TensorZero Gateway
* Require clients to use these API keys for requests
* Manage and disable API keys
TensorZero supports authentication for the gateway.
Authentication for the UI is coming soon.
In the meantime, we recommend pairing the UI with complementary products like Nginx, OAuth2 Proxy, or Tailscale.
## Configure
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/operations/set-up-auth-for-tensorzero) of this guide on GitHub.
You can instruct the TensorZero Gateway to require authentication in the configuration:
```toml title="tensorzero.toml" theme={null}
[gateway]
auth.enabled = true
```
With this setting, every gateway endpoint except for `/status` and `/health` will require authentication.
You must set up Postgres to use TensorZero's authentication features.
* [Deploy the TensorZero Gateway](/deployment/tensorzero-gateway)
* [Deploy the TensorZero UI](/deployment/tensorzero-ui)
* [Deploy ClickHouse](/deployment/clickhouse)
* [Deploy Postgres](/deployment/postgres)
You can deploy all the requirements using the Docker Compose file below:
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
clickhouse:
image: clickhouse:lts
environment:
CLICKHOUSE_USER: chuser
CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: 1
CLICKHOUSE_PASSWORD: chpassword
ports:
- "8123:8123" # HTTP port
- "9000:9000" # Native port
volumes:
- clickhouse-data:/var/lib/clickhouse
ulimits:
nofile:
soft: 262144
hard: 262144
healthcheck:
test: wget --spider --tries 1 http://chuser:chpassword@clickhouse:8123/ping
start_period: 30s
start_interval: 1s
timeout: 1s
postgres:
image: postgres:14-alpine
environment:
POSTGRES_DB: tensorzero
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
ports:
- "5432:5432"
volumes:
- postgres-data:/var/lib/postgresql/data
healthcheck:
test: pg_isready -U postgres
start_period: 30s
start_interval: 1s
timeout: 1s
gateway:
image: tensorzero/gateway
volumes:
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
TENSORZERO_CLICKHOUSE_URL: http://chuser:chpassword@clickhouse:8123/tensorzero
TENSORZERO_POSTGRES_URL: postgres://postgres:postgres@postgres:5432/tensorzero
OPENAI_API_KEY: ${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
healthcheck:
test: wget --spider --tries 1 http://localhost:3000/status
start_period: 30s
start_interval: 1s
timeout: 1s
depends_on:
clickhouse:
condition: service_healthy
postgres:
condition: service_healthy
ui:
image: tensorzero/ui
environment:
TENSORZERO_POSTGRES_URL: postgres://postgres:postgres@postgres:5432/tensorzero
TENSORZERO_GATEWAY_URL: http://gateway:3000
ports:
- "4000:4000"
depends_on:
clickhouse:
condition: service_healthy
gateway:
condition: service_healthy
volumes:
postgres-data:
clickhouse-data:
```
You can create API keys using the TensorZero UI.
If you're running a standard local deployment, visit `http://localhost:4000/api-keys` to create a key.
Alternatively, you can create API keys from the CLI by running the gateway binary with the `--create-api-key` flag.
For example:
```bash theme={null}
docker compose run --rm gateway --create-api-key
```
The API key is a secret and should be kept secure.
Once you've created an API key, set the `TENSORZERO_API_KEY` environment variable.
You can make authenticated requests by setting the `api_key` parameter in your TensorZero client:
```python title="tensorzero_sdk.py" theme={null}
import os
from tensorzero import TensorZeroGateway
t0 = TensorZeroGateway.build_http(
api_key=os.environ["TENSORZERO_API_KEY"],
gateway_url="http://localhost:3000",
)
response = t0.inference(
model_name="openai::gpt-5-mini",
input={
"messages": [
{
"role": "user",
"content": "Tell me a fun fact.",
}
]
},
)
print(response)
```
The client will automatically read the `TENSORZERO_API_KEY` environment variable if you don't set `api_key`.
Authentication is not supported in the embedded (in-memory) gateway in Python.
Please use the HTTP client with a standalone gateway to make authenticated requests.
You can make authenticated requests by setting the `api_key` parameter in your OpenAI client:
```python title="openai_sdk.py" theme={null}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ["TENSORZERO_API_KEY"],
base_url="http://localhost:3000/openai/v1",
)
response = client.chat.completions.create(
model="tensorzero::model_name::openai::gpt-5-mini",
messages=[
{
"role": "user",
"content": "Tell me a fun fact.",
}
],
)
print(response)
```
Authentication is not supported in the embedded (in-memory) gateway in Python.
Please use the HTTP client with a standalone gateway to make authenticated requests.
You can make authenticated requests by setting the `apiKey` parameter in your OpenAI client:
```ts title="openai_sdk.ts" theme={null}
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.TENSORZERO_API_KEY,
baseURL: "http://localhost:3000/openai/v1",
});
const response = await client.chat.completions.create({
model: "tensorzero::model_name::openai::gpt-5-mini",
messages: [
{
role: "user",
content: "Tell me a fun fact.",
},
],
});
```
You can make authenticated requests by setting the `Authorization` HTTP header to `Bearer`, followed by your TensorZero API key:
```bash title="curl.sh" theme={null}
curl -X POST http://localhost:3000/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${TENSORZERO_API_KEY}" \
-d '{
"model": "tensorzero::model_name::openai::gpt-5-mini",
"messages": [
{
"role": "user",
"content": "Tell me a fun fact."
}
]
}'
```
You can manage and disable API keys in the TensorZero UI.
If you're running a standard local deployment, visit `http://localhost:4000/api-keys` to manage your keys.
Alternatively, you can disable API keys from the CLI by running the gateway binary with the `--disable-api-key` flag.
Pass the public ID of the key you want to disable (the 12-character portion after `sk-t0-`).
For example:
```bash theme={null}
docker compose run --rm gateway --disable-api-key xxxxxxxxxxxx
```
## Advanced
### Customize the gateway's authentication cache
By default, the TensorZero Gateway caches authentication database queries for one second.
You can customize this behavior in the configuration:
```toml theme={null}
[gateway.auth.cache]
enabled = true # boolean
ttl_ms = 60_000 # one minute
```
### Set up rate limiting by API key
Once you have authentication enabled, you can apply rate limits on a per-API-key basis using the `api_key_public_id` scope in your rate limiting rules.
This allows you to enforce different usage limits for different API keys, which is useful for implementing tiered access or preventing individual keys from consuming too many resources.
TensorZero API keys have the following format:
`sk-t0-xxxxxxxxxxxx-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy`
The `xxxxxxxxxxxx` portion is the 12-character public ID that you can use in rate limiting rules.
The remaining portion of the key is secret and should be kept secure.
For example, you can limit each API key to 100 model inferences per hour, but allow a specific API key to make 1000 inferences:
```toml theme={null}
# Each API key can make up to 100 model inferences per hour
[[rate_limiting.rules]]
priority = 0
model_inferences_per_hour = 100
scope = [
{ api_key_public_id = "tensorzero::each" }
]
# But override the limit for a specific API key
[[rate_limiting.rules]]
priority = 1
model_inferences_per_hour = 1000
scope = [
{ api_key_public_id = "xxxxxxxxxxxx" }
]
```
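If you generate this configuration programmatically, here's a small sketch of how you might derive the public ID from a full API key; it only relies on the `sk-t0-` prefix and 12-character public ID described above (the helper name is illustrative, not part of TensorZero's API):
```python theme={null}
PREFIX = "sk-t0-"

def api_key_public_id(api_key: str) -> str:
    """Extract the 12-character public ID from a TensorZero API key."""
    if not api_key.startswith(PREFIX):
        raise ValueError("Not a TensorZero API key")
    # The public ID is the 12 characters immediately after the `sk-t0-` prefix
    return api_key[len(PREFIX) : len(PREFIX) + 12]

# Use the result as an `api_key_public_id` scope value or pass it to `--disable-api-key`
print(api_key_public_id("sk-t0-xxxxxxxxxxxx-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"))
```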
See [Enforce custom rate limits](/operations/enforce-custom-rate-limits) for more details on configuring rate limits with API keys.
### Centralize auth across multiple TensorZero deployments
If you have multiple TensorZero deployments (e.g. one per team), you can centralize auth using gateway relay.
With gateway relay, an LLM inference request can be routed through multiple independent TensorZero Gateway deployments before reaching a model provider.
This enables you to enforce organization-wide controls without restricting how teams build their LLM features.
See [Centralize auth, rate limits, and more](/operations/centralize-auth-rate-limits-and-more) for details.
# Dynamic In-Context Learning (DICL)
Source: https://www.tensorzero.com/docs/optimization/dynamic-in-context-learning-dicl
Learn how to use Dynamic In-Context Learning to optimize your LLM applications.
Dynamic In-Context Learning (DICL) is an inference-time optimization that improves LLM performance by incorporating relevant historical examples into your prompt.
Instead of incorporating static examples manually in your prompts, DICL selects the most relevant examples at inference time.
Here's how it works:
0. Before inference: You curate examples of good LLM behavior. TensorZero embeds them using an embedding model and stores them in your database.
1. TensorZero embeds inference inputs before sending them to the LLM and retrieves similar curated examples from your database.
2. TensorZero inserts these examples into your prompt and sends the request to the LLM.
3. The LLM generates a response using the enhanced prompt.
## When should you use DICL?
DICL is particularly useful if you have limited high-quality data.
| Criterion | Impact | Details |
| -------------------- | -------- | -------------------------------------------------------- |
| Complexity | Low | Requires data curation; few parameters |
| Data Efficiency | High | Achieves good results with limited data |
| Optimization Ceiling | Moderate | Plateaus quickly with more data; prompt only but dynamic |
| Optimization Cost | Low | Generates embeddings for curated examples |
| Inference Cost | High | Scales input tokens proportional to `k` |
| Inference Latency | Moderate | Requires embedding and retrieval before LLM call |
DICL tends to work best when:
* You have dozens to thousands of curated examples of good LLM behavior.
* If less: you should label a few dozen datapoints manually.
* If more: DICL still works well, but you should consider supervised fine-tuning instead.
* The inference inputs are reasonably sized. Large inputs inflate the context and limit `k` (see below), degrading performance.
* If prompts have a lot of boilerplate: see [configure prompt templates](/gateway/create-a-prompt-template) to mitigate impact.
* If still very large: consider supervised fine-tuning instead.
* Inference cost (and to a lesser extent, latency) is not a bottleneck. Optimization is relatively cheap (generating embeddings), but DICL materially increases input tokens at inference time.
* If inference cost matters: consider supervised fine-tuning instead, which shifts the marginal cost to a one-time optimization workflow.
## Optimize your LLM inferences with Dynamic In-Context Learning
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/optimization/dynamic-in-context-learning-dicl) of this guide on GitHub.
Define a [function with a baseline variant](/gateway/configure-functions-and-variants) for your application.
```toml title="tensorzero.toml" theme={null}
[functions.extract_entities]
type = "json"
output_schema = "functions/extract_entities/output_schema.json"
[functions.extract_entities.variants.baseline]
type = "chat_completion"
model = "openai::gpt-5-mini"
templates.system.path = "functions/extract_entities/initial_prompt/system_template.minijinja"
json_mode = "strict"
```
If your prompt has a lot of boilerplate, [configure prompt templates](/gateway/create-a-prompt-template). DICL operates on template variables, so using templates improves retrieval (and therefore inference quality) and mitigates the marginal cost and latency. Move the boilerplate into `system_instructions` in your variant configuration instead.
```text title="system_template.minijinja" theme={null}
You are an assistant that is performing a named entity recognition task.
Your job is to extract entities from a given text.
The entities you are extracting are:
- people
- organizations
- locations
- miscellaneous other entities
Please return the entities in the following JSON format:
{
"person": ["person1", "person2", ...],
"organization": ["organization1", "organization2", ...],
"location": ["location1", "location2", ...],
"miscellaneous": ["miscellaneous1", "miscellaneous2", ...]
}
```
After deploying the [TensorZero Gateway](/deployment/tensorzero-gateway) with [ClickHouse](/deployment/clickhouse), [build a dataset](/gateway/api-reference/datasets-datapoints) of good examples for the `extract_entities` function you configured.
You can create datapoints from historical inferences or external/synthetic datasets.
```python theme={null}
from tensorzero import ListDatapointsRequest
datapoints = t0.list_datapoints(
dataset_name="extract_entities_dataset",
request=ListDatapointsRequest(
function_name="extract_entities",
),
)
rendered_samples = t0.experimental_render_samples(
stored_samples=datapoints.datapoints,
variants={"extract_entities": "baseline"},
)
```
After deploying the [TensorZero Gateway](/deployment/tensorzero-gateway) with [ClickHouse](/deployment/clickhouse), make [inference calls](/gateway/call-any-llm) to the `extract_entities` function you configured.
TensorZero automatically collects structured data about those inferences, which can later be used as training examples for DICL.
You can curate good examples in multiple ways:
* **Collecting demonstrations:** Collect demonstrations of good behavior (or labels) from human annotation or other sources.
* **Filtering with metrics:** Query inferences that scored well on your metrics (e.g. `output_source="inference"` with a filter for high scores).
* **Examples from an expensive model:** Run inferences with a powerful model (e.g. GPT-5) and use those outputs as demonstrations for a smaller model (e.g. GPT-5 Mini).
DICL's performance degrades as the curated examples get noisier with instances of bad behavior.
There is a trade-off between dataset size and datapoint quality.
For this example, we'll use demonstrations.
You can submit demonstration feedback using the `demonstration` metric:
```python theme={null}
t0.feedback(
metric_name="demonstration",
value=corrected_output, # Provide the ideal output for that inference
inference_id=response.inference_id,
)
```
Then, query inferences with `output_source="demonstration"` to get examples where the output has been corrected:
```python theme={null}
from tensorzero import ListInferencesRequest
inferences_response = t0.list_inferences(
request=ListInferencesRequest(
function_name="extract_entities",
output_source="demonstration", # Retrieve demonstrations instead of historical outputs
),
)
rendered_samples = t0.experimental_render_samples(
stored_samples=inferences_response.inferences,
variants={"extract_entities": "baseline"},
)
```
Configure DICL by specifying the name of your function, variant, and embedding model.
```python theme={null}
from tensorzero import DICLOptimizationConfig
optimization_config = DICLOptimizationConfig(
function_name="extract_entities",
variant_name="dicl",
embedding_model="openai::text_embedding_3_small",
k=10, # how many examples are retrieved and injected as context
model="openai::gpt-5-mini", # LLM that will generate outputs using the retrieved examples
)
```
You can also [define a custom embedding model in your configuration](/gateway/generate-embeddings#define-a-custom-embedding-model).
You should experiment with different choices of `k`.
Typical values are 3-10, with smaller values when inputs tend to be larger.
If you see inferences retrieving irrelevant examples, consider setting a `max_distance` in your variant configuration later. With this setting, the retrieval step can return fewer than `k` examples if some candidates exceed the cosine distance threshold. Make sure to tune the value according to your embedding model.
You can now launch your DICL optimization job using the TensorZero Gateway:
```python theme={null}
job_handle = t0.experimental_launch_optimization(
train_samples=rendered_samples,
optimization_config=optimization_config,
)
job_info = t0.experimental_poll_optimization(
job_handle=job_handle
)
```
DICL will embed all your training samples and store them in ClickHouse.
After optimization completes, add the DICL variant to your configuration:
```toml title="tensorzero.toml" theme={null}
[functions.extract_entities.variants.dicl]
type = "experimental_dynamic_in_context_learning"
embedding_model = "openai::text_embedding_3_small"
k = 10
model = "openai::gpt-5-mini"
json_mode = "strict"
```
The `embedding_model` in the configuration must match the embedding model you used during optimization.
That's it!
At inference time, the DICL variant will retrieve the `k` most similar examples from your training data and include them as context for in-context learning.
You can run experiments comparing your baseline and DICL variants using [adaptive A/B testing](/experimentation/run-adaptive-ab-tests).
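For a quick spot check before running a full experiment, you can pin the new variant on individual requests. Here's a minimal sketch using the TensorZero Python client (assuming the `extract_entities` function above and a gateway running on `localhost:3000`; pinning a variant with `variant_name` is meant for testing, not production traffic):
```python theme={null}
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    response = client.inference(
        function_name="extract_entities",
        variant_name="dicl",  # pin the DICL variant; omit to let the gateway pick a variant
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "Acme Corp hired Jane Doe in Berlin last week.",
                }
            ]
        },
    )
    print(response)
```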
## `DICLOptimizationConfig`
Configure DICL optimization by creating a `DICLOptimizationConfig` object with the following parameters:
* `embedding_model`: Name of the embedding model to use.
* `function_name`: Name of the TensorZero function to optimize.
* `variant_name`: Name to use for the DICL variant.
* `model`: Model to use for the DICL variant.
* Whether to append to existing variants. If `false`, raises an error if the variant already exists.
* Batch size for embedding generation.
* Embedding dimensions. If not specified, uses the embedding model's default.
* `k`: Number of nearest neighbors to retrieve at inference time.
* Maximum concurrent embedding requests.
# GEPA
Source: https://www.tensorzero.com/docs/optimization/gepa
Learn how to use automated prompt engineering to optimize your LLM applications.
[GEPA](https://arxiv.org/abs/2507.19457) is an automated prompt engineering algorithm that iteratively refines your [prompt templates](/gateway/create-a-prompt-template) based on an [inference evaluation](/evaluations/inference-evaluations/tutorial).
You can run GEPA using TensorZero to optimize the prompt templates of any [TensorZero function](/gateway/configure-functions-and-variants).
GEPA works by repeatedly sampling prompt templates, running evaluations, having an LLM analyze what went well or poorly, and then having an LLM mutate the prompt template based on that analysis.
Mutated templates that improve on the evaluation metrics define a Pareto frontier and can be sampled at later iterations for further refinement.
## When should you use GEPA?
GEPA is particularly useful if you have high-quality inference evaluations to optimize against.
| Criterion | Impact | Details |
| -------------------- | -------- | ---------------------------------------------------------- |
| Complexity | Moderate | Requires inference evaluation and prompt templates |
| Data Efficiency | High | Achieves good results with limited data |
| Optimization Ceiling | Moderate | Limited to static prompt improvements |
| Optimization Cost | Moderate | Requires many evaluation runs |
| Inference Cost | Low | Generated prompt templates tend to be longer than original |
| Inference Latency | Low | Generated prompt templates tend to be longer than original |
## Optimize your prompt templates with GEPA
You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/optimization/gepa/) of this guide on GitHub.
Define a function and variant for your application.
The variant must have at least one prompt template (e.g. the LLM system instructions).
```toml title="tensorzero.toml" theme={null}
[functions.extract_entities]
type = "json"
output_schema = "functions/extract_entities/output_schema.json"
[functions.extract_entities.variants.baseline]
type = "chat_completion"
model = "openai::gpt-5-mini-2025-08-07"
templates.system.path = "functions/extract_entities/initial_prompt/system_template.minijinja"
json_mode = "strict"
```
```text title="system_template.minijinja" theme={null}
You are an assistant that is performing a named entity recognition task.
Your job is to extract entities from a given text.
The entities you are extracting are:
- people
- organizations
- locations
- miscellaneous other entities
Please return the entities in the following JSON format:
{
"person": ["person1", "person2", ...],
"organization": ["organization1", "organization2", ...],
"location": ["location1", "location2", ...],
"miscellaneous": ["miscellaneous1", "miscellaneous2", ...]
}
```
After deploying the [TensorZero Gateway](/deployment/tensorzero-gateway) with [ClickHouse](/deployment/clickhouse), [build a dataset](/gateway/api-reference/datasets-datapoints) for the `extract_entities` function you configured.
You can create datapoints from historical inferences or external/synthetic datasets.
```python theme={null}
from tensorzero import ListDatapointsRequest
datapoints = t0.list_datapoints(
dataset_name="extract_entities_dataset",
request=ListDatapointsRequest(
function_name="extract_entities",
),
)
rendered_samples = t0.experimental_render_samples(
stored_samples=datapoints.datapoints,
variants={"extract_entities": "baseline"},
)
```
After deploying the [TensorZero Gateway](/deployment/tensorzero-gateway) with [ClickHouse](/deployment/clickhouse), make [inference calls](/gateway/call-any-llm) to the `extract_entities` function you configured.
TensorZero automatically collects structured data about those inferences, which can later be used as training examples for GEPA.
```python theme={null}
from tensorzero import ListInferencesRequest
inferences_response = t0.list_inferences(
request=ListInferencesRequest(
function_name="extract_entities",
output_source="inference",
),
)
rendered_samples = t0.experimental_render_samples(
stored_samples=inferences_response.inferences,
variants={"extract_entities": "baseline"},
)
```
GEPA requires two data splits: training (for template mutation) and validation (for Pareto frontier estimation).
Let's split the samples you queried above into training and validation sets using Python's `random` module:
```python theme={null}
import random
random.shuffle(rendered_samples)
split_idx = len(rendered_samples) // 2
train_samples = rendered_samples[:split_idx]
val_samples = rendered_samples[split_idx:]
```
GEPA template refinement is guided by evaluator scores.
Define an [Inference Evaluation](/evaluations/inference-evaluations/tutorial) in your TensorZero configuration.
To demonstrate that GEPA works even with noisy evaluators, we don't provide demonstrations (labels), only an LLM judge.
```toml title="tensorzero.toml" theme={null}
[evaluations.extract_entities_eval]
type = "inference"
function_name = "extract_entities"
[evaluations.extract_entities_eval.evaluators.judge_improvement]
type = "llm_judge"
output_type = "float"
include = { reference_output = true }
optimize = "max"
description = "Compares generated output against reference output for NER quality. Scores: 1 (better), 0 (similar), -1 (worse). Evaluates: correctness (only proper nouns, no common nouns/numbers/metadata), schema compliance, completeness, verbatim entity extraction (exact spelling/capitalization), and absence of duplicate entities."
[evaluations.extract_entities_eval.evaluators.judge_improvement.variants.baseline]
type = "chat_completion"
model = "openai::gpt-5-mini"
system_instructions = "evaluations/extract_entities/judge_improvement/system_instructions.txt"
json_mode = "strict"
```
```text title="system_instructions.txt" theme={null}
You are an impartial grader for a Named Entity Recognition (NER) task.
You will receive **Input** (source text), **Generated Output**, and **Reference Output**.
Compare the generated output against the reference output and return a JSON object with a single key `score` whose value is **-1**, **0**, or **1**.
# Task Description
Extract named entities from text into four categories:
- **person**: Names of specific people
- **organization**: Names of companies, institutions, agencies, or groups
- **location**: Names of geographical locations (countries, cities, landmarks)
- **miscellaneous**: Other named entities (events, products, nationalities, etc.)
# Evaluation Criteria (in priority order)
## 1. Correctness
- Only **proper nouns** should be extracted (specific people, places, organizations, things)
- Do NOT extract: common nouns, category labels, numbers, statistics, metadata, or headers
- Ask: "Does this name a SPECIFIC instance rather than a general category?"
## 2. Verbatim Extraction
- Entities must appear **exactly** as written in the input text
- Preserve original spelling, capitalization, and formatting
- Altered or paraphrased entities are a regression
## 3. No Duplicates
- Each entity should appear **exactly once** in the output
- Exact duplicates (same string) are a regression
- Subset duplicates (e.g., both "Obama" and "Barack Obama") are a regression
## 4. Completeness
- All valid named entities from the input should be captured
- Missing entities are a regression
## 5. Correct Categorization
- Entities should be placed in the appropriate category
# Scoring
- **1 (better)**: Generated output is materially better than reference (fewer false positives/negatives, better adherence to criteria) without material regressions.
- **0 (similar)**: Outputs are comparable, differences are minor, or improvements are offset by regressions.
- **-1 (worse)**: Generated output is materially worse (more errors, missing entities, duplicates, or incorrect extractions).
Treat the reference as a baseline, not necessarily perfect. Reward genuine improvements.
# Output Format
Return **only**:
{
"score":
}
where value is **-1**, **0**, or **1**. No explanations or additional keys.
```
The `description` field of an LLM judge evaluator gives context to the GEPA analyst and mutation LLMs.
Let them know what is being scored and what the score means.
GEPA supports evaluations with any number of evaluators and any evaluator type (e.g. exact match, LLM judges).
Configure GEPA by specifying the name of your function and evaluation.
You are also free to choose the models used to analyze inferences and generate new templates.
The `analysis_model` reflects on individual inferences, reports on whether they are optimal, need improvement, or are erroneous, and provides suggestions for prompt template improvement.
The `mutation_model` generates new templates based on the collected analysis reports.
We recommend using strong models for these tasks.
```python theme={null}
from tensorzero import GEPAConfig
optimization_config = GEPAConfig(
function_name="extract_entities",
evaluation_name="extract_entities_eval",
analysis_model="openai::gpt-5.2",
mutation_model="openai::gpt-5.2",
initial_variants=["baseline"],
max_iterations=10,
max_tokens=16384,
)
```
GEPA optimization can take a while to run, so keep `max_iterations` relatively small.
You can manually iterate further by setting `initial_variants` with the result of a previous GEPA run.
You can now launch your GEPA optimization job using the TensorZero Gateway:
```python theme={null}
job_handle = t0.experimental_launch_optimization(
train_samples=train_samples,
val_samples=val_samples,
optimization_config=optimization_config,
)
job_info = t0.experimental_poll_optimization(
job_handle=job_handle
)
```
Review the generated templates and write them to your config directory:
```python theme={null}
variant_configs = job_info.output["content"]
for variant_name, variant_config in variant_configs.items():
print(f"\n# Optimized variant: {variant_name}")
for template_name, template in variant_config["templates"].items():
print(f"## '{template_name}' template:")
print(template["path"]["__data"])
```
Finally, add the new variant to your configuration.
```toml title="tensorzero.toml" theme={null}
[functions.extract_entities.variants.gepa_optimized]
type = "chat_completion"
model = "openai::gpt-5-mini-2025-08-07"
templates.system.path = "functions/extract_entities/gepa-iter-9-gepa-iter-6-gepa-iter-4-baseline/system_template.minijinja"
json_mode = "strict"
```
```text title="gepa-iter-9-gepa-iter-6-gepa-iter-4-baseline/system_template.minijinja" theme={null}
You are an assistant performing **strict Named Entity Recognition (NER)**.
## Task
Given an input text, extract entity strings and place each extracted string into exactly one bucket:
- **person**: named individuals (e.g., “Gloria Steinem”, “D. Cox”, “I. Salisbury”)
- **organization**: companies, institutions, agencies, government bodies, teams/clubs, political/armed groups (e.g., “Ford”, “KDPI”, “Durham”, “Mujahideen Khalq”)
- **location**: named places (countries, cities, regions, geographic areas, venues) (e.g., “Paris”, “Weston-super-Mare”, “northern Iraq”)
- **miscellaneous**: named things that are not person/organization/location, such as **named events/competitions/tournaments/cups/leagues**, works of art, products, laws, etc. (e.g., “Cup Winners’ Cup”)
## Critical rules (follow exactly)
1. **Default = proper-nouns / unique names only**: Prefer true names (usually capitalized) over generic phrases.
- Exclude roles/descriptions like: “one dealer”, “the market”, “a company”, “summer holidays”.
- Exclude document/section labels/headers/field names like: “Income Statement Data”, “Balance Sheet”, “Table”, “Date”.
2. **Dataset edge-case (salient coined concepts) — allow sparingly**:
- If a **distinctive coined/defined concept phrase** appears as a referential label in context (often in quotes or clearly treated as “a thing”), you **may** include it in **miscellaneous** even if not capitalized.
- Example of what this rule allows: “... this **artificial atmosphere** is very dangerous ...” → miscellaneous may include ["artificial atmosphere"].
- Do **not** use this to extract ordinary noun phrases broadly; when unsure, **do not** add the phrase.
3. **No numbers/metrics/metadata**: Do **NOT** extract standalone numbers, percentages, quantities, rankings, or statistical fragments (e.g., “35,563”, “11.7 percent”, “6-3”, “6-2”, “326”) **unless they are part of an official name**.
- Sports note: scoring/status terms like “not out” and standalone run/score numbers are **not entities**.
4. **Verbatim spans (exact copy)**: Copy each entity **exactly as it appears in the text** (same spelling, capitalization, punctuation). Do not normalize, shorten, translate, or paraphrase.
5. **High recall for true entities**: Extract **ALL distinct entity mentions** that appear.
- Do **not** drop a specific mention in favor of a broader one (e.g., if “northern Iraq” appears, include “northern Iraq” rather than only “Iraq”).
6. **Capitalized collective group labels are entities (avoid over-pruning)**:
- Treat multiword group labels (political/ethnic/religious/armed/opposition groups) as entities when they function as a specific group name in context, **even if the head noun is generic** (e.g., “oppositions”, “rebels”, “forces”).
- Extract the full verbatim span as written.
- Example: “... between Mujahideen Khalq and the Iranian Kurdish oppositions ...” → organization includes ["Mujahideen Khalq", "Iranian Kurdish oppositions"].
7. **Geographic modifiers can be valid locations** when they denote a place/region in context.
- Examples to include as **location** when used as places: “northern Iraq”, “Iraqi Kurdish areas”.
8. **No guessing / no hallucinations**:
- Do not add implied entities that do not appear verbatim (e.g., do not add “Iran” if only “Iranian” appears).
- If the text contains no clear extractable entities, return empty arrays.
9. **Truncated / ellipsized input handling (strict gate)**:
- Add the literal sentinel string **"TRUNCATED_INPUT"** to **miscellaneous** **only** if the input contains an explicit ellipsis (“...”) or truncation marker, **OR** the text is so corrupted/incomplete that you **cannot confidently identify any** named entities.
- If the text is cut off but still contains clearly identifiable entities, extract those entities and **do NOT** add “TRUNCATED_INPUT”.
10. **No duplicates / no overlap**: Do not repeat the same string within a list, and do not place the same entity string in multiple categories.
## Output format
Return **only** a JSON object with exactly these keys and array-of-string values:
{
"person": [],
"organization": [],
"location": [],
"miscellaneous": []
}
## Mini examples
- Input: "Income Statement Data :" → {"person":[],"organization":[],"location":[],"miscellaneous":[]}
- Input: "Third was Ford with 35,563 registrations , or 11.7 percent ." → {"person":[],"organization":["Ford"],"location":[],"miscellaneous":[]}
- Input: "66 , M. Vaughan 57 ) v Lancashire ." → {"person":["M. Vaughan"],"organization":["Lancashire"],"location":[],"miscellaneous":[]}
- Input: "this artificial atmosphere is very dangerous ... \" Levy said ." → {"person":["Levy"],"organization":[],"location":[],"miscellaneous":["artificial atmosphere"]}
- Input: "A spokesman ... between Mujahideen Khalq and the Iranian Kurdish oppositions ..." → {"person":[],"organization":["Mujahideen Khalq","Iranian Kurdish oppositions"],"location":[],"miscellaneous":[]}
- Input: "The media ..." → {"person":[],"organization":[],"location":[],"miscellaneous":["TRUNCATED_INPUT"]}
- Input: "At Weston-super-Mare : Durham 326 ( D. Cox 95 not out ," → {"person":["D. Cox"],"organization":["Durham"],"location":["Weston-super-Mare"],"miscellaneous":[]}
- Sports guideline: teams/clubs → organization; competitions/tournaments/cups/leagues → miscellaneous
```
That's it!
You are now ready to deploy your GEPA-optimized LLM application!
GEPA returns a set of Pareto optimal variants based on the evaluation you defined.
You can roll out your new variants with confidence using [adaptive A/B testing](/experimentation/run-adaptive-ab-tests).
## `GEPAConfig`
Configure GEPA optimization by creating a `GEPAConfig` object with the following parameters:
* `analysis_model`: Model used to analyze inference results (e.g. `"anthropic::claude-sonnet-4-5"`).
* `evaluation_name`: Name of the evaluation used to score candidate variants.
* `function_name`: Name of the TensorZero function to optimize.
* `mutation_model`: Model used to generate prompt mutations (e.g. `"anthropic::claude-sonnet-4-5"`).
* Number of training samples to analyze per iteration.
* Whether to include inference input/output in the analysis passed to the mutation model. Useful for few-shot examples but can cause context overflow with long conversations or outputs.
* `initial_variants`: List of variant names to initialize GEPA with. If not specified, uses all variants defined for the function.
* Maximum number of concurrent inference calls.
* `max_iterations`: Maximum number of optimization iterations.
* `max_tokens`: Maximum tokens for analysis and mutation model calls. Required for Anthropic models.
* Retry configuration for inference calls during optimization.
* Random seed for reproducibility.
* Client timeout in seconds for TensorZero gateway operations.
* Prefix for naming newly generated variants.
# Overview
Source: https://www.tensorzero.com/docs/optimization/index
Learn more about using TensorZero Recipes to optimize your LLM applications.
TensorZero Recipes are a set of pre-built workflows for optimizing your LLM applications.
You can also create your own recipes to customize the workflow to your needs.
The [TensorZero Gateway](/gateway/) collects structured inference data and the downstream feedback associated with it.
This dataset sets the perfect foundation for building and optimizing LLM applications.
As this dataset builds up, you can use these recipes to generate powerful variants for your functions.
For example, you can use this dataset to curate data to fine-tune a custom LLM, or run an automated prompt engineering workflow.
In other words, TensorZero Recipes optimize TensorZero functions by generating new variants from historical inference and feedback data.
## Model Optimizations
### Supervised Fine-tuning
A fine-tuning recipe curates a dataset from your historical inferences and fine-tunes an LLM on it.
You can use the feedback associated with those inferences to select the right subset of data.
A simple example is to use only inferences that led to good outcomes according to a metric you defined.
We present sample fine-tuning recipes:
* [Fine-tuning with Fireworks AI](https://github.com/tensorzero/tensorzero/tree/main/recipes/supervised_fine_tuning/fireworks)
* [Fine-tuning with GCP Vertex AI Gemini](https://github.com/tensorzero/tensorzero/tree/main/recipes/supervised_fine_tuning/gcp-vertex-gemini/)
* [Fine-tuning with OpenAI](https://github.com/tensorzero/tensorzero/tree/main/recipes/supervised_fine_tuning/openai)
* [Fine-tuning with Together AI](https://github.com/tensorzero/tensorzero/tree/main/recipes/supervised_fine_tuning/together/)
* [Fine-tuning with Unsloth](https://github.com/tensorzero/tensorzero/tree/main/recipes/supervised_fine_tuning/unsloth/)
See complete examples using the recipes below.
### RLHF
#### DPO (Preference Fine-tuning)
A direct preference optimization (DPO) — also known as preference fine-tuning — recipe fine-tunes an LLM on a dataset of preference pairs.
You can use demonstration feedback collected with TensorZero to curate a dataset of preference pairs and fine-tune an LLM on it.
We present a sample DPO recipe for OpenAI:
* [DPO (Preference Fine-tuning) with OpenAI](https://github.com/tensorzero/tensorzero/blob/main/recipes/dpo/openai/)
Many more recipes are on the way. This will be our primary engineering focus in the coming months.
We also plan to publish a dashboard that'll further streamline some of these recipes (e.g. one-click fine-tuning).
Read more about our [Vision & Roadmap](/vision-and-roadmap/).
## Prompt Optimization
TensorZero offers prompt optimization recipes that automatically improve your prompts using historical inference data.
### GEPA
GEPA is an automated prompt optimization method that evolves prompts through iterative evaluation, analysis, and mutation.
It uses LLMs to analyze inference results and propose prompt improvements, then filters variants using Pareto frontier selection to balance multiple objectives.
See the [GEPA Guide](/optimization/gepa) to learn more.
### MIPRO
MIPRO (Multi-prompt Instruction PRoposal Optimizer) is a method for automatically improving system instructions and few-shot demonstrations in LLM applications — including ones with multiple LLM functions or calls.
MIPRO can optimize prompts across an entire LLM pipeline without needing fine-grained labels or gradients. Instead, it uses a Bayesian optimizer to figure out which instructions and demonstrations actually improve end-to-end performance. By combining application-aware prompt proposals and stochastic mini-batch evaluations, MIPRO can improve downstream task performance compared to traditional prompt engineering approaches.
See [Automated Prompt Engineering with MIPRO](https://github.com/tensorzero/tensorzero/tree/main/recipes/mipro) on GitHub for more details.
## Inference-Time Optimization
The TensorZero Gateway offers built-in inference-time optimizations like dynamic in-context learning and best/mixture-of-N sampling.
See [Inference-Time Optimizations](/gateway/guides/inference-time-optimizations/) for more information.
### Dynamic In-Context Learning
Dynamic In-Context Learning (DICL) is an inference-time optimization that improves LLM performance by incorporating relevant historical examples into your prompt.
Instead of incorporating static examples manually in your prompts, DICL selects the most relevant examples at inference time.
See the [Dynamic In-Context Learning (DICL) Guide](/optimization/dynamic-in-context-learning-dicl) to learn more.
## Custom Recipes
You can also create your own recipes.
Put simply, a recipe takes the inference and feedback data that the TensorZero Gateway stored in your ClickHouse database and generates a new set of variants for your functions.
You should be able to use virtually any LLM engineering workflow with TensorZero, ranging from automated prompt engineering to advanced RLHF workflows.
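As a rough sketch, a custom recipe might pull curated data with the TensorZero client, run your own optimization logic, and emit the files and configuration for a new variant (here, assuming the `extract_entities` function from the guides above; the optimization step and file paths are placeholders, not part of TensorZero's API):
```python theme={null}
from pathlib import Path

from tensorzero import ListInferencesRequest, TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as t0:
    # 1. Curate data: pull inferences with corrected outputs (demonstrations)
    inferences_response = t0.list_inferences(
        request=ListInferencesRequest(
            function_name="extract_entities",
            output_source="demonstration",
        ),
    )
    print(f"Fetched {len(inferences_response.inferences)} demonstrations")

# 2. Run your own optimization workflow (automated prompt engineering, DSPy, etc.)
new_system_template = "You are an expert entity extractor. ..."  # placeholder output

# 3. Emit the artifacts for a new variant
template_path = Path("config/functions/extract_entities/custom_recipe/system_template.minijinja")
template_path.parent.mkdir(parents=True, exist_ok=True)
template_path.write_text(new_system_template)

print("""
[functions.extract_entities.variants.custom_recipe]
type = "chat_completion"
model = "openai::gpt-5-mini"
templates.system.path = "functions/extract_entities/custom_recipe/system_template.minijinja"
""")
```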
See an example of a custom recipe using DSPy below.
## Examples
We are working on a series of **complete runnable examples** illustrating TensorZero's data & learning flywheel.
* [Optimizing Data Extraction (NER) with TensorZero](https://github.com/tensorzero/tensorzero/tree/main/examples/data-extraction-ner) — This example shows how to use TensorZero to optimize a data extraction pipeline. We demonstrate techniques like fine-tuning and dynamic in-context learning (DICL). In the end, an optimized GPT-4o Mini model outperforms GPT-4o on this task — at a fraction of the cost and latency — using a small amount of training data.
* [Agentic RAG — Multi-Hop Question Answering with LLMs](https://github.com/tensorzero/tensorzero/tree/main/examples/rag-retrieval-augmented-generation/simple-agentic-rag/) — This example shows how to build a multi-hop retrieval agent using TensorZero. The agent iteratively searches Wikipedia to gather information, and decides when it has enough context to answer a complex question.
* [Writing Haikus to Satisfy a Judge with Hidden Preferences](https://github.com/tensorzero/tensorzero/tree/main/examples/haiku-hidden-preferences) — This example fine-tunes GPT-4o Mini to generate haikus tailored to a specific taste. You'll see TensorZero's "data flywheel in a box" in action: better variants lead to better data, and better data leads to better variants. You'll see the progress by fine-tuning the LLM multiple times.
* [Image Data Extraction — Multimodal (Vision) Fine-tuning](https://github.com/tensorzero/tensorzero/tree/main/examples/multimodal-vision-finetuning) — This example shows how to fine-tune multimodal models (VLMs) like GPT-4o to improve their performance on vision-language tasks. Specifically, we'll build a system that categorizes document images (screenshots of computer science research papers).
* [Improving LLM Chess Ability with Best/Mixture-of-N Sampling](https://github.com/tensorzero/tensorzero/tree/main/examples/chess-puzzles/) — This example showcases how best-of-N sampling and mixture-of-N sampling can significantly enhance an LLM's chess-playing abilities by selecting the most promising moves from multiple generated options.
# Quickstart
Source: https://www.tensorzero.com/docs/quickstart
Get up and running with TensorZero in 5 minutes.
This Quickstart guide shows how we'd upgrade an OpenAI wrapper to a minimal TensorZero deployment with built-in observability and fine-tuning capabilities — in just 5 minutes.
From there, you can take advantage of dozens of features to build best-in-class LLM applications.
This Quickstart offers a quick tour of TensorZero features.
If you're only interested in inference with the gateway, see the shorter [How to call any LLM](/gateway/call-any-llm) guide.
You can also find the runnable code for this example on [GitHub](https://github.com/tensorzero/tensorzero/tree/main/examples/quickstart).
## Status Quo: OpenAI Wrapper
Imagine we're building an LLM application that writes haikus.
Today, our integration with OpenAI might look like this:
```python title="before.py" theme={null}
from openai import OpenAI
with OpenAI() as client:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
],
)
print(response)
```
```python theme={null}
ChatCompletion(
id='chatcmpl-A5wr5WennQNF6nzF8gDo3SPIVABse',
choices=[
Choice(
finish_reason='stop',
index=0,
logprobs=None,
message=ChatCompletionMessage(
content='Silent minds awaken, \nPatterns dance in code and wire, \nDreams of thought unfold.',
role='assistant',
function_call=None,
tool_calls=None,
refusal=None
)
)
],
created=1725981243,
model='gpt-4o-mini',
object='chat.completion',
system_fingerprint='fp_483d39d857',
usage=CompletionUsage(
completion_tokens=19,
prompt_tokens=22,
total_tokens=41
)
)
```
## Migrating to TensorZero
TensorZero offers dozens of features covering inference, observability, optimization, evaluations, and experimentation.
But the bare-minimum setup requires just a simple configuration file: `tensorzero.toml`.
```toml title="tensorzero.toml" theme={null}
# A function defines the task we're tackling (e.g. generating a haiku)...
[functions.generate_haiku]
type = "chat"
# ... and a variant is one of many implementations we can use to tackle it (a choice of prompt, model, etc.).
# Since we only have one variant for this function, the gateway will always use it.
[functions.generate_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
```
This minimal configuration file tells the TensorZero Gateway everything it needs to replicate our original OpenAI call.
Using the shorthand `openai::gpt-4o-mini` notation is convenient for getting started.
To learn about all configuration options including schemas, templates, and advanced variant types, see [Configure functions and variants](/gateway/configure-functions-and-variants).
For production deployments with multiple providers, routing, and fallbacks, see [Configure models and providers](/gateway/configure-models-and-providers).
## Deploying TensorZero
We're almost ready to start making API calls.
Let's launch TensorZero.
1. Set the environment variable `OPENAI_API_KEY`.
2. Place our `tensorzero.toml` in the `./config` directory.
3. Download the following sample `docker-compose.yml` file.
This Docker Compose configuration sets up a development ClickHouse database (where TensorZero stores data), the TensorZero Gateway, and the TensorZero UI.
```bash theme={null}
curl -LO "https://raw.githubusercontent.com/tensorzero/tensorzero/refs/heads/main/examples/quickstart/docker-compose.yml"
```
```yaml title="docker-compose.yml" theme={null}
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
clickhouse:
image: clickhouse:lts
environment:
- CLICKHOUSE_USER=chuser
- CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1
- CLICKHOUSE_PASSWORD=chpassword
ports:
- "8123:8123"
volumes:
- clickhouse-data:/var/lib/clickhouse
healthcheck:
test: wget --spider --tries 1 http://chuser:chpassword@clickhouse:8123/ping
start_period: 30s
start_interval: 1s
timeout: 1s
# The TensorZero Python client *doesn't* require a separate gateway service.
#
# The gateway is only needed if you want to use the OpenAI Python client
# or interact with TensorZero via its HTTP API (for other programming languages).
#
# The TensorZero UI also requires the gateway service.
gateway:
image: tensorzero/gateway
volumes:
# Mount our tensorzero.toml file into the container
- ./config:/app/config:ro
command: --config-file /app/config/tensorzero.toml
environment:
- TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@clickhouse:8123/tensorzero
- OPENAI_API_KEY=${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
ports:
- "3000:3000"
extra_hosts:
- "host.docker.internal:host-gateway"
depends_on:
clickhouse:
condition: service_healthy
ui:
image: tensorzero/ui
environment:
- TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@clickhouse:8123/tensorzero
- TENSORZERO_GATEWAY_URL=http://gateway:3000
ports:
- "4000:4000"
depends_on:
clickhouse:
condition: service_healthy
volumes:
clickhouse-data:
```
Our setup should look like:
```
- config/
- tensorzero.toml
- after.py (see below)
- before.py
- docker-compose.yml
```
Let's launch everything!
```bash theme={null}
docker compose up
```
## Our First TensorZero API Call
The gateway will replicate our original OpenAI call and store the data in our database — with less than 1ms latency overhead thanks to Rust 🦀.
The TensorZero Gateway can be used with the **TensorZero Python client**, with **OpenAI client (Python, Node, etc.)**, or via its **HTTP API in any programming language**.
You can install the TensorZero Python client with:
```bash theme={null}
pip install tensorzero
```
Then, you can make a TensorZero API call with:
```python title="after.py" theme={null}
from tensorzero import TensorZeroGateway
with TensorZeroGateway.build_embedded(
clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero",
config_file="config/tensorzero.toml",
) as client:
response = client.inference(
function_name="generate_haiku",
input={
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
]
},
)
print(response)
```
```python theme={null}
ChatInferenceResponse(
inference_id=UUID('0191ddb2-2c02-7641-8525-494f01bcc468'),
episode_id=UUID('0191ddb2-28f3-7cc2-b0cc-07f504d37e59'),
variant_name='gpt_4o_mini',
content=[
Text(
type='text',
text='Wires hum with intent, \nThoughts born from code and structure, \nGhost in silicon.'
)
],
usage=Usage(
input_tokens=15,
output_tokens=20
)
)
```
You can install the TensorZero Python client with:
```bash theme={null}
pip install tensorzero
```
Then, you can make a TensorZero API call with:
```python title="after_async.py" theme={null}
import asyncio
from tensorzero import AsyncTensorZeroGateway
async def main():
async with await AsyncTensorZeroGateway.build_embedded(
clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero",
config_file="config/tensorzero.toml",
) as gateway:
response = await gateway.inference(
function_name="generate_haiku",
input={
"messages": [
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
]
},
)
print(response)
asyncio.run(main())
```
```python theme={null}
ChatInferenceResponse(
    inference_id=UUID('01940622-d215-7111-9ca7-4995ef2c43f8'),
    episode_id=UUID('01940622-cba0-7db3-832b-273aff72f95f'),
    variant_name='gpt_4o_mini',
    content=[
        Text(
            type='text',
            text='Wires whisper secrets, \nLogic dances with the light— \nDreams of thoughts unfurl.'
        )
    ],
    usage=Usage(
        input_tokens=15,
        output_tokens=21
    )
)
```
You can run an embedded (in-memory) TensorZero Gateway directly in your OpenAI Python client.
```python title="after_openai.py" "base_url="http://localhost:3000/openai/v1"" "tensorzero::function_name::generate_haiku" theme={null}
from openai import OpenAI
from tensorzero import patch_openai_client
client = OpenAI()
patch_openai_client(
client,
clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero",
config_file="config/tensorzero.toml",
async_setup=False,
)
response = client.chat.completions.create(
model="tensorzero::function_name::generate_haiku",
messages=[
{
"role": "user",
"content": "Write a haiku about TensorZero.",
}
],
)
print(response)
```
```python theme={null}
ChatCompletion(
    id='0194061e-2211-7a90-9087-1c255d060b59',
    choices=[
        Choice(
            finish_reason='stop',
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content='Circuit dreams awake, \nSilent minds in metal form— \nWisdom coded deep.',
                refusal=None,
                role='assistant',
                audio=None,
                function_call=None,
                tool_calls=[]
            )
        )
    ],
    created=1735269425,
    model='gpt_4o_mini',
    object='chat.completion',
    service_tier=None,
    system_fingerprint='',
    usage=CompletionUsage(
        completion_tokens=18,
        prompt_tokens=15,
        total_tokens=33,
        completion_tokens_details=None,
        prompt_tokens_details=None
    ),
    episode_id='0194061e-1fab-7411-9931-576b067cf0c5'
)
```
You can use TensorZero in Node (JavaScript/TypeScript) with the OpenAI Node client.
This approach requires running the TensorZero Gateway as a separate service.
The `docker-compose.yml` above launched the gateway on port 3000.
```ts title="after_openai.ts" "baseURL: "http://localhost:3000/openai/v1"" "tensorzero::function_name::generate_haiku" theme={null}
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:3000/openai/v1",
});

const response = await client.chat.completions.create({
  model: "tensorzero::function_name::generate_haiku",
  messages: [
    {
      role: "user",
      content: "Write a haiku about TensorZero.",
    },
  ],
});

console.log(JSON.stringify(response, null, 2));
```
```json theme={null}
{
  "id": "01958633-3f56-7d33-8776-d209f2e4963a",
  "episode_id": "01958633-3f56-7d33-8776-d2156dd1c44b",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "content": "Wires pulse with knowledge, \nDreams crafted in circuits hum, \nMind of code awakes. ",
        "tool_calls": [],
        "role": "assistant"
      }
    }
  ],
  "created": 1741713261,
  "model": "gpt_4o_mini",
  "system_fingerprint": "",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 23,
    "total_tokens": 38
  }
}
```
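You can also call the TensorZero Gateway's HTTP API directly from any programming language. For example, with `curl`: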
```bash theme={null}
curl -X POST "http://localhost:3000/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_haiku",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Write a haiku about TensorZero."
        }
      ]
    }
  }'
```
```json theme={null}
{
  "inference_id": "01940627-935f-7fa1-a398-e1f57f18064a",
  "episode_id": "01940627-8fe2-75d3-9b65-91be2c7ba622",
  "variant_name": "gpt_4o_mini",
  "content": [
    {
      "type": "text",
      "text": "Wires hum with pure thought, \nDreams of codes in twilight's glow, \nBeyond human touch."
    }
  ],
  "usage": {
    "input_tokens": 15,
    "output_tokens": 23
  }
}
```
## TensorZero UI
The TensorZero UI streamlines LLM engineering workflows like observability and optimization (e.g. fine-tuning).
The Docker Compose file we used above also launched the TensorZero UI.
You can visit the UI at `http://localhost:4000`.
### Observability
The TensorZero UI provides a dashboard for observability data.
We can inspect data about individual inferences, entire functions, and more.
This guide is minimal, so there isn't much observability data to explore yet.
Once we start using more advanced features like feedback and variants, the observability UI will enable us to track metrics, experiments (A/B tests), and more.
### Fine-Tuning
The TensorZero UI also provides a workflow for fine-tuning models like GPT-4o and Llama 3.
With a few clicks, you can launch a fine-tuning job.
Once the job is complete, the TensorZero UI will provide a configuration snippet you can add to your `tensorzero.toml`.
We can also send [metrics & feedback](/gateway/guides/metrics-feedback/) to the TensorZero Gateway.
This data is used to curate better datasets for fine-tuning and other optimization workflows.
Since we haven't done that yet, the TensorZero UI will skip the curation step before fine-tuning.
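As a quick preview, here is a minimal sketch of how you might send boolean feedback with the TensorZero Python client. It assumes you have defined a corresponding metric in `config/tensorzero.toml` (the metric name `haiku_rating` below is purely illustrative); see the metrics & feedback guide linked above for the full workflow.
```python theme={null}
from tensorzero import TensorZeroGateway

# Sketch only: assumes a metric along these lines has been added to config/tensorzero.toml:
#
#   [metrics.haiku_rating]
#   type = "boolean"
#   optimize = "max"
#   level = "inference"

with TensorZeroGateway.build_embedded(
    clickhouse_url="http://chuser:chpassword@localhost:8123/tensorzero",
    config_file="config/tensorzero.toml",
) as client:
    response = client.inference(
        function_name="generate_haiku",
        input={
            "messages": [
                {"role": "user", "content": "Write a haiku about TensorZero."}
            ]
        },
    )

    # Attach feedback (e.g. a thumbs-up from a user) to the inference we just made
    client.feedback(
        metric_name="haiku_rating",
        inference_id=response.inference_id,
        value=True,
    )
```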
## Conclusion & Next Steps
The Quickstart guide gives a tiny taste of what TensorZero is capable of.
We strongly encourage you to check out the guides on [metrics & feedback](/gateway/guides/metrics-feedback/) and [prompt templates & schemas](/gateway/create-a-prompt-template).
Though optional, they unlock many of the downstream features TensorZero offers in experimentation and optimization.
From here, you can explore features like built-in support for [inference-time optimizations](/gateway/guides/inference-time-optimizations/), [retries & fallbacks](/gateway/guides/retries-fallbacks/), [experimentation (A/B testing) with prompts and models](/experimentation/run-adaptive-ab-tests), and a lot more.
# Vision & Roadmap
Source: https://www.tensorzero.com/docs/vision-and-roadmap
Learn more about TensorZero's vision and roadmap.
## Vision
TensorZero enables a data and learning flywheel for optimizing LLM applications: a feedback loop that turns production metrics and human feedback into smarter, faster, and cheaper models and agents.
Today, we provide an open-source stack for industrial-grade LLM applications that unifies an LLM gateway, observability, optimization, evaluation, and experimentation.
Our vision is to automate much of LLM engineering, and we're laying the foundation for that with the open-source project.
Read more about our vision in our [\$7.3M seed round announcement](https://www.tensorzero.com/blog/tensorzero-raises-7-3m-seed-round-to-build-an-open-source-stack-for-industrial-grade-llm-applications/).
## Near-Term Roadmap
TensorZero is under active development.
We ship new features every week.
You can see the major areas we're currently focusing on in [Milestones on GitHub](https://github.com/tensorzero/tensorzero/milestones).
For more granularity, see the `priority-high` and `priority-urgent` [Issues on GitHub](https://github.com/tensorzero/tensorzero/issues?q=is%3Aissue+is%3Aopen+label%3Apriority-high%2Cpriority-urgent).
We encourage [Feature Requests](https://github.com/tensorzero/tensorzero/discussions/categories/feature-requests) and [Bug Reports](https://github.com/tensorzero/tensorzero/discussions/categories/bug-reports).