POST /inference
The inference endpoint is the core of the TensorZero Gateway API.
Under the hood, the gateway validates the request, samples a variant from the function, handles templating when applicable, and routes the inference to the appropriate model provider.
If a problem occurs, it attempts to gracefully fall back to a different model provider or variant.
After a successful inference, it returns the data to the client and asynchronously stores structured information in the database.
See the API Reference for `POST /openai/v1/chat/completions` for an inference endpoint compatible with the OpenAI API.

Request
additional_tools
- Type: a list of tools (see below)
- Required: no (default: `[]`)

A list of tools defined at inference time that the model is allowed to call. Each tool is an object with the following fields: `description`, `name`, `parameters`, and `strict`.

The fields are identical to those in the configuration file, except that the `parameters` field should contain the JSON schema itself rather than a path to it. See the Configuration Reference for more details.
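For illustration, a minimal sketch of a request that defines a tool at inference time (the `get_temperature` tool name and schema are assumptions for this example):

```json
{
  "function_name": "my_function_name",
  "input": { "...": "..." },
  "additional_tools": [
    {
      "name": "get_temperature",
      "description": "Get the current temperature for a given location.",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"]
      },
      "strict": false
    }
  ]
}
```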
allowed_tools
- Type: list of strings
- Required: no

A list of tool names that the model is allowed to call. The tools must be defined in the configuration file. Tools provided in `additional_tools` are always allowed, irrespective of this field.
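For example, a request could restrict the function to a single configured tool (the tool name is an assumption for illustration):

```json
{
  "function_name": "my_function_name",
  "input": { "...": "..." },
  "allowed_tools": ["get_temperature"]
}
```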
cache_options
- Type: object
- Required: no (default: `{"enabled": "write_only"}`)

Options for controlling inference caching.

cache_options.enabled
- Type: string
- Required: no (default: `"write_only"`)

The cache mode for the request:

- `"write_only"` (default): Only write to the cache, but don't serve cached responses
- `"read_only"`: Only read from the cache, but don't write new entries
- `"on"`: Both read from and write to the cache
- `"off"`: Disable caching completely

If `dryrun=true`, the gateway never writes to the cache.
cache_options.max_age_s
- Type: integer
- Required: no (default: `null`)

The maximum age (in seconds) of a cache entry that the gateway is allowed to serve. For example, if `max_age_s=3600`, the gateway will only use cache entries that were created in the last hour.
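Putting the two options together, a sketch of a request that both reads from and writes to the cache, but ignores entries older than an hour:

```json
{
  "function_name": "my_function_name",
  "input": { "...": "..." },
  "cache_options": {
    "enabled": "on",
    "max_age_s": 3600
  }
}
```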
credentials
- Type: object (a map from dynamic credential names to API keys)
- Required: no (default: no credentials)

Each model provider in your TensorZero configuration can be set up to accept credentials at inference time by using the `dynamic` location (e.g. `dynamic::my_dynamic_api_key_name`). See the Configuration Reference for more details.

The gateway expects the credentials to be provided in the `credentials` field of the request body as specified below. The gateway will return a 400 error if the credentials are not provided and the model provider has been configured with dynamic credentials.
Example
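A sketch of a request that supplies a dynamic credential (the credential name must match the `dynamic::` location in your configuration; the key value is illustrative):

```json
{
  "function_name": "my_function_name",
  "input": { "...": "..." },
  "credentials": {
    "my_dynamic_api_key_name": "sk-..."
  }
}
```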
dryrun
- Type: boolean
- Required: no

If `true`, the inference request will be executed but won't be stored in the database. The gateway will still call the downstream model providers.

This field is primarily for debugging and testing, and you should generally not use it in production.
episode_id
- Type: UUID
- Required: no

The ID of an existing episode to associate the inference with. For the first inference of a new episode, you should not provide an `episode_id`. If null, the gateway will generate a new episode ID and return it in the response.

Only use episode IDs that were returned by the TensorZero gateway.
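For example, a follow-up turn in a multi-turn interaction might reuse the episode ID returned by the first inference (the UUID shown is illustrative):

```json
{
  "function_name": "my_function_name",
  "input": { "...": "..." },
  "episode_id": "01920c75-d114-7aa1-aa8e-a1f3a8e9c28d"
}
```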
extra_body
- Type: array of objects (see below)
- Required: no

The `extra_body` field allows you to modify the request body that TensorZero sends to a model provider. This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.

Each object in the array must have three fields:

- `variant_name` or `model_provider_name`: The modification will only be applied to the specified variant or model provider
- `pointer`: A JSON Pointer string specifying where to modify the request body
- One of the following:
  - `value`: The value to insert at that location; it can be of any type, including nested types
  - `delete = true`: Deletes the field at the specified location, if present

You can also set `extra_body` in the configuration file. The values provided at inference time take priority over the values in the configuration file.
Example: `extra_body`
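A minimal sketch of the three pieces (the `safety_checks` field and values are assumptions for illustration). If TensorZero would normally send this request body to the provider…

```json
{
  "model": "your-model-name",
  "messages": ["..."],
  "safety_checks": {
    "no_internet": true
  }
}
```

…then the following `extra_body` in the inference request…

```json
{
  "extra_body": [
    {
      "variant_name": "my_variant",
      "pointer": "/safety_checks/no_internet",
      "value": false
    }
  ]
}
```

…overrides the request body (for `my_variant` only) to:

```json
{
  "model": "your-model-name",
  "messages": ["..."],
  "safety_checks": {
    "no_internet": false
  }
}
```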
extra_headers
- Type: array of objects (see below)
- Required: no

The `extra_headers` field allows you to modify the request headers that TensorZero sends to a model provider. This advanced feature is an "escape hatch" that lets you use provider-specific functionality that TensorZero hasn't implemented yet.

Each object in the array must have three fields:

- `variant_name` or `model_provider_name`: The modification will only be applied to the specified variant or model provider
- `name`: The name of the header to modify
- `value`: The value to set the header to

You can also set `extra_headers` in the configuration file. The values provided at inference time take priority over the values in the configuration file.
Example: `extra_headers`
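A minimal sketch (the `anthropic-beta` header name and values are assumptions for illustration). If TensorZero would normally send the following request headers to the provider…

```
anthropic-beta: tools-2024-04-04
```

…then the following `extra_headers` in the inference request…

```json
{
  "extra_headers": [
    {
      "variant_name": "my_variant",
      "name": "anthropic-beta",
      "value": "tools-2024-05-16"
    }
  ]
}
```

…overrides the request headers to:

```
anthropic-beta: tools-2024-05-16
```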
function_name
- Type: string
- Required: either `function_name` or `model_name` must be provided

The name of the function to call.

Alternatively, you can use the `model_name` field to call a model directly, without the need to define a function. See below for more details.
include_original_response
- Type: boolean
- Required: no

If `true`, the original response from the model will be included in the response in the `original_response` field as a string. See `original_response` in the response section for more details.
input
- Type: varies
- Required: yes

The input to the function.

input.messages
- Type: list of messages (see below)
- Required: no (default: `[]`)

A list of messages to provide to the model. Each message is an object with the following fields:

- `role`: The role of the message (`assistant` or `user`)
- `content`: The content of the message (see below)
The `content` field can have one of the following types:

- string: the text for a text message (only allowed if there is no schema for that role)
- list of content blocks: the content blocks for the message (see below)

A content block is an object with a field `type` and additional fields depending on the type.
If the content block has type `text`, it must have either of the following additional fields:

- `text`: The text for the content block.
- `arguments`: A JSON object containing the function arguments for TensorZero functions with templates and schemas (see Prompt Templates & Schemas for details).
If the content block has type `tool_call`, it must have the following additional fields:

- `arguments`: The arguments for the tool call.
- `id`: The ID for the content block.
- `name`: The name of the tool for the content block.
If the content block has type `tool_result`, it must have the following additional fields:

- `id`: The ID for the content block.
- `name`: The name of the tool for the content block.
- `result`: The result of the tool call.
If the content block has type `file`, it must have exactly one of the following additional fields:

- `url`: The URL for a remote file.
- `mime_type` and `data`: The MIME type (e.g. `image/png`, `image/jpeg`, `application/pdf`) and `base64`-encoded data for an embedded file.
If the content block has type `raw_text`, it must have the following additional fields:

- `value`: The text for the content block. This content block will ignore any relevant templates and schemas for this function.
If the content block has type `thought`, it must have the following additional fields:

- `text`: The text for the content block.
If the content block has type `unknown`, it must have the following additional fields:

- `data`: The original content block from the provider, without any validation or transformation by TensorZero.
- `model_provider_name` (optional): A string specifying when this content block should be included in the model provider input. If set, the content block will only be provided to this specific model provider. If not set, the content block is passed to all model providers.

For example, the following `unknown` content block forwards a `daydreaming` content block to inference requests targeting the `your_model_provider_name` model provider:
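A sketch of such a content block (the `daydreaming` payload is a made-up provider-specific example, and the fully-qualified provider name format is an assumption):

```json
{
  "type": "unknown",
  "data": {
    "type": "daydreaming",
    "dream": "..."
  },
  "model_provider_name": "tensorzero::model_name::your_model_name::provider_name::your_model_provider_name"
}
```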
Certain reasoning models (e.g. DeepSeek R1) can include `thought` content blocks in the response. These content blocks can't directly be used as inputs to subsequent inferences in multi-turn scenarios. If you need to provide `thought` content blocks to a model, you should convert them to `text` content blocks.

Example
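A sketch of an `input` object with a system prompt and a mixed-content user message (the values are illustrative):

```json
{
  "input": {
    "system": "You are a helpful assistant.",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is in this image?" },
          { "type": "file", "url": "https://example.com/image.png" }
        ]
      }
    ]
  }
}
```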
input.system
- Type: string or object
- Required: no

The input for the system message. If the function has a system schema, this field should be an object that matches the schema; otherwise, it should be a string.
model_name
- Type: string
- Required: either `model_name` or `function_name` must be provided

The name of the model to call directly, without the need to define a function. Under the hood, the gateway uses a built-in passthrough function called `tensorzero::default`.
| To call… | Use this format… |
| --- | --- |
| A function defined as `[functions.my_function]` in your `tensorzero.toml` configuration file | `function_name="my_function"` (not `model_name`) |
| A model defined as `[models.my_model]` in your `tensorzero.toml` configuration file | `model_name="my_model"` |
| A model offered by a model provider, without defining it in your `tensorzero.toml` configuration file (if supported, see below) | `model_name="{provider_type}::{model_name}"` |
The following model providers support short-hand model names: `anthropic`, `deepseek`, `fireworks`, `gcp_vertex_anthropic`, `gcp_vertex_gemini`, `google_ai_studio_gemini`, `groq`, `hyperbolic`, `mistral`, `openai`, `openrouter`, `together`, and `xai`.

For example, given a `tensorzero.toml` configuration like the following:
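A sketch of the configuration the bullets below assume (the variant and provider details are illustrative):

```toml
[functions.extract-data]
type = "chat"

[functions.extract-data.variants.baseline]
type = "chat_completion"
model = "gpt-4o"

[models.gpt-4o]
routing = ["openai", "azure"]

[models.gpt-4o.providers.openai]
type = "openai"
model_name = "gpt-4o"

[models.gpt-4o.providers.azure]
type = "azure"
deployment_id = "gpt-4o"
endpoint = "https://your-azure-endpoint.openai.azure.com"
```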
- `function_name="extract-data"` calls the `extract-data` function defined above.
- `model_name="gpt-4o"` calls the `gpt-4o` model in your configuration, which supports fallback from `openai` to `azure`. See Retries & Fallbacks for details.
- `model_name="openai::gpt-4o"` calls the OpenAI API directly for the `gpt-4o` model, ignoring the `gpt-4o` model defined above.
Be careful about the different prefixes: `model_name="gpt-4o"` will use the `[models.gpt-4o]` model defined in the `tensorzero.toml` file, whereas `model_name="openai::gpt-4o"` will call the OpenAI API directly for the `gpt-4o` model.

output_schema
- Type: object (valid JSON Schema)
- Required: no

If set, this field overrides the `output_schema` defined in the function configuration for a JSON function. This dynamic output schema is used for validating the output of the function, and is sent to providers which support structured outputs.
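For example, a request could narrow a JSON function's output at inference time (the schema shown is illustrative):

```json
{
  "function_name": "my_json_function",
  "input": { "...": "..." },
  "output_schema": {
    "type": "object",
    "properties": {
      "answer": { "type": "string" }
    },
    "required": ["answer"]
  }
}
```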
parallel_tool_calls
- Type: boolean
- Required: no

If `true`, the function will be allowed to request multiple tool calls in a single conversation turn. If not set, we default to the configuration value for the function being called.

Most model providers do not support parallel tool calls. In those cases, the gateway ignores this field. At the moment, only Fireworks AI and OpenAI support parallel tool calls.
params
- Type: object (see below)
- Required: no (default: `{}`)

Overrides inference-time parameters for a particular variant type. The field's format is `{ variant_type: { param: value, ... }, ... }`.

You should prefer to set these parameters in the configuration file if possible. Only use this field if you need to set these parameters dynamically at runtime. Note that the parameters will apply to every variant of the specified type.

Currently, we support the following parameters for the `chat_completion` variant type:

- `frequency_penalty`
- `json_mode`
- `max_tokens`
- `presence_penalty`
- `seed`
- `stop_sequences`
- `temperature`
- `top_p`
Example

For example, if you wanted to dynamically override the `temperature` parameter for `chat_completion` variants, you'd include the following in the request body:
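```json
{
  "params": {
    "chat_completion": {
      "temperature": 0.7
    }
  }
}
```

(The specific temperature value is illustrative.)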
See "Chat Function with Dynamic Inference Parameters" for a complete example.

stream
- Type: boolean
- Required: no

If `true`, the gateway will stream the response from the model provider.
tags
- Type: flat JSON object with string keys and values
- Required: no

User-provided tags to associate with the inference. For example, `{"user_id": "123"}` or `{"author": "Alice"}`.
tool_choice
- Type: string or object (see below)
- Required: no

If set, overrides the tool choice strategy for the request. The supported strategies are:

- `none`: The function should not use any tools.
- `auto`: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
- `required`: The model should use a tool. If multiple tools are available, the model decides which tool to use.
- `{ specific = "tool_name" }`: The model should use a specific tool. The tool must be defined in the `tools` section of the configuration file or provided in `additional_tools`.
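For example, to force a specific tool in a JSON request body (the tool name is an assumption for illustration; the TOML-style `{ specific = "tool_name" }` above is expressed as a JSON object here):

```json
{
  "function_name": "my_function_name",
  "input": { "...": "..." },
  "tool_choice": { "specific": "get_temperature" }
}
```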
variant_name
- Type: string
- Required: no

If set, pins the inference request to a particular variant. You should generally not set this field, and instead let the gateway sample a variant; it is primarily intended for testing and debugging.
Response

The response format depends on the function type (as defined in the configuration file) and whether the response is streamed or not.

Chat Function

When the function type is `chat`, the response is structured as follows.

In regular (non-streaming) mode, the response is a JSON object with the following fields:
content
- Type: a list of content blocks (see below)

The content blocks generated by the model. Most content blocks have `type` equal to `text` or `tool_call`. Reasoning models (e.g. DeepSeek R1) might also include `thought` content blocks.

If `type` is `text`, the content block has the following fields:

- `text`: The text for the content block.
If `type` is `tool_call`, the content block has the following fields:

- `arguments` (object): The validated arguments for the tool call (`null` if invalid).
- `id` (string): The ID of the content block.
- `name` (string): The validated name of the tool (`null` if invalid).
- `raw_arguments` (string): The arguments for the tool call generated by the model (which might be invalid).
- `raw_name` (string): The name of the tool generated by the model (which might be invalid).
If `type` is `thought`, the content block has the following fields:

- `text` (string): The text of the thought.

Model providers can also return content blocks of type `unknown` with the following additional fields:

- `data`: The original content block from the provider, without any validation or transformation by TensorZero.
- `model_provider_name`: The fully-qualified name of the model provider that returned the content block.
For example, if `your_model_provider_name` returns a content block of type `daydreaming`, it will be included in the response like this:
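A sketch of such a response content block (the `daydreaming` payload and the fully-qualified provider name format are illustrative):

```json
{
  "type": "unknown",
  "data": {
    "type": "daydreaming",
    "dream": "..."
  },
  "model_provider_name": "tensorzero::model_name::your_model_name::provider_name::your_model_provider_name"
}
```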
episode_id
- Type: UUID

The ID of the episode associated with the inference.

inference_id
- Type: UUID

The ID assigned to the inference.
original_response
- Type: string (optional)

The original response from the model provider (only available if `include_original_response` is `true`).

The returned data depends on the variant type:

- `chat_completion`: raw response from the inference to the `model`
- `experimental_best_of_n_sampling`: raw response from the inference to the `evaluator`
- `experimental_mixture_of_n_sampling`: raw response from the inference to the `fuser`
- `experimental_dynamic_in_context_learning`: raw response from the inference to the `model`
- `experimental_chain_of_thought`: raw response from the inference to the `model`
variant_name
- Type: string

The name of the variant used for the inference.

usage
- Type: object (optional)

The usage metrics for the inference:

- `input_tokens`: The number of input tokens used for the inference.
- `output_tokens`: The number of output tokens used for the inference.
JSON Function

When the function type is `json`, the response is structured as follows.

In regular (non-streaming) mode, the response is a JSON object with the following fields:
inference_id
- Type: UUID

The ID assigned to the inference.

episode_id
- Type: UUID

The ID of the episode associated with the inference.

original_response
- Type: string (optional)

The original response from the model provider (only available if `include_original_response` is `true`).

The returned data depends on the variant type:

- `chat_completion`: raw response from the inference to the `model`
- `experimental_best_of_n_sampling`: raw response from the inference to the `evaluator`
- `experimental_mixture_of_n_sampling`: raw response from the inference to the `fuser`
- `experimental_dynamic_in_context_learning`: raw response from the inference to the `model`
- `experimental_chain_of_thought`: raw response from the inference to the `model`
output
- Type: object (see below)

The output of the function:

- `raw`: The raw response from the model provider (which might be invalid JSON).
- `parsed`: The parsed response from the model provider (`null` if invalid JSON).
variant_name
- Type: string

The name of the variant used for the inference.

usage
- Type: object (optional)

The usage metrics for the inference:

- `input_tokens`: The number of input tokens used for the inference.
- `output_tokens`: The number of output tokens used for the inference.
Examples

Chat Function
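A minimal sketch of a complete example (the function name, model, prompt, gateway address, and response values are assumptions for illustration):

Configuration

```toml
[functions.generate-haiku]
type = "chat"

[functions.generate-haiku.variants.gpt-4o-mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
```

Request

```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate-haiku",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Write a haiku about artificial intelligence."
        }
      ]
    }
  }'
```

Response

```json
{
  "inference_id": "01920c75-d114-7aa1-aa8e-a1f3a8e9c28d",
  "episode_id": "01920c75-d114-7aa1-aa8e-a1f3b1f2d845",
  "variant_name": "gpt-4o-mini",
  "content": [
    {
      "type": "text",
      "text": "Silent circuits hum..."
    }
  ],
  "usage": {
    "input_tokens": 15,
    "output_tokens": 20
  }
}
```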
Chat Function with Schemas

Chat Function with Tool Use

Chat Function with Multi-Turn Tool Use

Chat Function with Dynamic Tool Use

Chat Function with Dynamic Inference Parameters

JSON Function