The /batch_inference endpoints allow users to take advantage of batched inference offered by LLM providers.
These inferences are often substantially cheaper than the synchronous APIs.
The handling and eventual data model for inferences made through this endpoint are equivalent to those made through the main /inference endpoint, with a few exceptions:
- The batch samples a single variant from the function being called.
- There are no fallbacks or retries for batched functions.
- Only variants of type chat_completion are supported.
- Caching is not supported.
- The dryrun setting is not supported.
- Streaming is not supported.
Use the POST /batch_inference endpoint to submit a batch of requests.
Later, you can poll the GET /batch_inference/:batch_id or GET /batch_inference/:batch_id/inference/:inference_id endpoints to check the status of the batch and retrieve results.
Each poll will return either a pending or failed status or the results of the batch.
Even after a batch has completed and been processed, you can continue to poll the endpoint as a way of retrieving the results.
The first time a batch has completed and been processed, the results are stored in the ChatInference, JsonInference, and ModelInference tables, just as with the /inference endpoint.
If you poll again after the batch has finished, the gateway will rehydrate the stored results into the expected response format.
See the Batch Inference Guide for a simple example of using the batch inference endpoints.
POST /batch_inference
Request
additional_tools
- Type: list of lists of tools (see below)
- Required: no (default: no additional tools)
Each tool is an object with the following fields: description, name, parameters, and strict.
The fields are identical to those in the configuration file, except that the parameters
field should contain the JSON schema itself rather than a path to it.
See Configuration Reference for more details.
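As an illustrative sketch (the tool name and JSON schema below are hypothetical, not part of TensorZero), an additional_tools value for a batch of two inferences might look like this, defining one extra tool for the first inference and none for the second:

```python
# Hypothetical sketch: one dynamically defined tool for the first inference,
# none for the second. The tool name and schema are illustrative.
additional_tools = [
    [
        {
            "name": "get_temperature",
            "description": "Get the current temperature for a given location.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
                "additionalProperties": False,
            },
            "strict": False,
        }
    ],
    [],  # no additional tools for the second inference
]
```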
allowed_tools
- Type: list of lists of strings
- Required: no
Tools provided in additional_tools are always allowed, irrespective of this field.
credentials
- Type: object (a map from dynamic credential names to API keys)
- Required: no (default: no credentials)
Model providers can be configured to accept credentials at inference time by using the dynamic location (e.g. dynamic::my_dynamic_api_key_name). See the configuration reference for more details.
The gateway expects the credentials to be provided in the credentials
field of the request body as specified below.
The gateway will return a 400 error if the credentials are not provided and the model provider has been configured with dynamic credentials.
Example
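A minimal sketch, assuming a model provider configured with the dynamic credential name my_dynamic_api_key_name from the example above; the gateway URL, function name, inputs, and environment variable are assumptions:

```python
import os
import requests

# Sketch: provide a dynamic credential alongside a batch request.
# "my_dynamic_api_key_name" matches the dynamic credential name above;
# the gateway URL, function name, and environment variable are illustrative.
payload = {
    "function_name": "my_function_name",
    "inputs": [{"messages": [{"role": "user", "content": "Hello!"}]}],
    "credentials": {
        "my_dynamic_api_key_name": os.environ["MY_PROVIDER_API_KEY"],
    },
}
response = requests.post("http://localhost:3000/batch_inference", json=payload)
print(response.json())
```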
episode_ids
- Type: list of UUIDs
- Required: no
Provide null for episode IDs for elements that should start a fresh episode.
Only use episode IDs that were returned by the TensorZero gateway.
function_name
- Type: string
- Required: yes
inputs
- Type: list of input objects (see below)
- Required: yes
input[].messages
- Type: list of messages (see below)
- Required: no (default: [])
Each message is an object with the following fields:
- role: The role of the message (assistant or user).
- content: The content of the message (see below).
The content field can have one of the following types:
- string: the text for a text message (only allowed if there is no schema for that role)
- list of content blocks: the content blocks for the message (see below)
Each content block is an object with a type field and additional fields depending on the type.
If the content block has type text, it must have either of the following additional fields:
- text: The text for the content block.
- arguments: A JSON object containing the function arguments for TensorZero functions with templates and schemas (see Prompt Templates & Schemas for details).
If the content block has type tool_call, it must have the following additional fields:
- arguments: The arguments for the tool call.
- id: The ID for the content block.
- name: The name of the tool for the content block.
If the content block has type tool_result, it must have the following additional fields:
- id: The ID for the content block.
- name: The name of the tool for the content block.
- result: The result of the tool call.
If the content block has type file, it must have exactly one of the following additional fields:
- url: The URL for a remote file.
- mime_type and data: The MIME type (e.g. image/png, image/jpeg, application/pdf) and base64-encoded data for an embedded file.
If the content block has type raw_text, it must have the following additional field:
- value: The text for the content block. This content block will ignore any relevant templates and schemas for this function.
If the content block has type thought, it must have the following additional field:
- text: The text for the content block.
If the content block has type unknown, it must have the following additional fields:
- data: The original content block from the provider, without any validation or transformation by TensorZero.
- model_provider_name (optional): A string specifying when this content block should be included in the model provider input. If set, the content block will only be provided to this specific model provider. If not set, the content block is passed to all model providers.
For example, the request below forwards a daydreaming content block only to inference requests targeting the your_model_provider_name model provider.
Example
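A sketch of an inputs list for a batch of two inferences: the first sends a plain text message, and the second uses content blocks, including an unknown block that is only forwarded to the your_model_provider_name model provider (the daydreaming payload is hypothetical):

```python
# Illustrative sketch of the `inputs` field for a batch of two inferences.
inputs = [
    {
        "messages": [
            {"role": "user", "content": "Write a haiku about autumn."},
        ],
    },
    {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Write a haiku about winter."},
                    {
                        # Provider-specific content block, forwarded without validation.
                        "type": "unknown",
                        "data": {"type": "daydreaming", "dream": "..."},  # hypothetical payload
                        "model_provider_name": "your_model_provider_name",
                    },
                ],
            },
        ],
    },
]
```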
input[].system
- Type: string or object
- Required: no
output_schemas
- Type: list of optional objects (valid JSON Schema)
- Required: no
If provided, each element overrides the output_schema defined in the function configuration for the corresponding inference.
This schema is used for validating the output of the function, and is sent to providers which support structured outputs.
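For instance, a sketch of overriding the output schema for the first inference in a batch of two while keeping the configured schema for the second (the schema itself is hypothetical):

```python
# Illustrative sketch: override the configured output_schema for the first
# inference only; None keeps the schema from the function configuration.
output_schemas = [
    {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
        "additionalProperties": False,
    },
    None,
]
```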
parallel_tool_calls
- Type: list of optional booleans
- Required: no
Provide null for elements that should use the configuration value for the function being called.
If you don’t provide this field at all, we default to the configuration value for the function being called.
Most model providers do not support parallel tool calls. In those cases, the gateway ignores this field.
At the moment, only Fireworks AI and OpenAI support parallel tool calls.
params
- Type: object (see below)
- Required: no (default: {})
The expected format is { variant_type: { param: [value1, ...], ... }, ... }.
You should prefer to set these parameters in the configuration file if possible.
Only use this field if you need to set these parameters dynamically at runtime.
Each parameter, if specified, should be a list of values (which may be null) with the same length as the batch size.
Note that the parameters will apply to every variant of the specified type.
Currently, we support the following:
- chat_completion
  - frequency_penalty
  - max_tokens
  - presence_penalty
  - seed
  - temperature
  - top_p
Example
For example, if you wanted to dynamically override the temperature parameter for a chat_completion variant for the first inference in a batch of 3, you’d include the following in the request body:
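A sketch of that fragment (shown as a Python dict; the value 0.7 is illustrative and the other request fields are omitted):

```python
# Override `temperature` for chat_completion variants for the first inference
# only, in a batch of 3. One list element per inference in the batch.
params = {
    "chat_completion": {
        "temperature": [0.7, None, None],
    },
}
```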
tags
- Type: list of optional JSON objects with string keys and values
- Required: no
A list of optional tag objects to associate with each inference, e.g. [{"user_id": "123"}, null] or [{"author": "Alice"}, {"author": "Bob"}].
tool_choice
- Type: list of optional strings
- Required: no
The supported tool choice strategies are:
- none: The function should not use any tools.
- auto: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
- required: The model should use a tool. If multiple tools are available, the model decides which tool to use.
- { specific = "tool_name" }: The model should use a specific tool. The tool must be defined in the tools section of the configuration file or provided in additional_tools.
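For example, a sketch of a per-inference override for a batch of three, letting the second inference fall back to the configured value:

```python
# Illustrative sketch: per-inference tool choice for a batch of 3.
tool_choice = [
    "auto",      # first inference: the model decides whether to call a tool
    None,        # second inference: use the function's configured tool choice
    "required",  # third inference: the model must call a tool
]
```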
variant_name
- Type: string
- Required: no
Response
For a POST request to /batch_inference, the response is a JSON object containing metadata that allows you to refer to the batch and poll it later on.
The response is an object with the following fields:
batch_id
- Type: UUID
inference_ids
- Type: list of UUIDs
episode_ids
- Type: list of UUIDs
Example
Imagine you have a simple TensorZero function that generates haikus using GPT-4o Mini.
You can submit a batch of requests to this function with a single POST /batch_inference call. Each element of inputs is equal to the input field in a regular inference request.
The response includes a batch_id as well as inference_ids and episode_ids for each inference in the batch.
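Assuming the gateway is listening on http://localhost:3000 and the haiku function is named generate_haiku (both assumptions for this sketch), you could submit the batch with any HTTP client:

```python
import requests

# Illustrative sketch: submit a batch of three haiku requests.
# The gateway URL and function name are assumptions, not prescribed by this page.
resp = requests.post(
    "http://localhost:3000/batch_inference",
    json={
        "function_name": "generate_haiku",
        "inputs": [
            {"messages": [{"role": "user", "content": "Write a haiku about artificial intelligence."}]},
            {"messages": [{"role": "user", "content": "Write a haiku about general aviation."}]},
            {"messages": [{"role": "user", "content": "Write a haiku about anime."}]},
        ],
    },
)
resp.raise_for_status()
batch = resp.json()
# The response contains a batch_id plus one inference_id and one episode_id per input, e.g.:
# {"batch_id": "...", "inference_ids": ["...", "...", "..."], "episode_ids": ["...", "...", "..."]}
print(batch["batch_id"])
```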
GET /batch_inference/:batch_id
Both this and the following GET endpoint can be used to poll the status of a batch.
If you use this endpoint and poll with only the batch ID, the entire batch will be returned if possible.
The response format depends on the function type as well as the batch status when polled.
Pending
{"status": "pending"}
Failed
{"status": "failed"}
Completed
status
- Type: literal string "completed"
batch_id
- Type: UUID
inferences
- Type: list of objects that exactly match the response body in the inference endpoint documented here.
Example
Extending the example from above, you can use the batch_id to poll the status of this job.
While the batch is still pending (or has failed), the response will only contain the status field.
Once the batch has completed, the response will contain the status field and the inferences field.
Each inference object is the same as the response from a regular inference request.
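Continuing the sketch above (same gateway URL assumption), you could poll the batch like this:

```python
import requests

# Illustrative sketch: poll the batch by its batch_id.
batch_id = batch["batch_id"]  # from the submission sketch above
result = requests.get(f"http://localhost:3000/batch_inference/{batch_id}").json()

if result["status"] == "pending":
    print("Still pending; poll again later.")
elif result["status"] == "failed":
    print("The batch failed.")
else:  # "completed"
    # Each element of `inferences` matches the regular inference response body.
    for inference in result["inferences"]:
        print(inference)
```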
GET /batch_inference/:batch_id/inference/:inference_id
This endpoint can be used to poll the status of a single inference in a batch.
Since the polling involves pulling data on all the inferences in the batch, we also store the status of all those inferences in ClickHouse.
The response format depends on the function type as well as the batch status when polled.
Pending
{"status": "pending"}
Failed
{"status": "failed"}
Completed
status
- Type: literal string "completed"
batch_id
- Type: UUID
inferences
- Type: list containing a single object that exactly matches the response body in the inference endpoint documented here.
Example
Similar to above, we can also poll a particular inference.
While the batch is still pending, the response will only contain the status field.
Once the batch has completed, the response will contain the status field and the inferences field.
Unlike above, this request will return a list containing only the requested inference.
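Under the same assumptions, a sketch of polling a single inference from the batch:

```python
import requests

# Illustrative sketch: poll one inference from the batch.
batch_id = batch["batch_id"]             # from the submission sketch above
inference_id = batch["inference_ids"][0]
url = f"http://localhost:3000/batch_inference/{batch_id}/inference/{inference_id}"
result = requests.get(url).json()

if result["status"] == "completed":
    # `inferences` is a list containing only the requested inference.
    print(result["inferences"][0])
else:
    print(result["status"])  # "pending" or "failed"
```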