API reference for the Batch Inference endpoints.
The /batch_inference endpoints allow users to take advantage of batched inference offered by LLM providers.
These inferences are often substantially cheaper than the synchronous APIs.
The handling and eventual data model for inferences made through this endpoint are equivalent to those made through the main /inference endpoint, with a few exceptions:

- Only chat_completion variants are supported.
- The dryrun setting is not supported.

Use the POST /batch_inference endpoint to submit a batch of requests.
Later, you can poll the GET /batch_inference/:batch_id or GET /batch_inference/:batch_id/inference/:inference_id endpoints to check the status of the batch and retrieve results.
Each poll will return either a pending or failed status or the results of the batch.
Even after a batch has completed and been processed, you can continue to poll the endpoint as a way of retrieving the results.
The first time a batch has completed and been processed, the results are stored in the ChatInference, JsonInference, and ModelInference tables as with the /inference
endpoint.
On subsequent polls after the batch has finished, the gateway rehydrates the stored results into the expected response format.
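As a rough sketch of this workflow, assuming the gateway is running at its default address (http://localhost:3000) and a hypothetical generate_haiku function, submitting a batch might look like the following; polling is covered under the GET endpoints below.

```python
import requests  # any HTTP client works

# Assumption: the gateway is running at its default address.
GATEWAY_URL = "http://localhost:3000"

# Submit a batch of two inferences for a hypothetical `generate_haiku` function.
response = requests.post(
    f"{GATEWAY_URL}/batch_inference",
    json={
        "function_name": "generate_haiku",  # hypothetical function name
        "inputs": [
            {
                "messages": [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": "Write a haiku about autumn."}],
                    }
                ]
            },
            {
                "messages": [
                    {
                        "role": "user",
                        "content": [{"type": "text", "text": "Write a haiku about the sea."}],
                    }
                ]
            },
        ],
    },
)
response.raise_for_status()
batch_id = response.json()["batch_id"]  # keep this ID to poll the batch later
```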
POST /batch_inference

Request

additional_tools

A list of tools defined at inference time that the model is allowed to call. Each tool is an object with the following fields: description, name, parameters, and strict. The fields are identical to those in the configuration file, except that the parameters field should contain the JSON schema itself rather than a path to it. See Configuration Reference for more details.
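As a sketch, a single tool object for this field might look like the following (the tool name, description, and JSON schema are hypothetical):

```python
# Illustrative tool object for use in `additional_tools`.
# The name, description, and JSON schema below are hypothetical.
get_temperature_tool = {
    "name": "get_temperature",
    "description": "Get the current temperature for a location.",
    "strict": True,
    # Unlike the configuration file, `parameters` holds the JSON schema itself
    # rather than a path to a schema file:
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
        "additionalProperties": False,
    },
}
```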
allowed_tools

A list of tool names that the model is allowed to call. The tools must be defined in the configuration file. Tools provided in additional_tools are always allowed, irrespective of this field.
credentials

An object mapping dynamic credential names to credential values. Each model provider can be configured to accept credentials at inference time by using the dynamic location (e.g. dynamic::my_dynamic_api_key_name). See the configuration reference for more details. The gateway expects the credentials to be provided in the credentials field of the request body as specified below. The gateway will return a 400 error if the credentials are not provided and the model provider has been configured with dynamic credentials.
Example
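For instance, if a model provider's credentials are configured with the dynamic::my_dynamic_api_key_name location, a request-body fragment might look like the following (the credential value is a placeholder):

```python
# Illustrative request-body fragment for `credentials`.
# The key matches the `dynamic::...` location in the configuration;
# the value itself is a placeholder secret.
credentials_fragment = {
    "credentials": {
        "my_dynamic_api_key_name": "sk-...",  # placeholder value
    },
}
```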
episode_ids

A list of existing episode IDs to associate with the inferences in the batch. Use null for elements that should start a fresh episode. Only use episode IDs that were returned by the TensorZero gateway.
function_name

The name of the function to call for every inference in the batch.

inputs

A list of input objects (see below), one per inference in the batch.

input[].messages

A list of messages to provide to the model (default: []). Each message is an object with the following fields:

- role: The role of the message (assistant or user).
- content: The content of the message (see below).
The content field can have one of the following types of content blocks. Each content block is an object with a type field and additional fields depending on the type.

- If the content block has type text, it must have either of the following additional fields:
  - text: The text for the content block.
  - arguments: A JSON object containing the function arguments for TensorZero functions with templates and schemas (see Prompt Templates & Schemas for details).
- If the content block has type tool_call, it must have the following additional fields:
  - arguments: The arguments for the tool call.
  - id: The ID for the content block.
  - name: The name of the tool for the content block.
- If the content block has type tool_result, it must have the following additional fields:
  - id: The ID for the content block.
  - name: The name of the tool for the content block.
  - result: The result of the tool call.
- If the content block has type file, it must have exactly one of the following additional fields:
  - url: The URL for a remote file.
  - mime_type and data: The MIME type (e.g. image/png, image/jpeg, application/pdf) and base64-encoded data for an embedded file.
- If the content block has type raw_text, it must have the following additional field:
  - value: The text for the content block. This content block will ignore any relevant templates and schemas for this function.
- If the content block has type thought, it must have the following additional field:
  - text: The text for the content block.
- If the content block has type unknown, it must have the following additional fields:
  - data: The original content block from the provider, without any validation or transformation by TensorZero.
  - model_provider_name (optional): A string specifying when this content block should be included in the model provider input. If set, the content block will only be provided to this specific model provider. If not set, the content block is passed to all model providers. For example, the request sketched in the example below will only provide the daydreaming content block to inference requests targeting the your_model_provider_name model provider.
Example
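A sketch of a single element of inputs using a few of these content block types, including the daydreaming unknown block mentioned above (the message text, provider data, and provider name are illustrative):

```python
# Illustrative `input` element showing a few of the content block types above.
# The message text, provider data, and model provider name are hypothetical.
example_input = {
    "messages": [
        {
            "role": "user",
            "content": [
                # A plain text content block:
                {"type": "text", "text": "What is the weather like in Tokyo today?"},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "Let me check that for you."},
                # A provider-specific block passed through without validation, and
                # only sent to the `your_model_provider_name` model provider:
                {
                    "type": "unknown",
                    "data": {"type": "daydreaming", "dream": "..."},
                    "model_provider_name": "your_model_provider_name",
                },
            ],
        },
    ],
}
```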
input[].system

The input for the system message (optional).
output_schemas

A list of JSON schemas (one per inference in the batch) that override the output_schema defined in the function configuration. This schema is used for validating the output of the function, and is sent to providers which support structured outputs.
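As a sketch, overriding the output schema for the first inference in a batch of two might look like the following (the schema is illustrative; the null element is assumed to fall back to the configured output_schema):

```python
# Illustrative `output_schemas` fragment for a batch of two inferences.
# The schema is hypothetical; the None (null) element is assumed to fall back
# to the function's configured `output_schema`.
output_schemas_fragment = {
    "output_schemas": [
        {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
            "additionalProperties": False,
        },
        None,
    ],
}
```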
parallel_tool_calls

A list of booleans (or nulls) overriding the parallel_tool_calls setting for each inference in the batch. Use null for elements that should use the configuration value for the function being called. If you don't provide this field at all, we default to the configuration value for the function being called. Most model providers do not support parallel tool calls; in those cases, the gateway ignores this field. At the moment, only Fireworks AI and OpenAI support parallel tool calls.
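For instance, in a batch of three inferences, this fragment enables parallel tool calls for the first element, disables them for the second, and defers to the function configuration for the third:

```python
# Illustrative `parallel_tool_calls` fragment for a batch of three inferences.
parallel_tool_calls_fragment = {
    "parallel_tool_calls": [True, False, None],  # None falls back to the configuration
}
```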
params

An object overriding inference-time parameters for a particular variant type (default: {}). The object has the format { variant_type: { param: [value1, ...], ... }, ... }. You should prefer to set these parameters in the configuration file if possible; only use this field if you need to set these parameters dynamically at runtime. Each parameter, if specified, should be a list of values (which may be null) with the same length as the batch size. Note that the parameters will apply to every variant of the specified type. Currently, we support the following parameters for chat_completion variants:

- frequency_penalty
- max_tokens
- presence_penalty
- seed
- temperature
- top_p
Example

For example, to set the temperature parameter for a chat_completion variant for the first inference in a batch of 3, you'd include the following in the request body:
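The sketch below uses an illustrative temperature value:

```python
# Illustrative request-body fragment: override `temperature` for the first
# inference only; None (null) entries use the variant's configured value.
params_fragment = {
    "params": {
        "chat_completion": {
            "temperature": [0.7, None, None],
        },
    },
}
```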
[{"user_id": "123"}, null]
or [{"author": "Alice"}, {"author": "Bob"}]
.
tool_choice

If set, overrides the tool choice strategy for the request. The supported strategies are:

- none: The function should not use any tools.
- auto: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
- required: The model should use a tool. If multiple tools are available, the model decides which tool to use.
- { specific = "tool_name" }: The model should use a specific tool. The tool must be defined in the tools section of the configuration file or provided in additional_tools.
variant_name

If set, pins the batch inference request to a particular variant of the function (primarily intended for testing or debugging).
Response

When you make a request to the POST /batch_inference endpoint, the response is a JSON object containing metadata that allows you to refer to the batch and poll it later on. The response is an object with the following fields:
- batch_id: The ID of the batch.
- inference_ids: The IDs of the inferences in the batch, in the same order as the inputs.
- episode_ids: The IDs of the episodes associated with the inferences in the batch, in the same order as the inputs.

Each element of inputs in the request is equal to the input field in a regular inference request, and the response contains a batch_id as well as inference_ids and episode_ids for each inference in the batch.
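For the submission sketched earlier, the response body might look like the following (the UUIDs are placeholders):

```python
# Placeholder response body for a POST /batch_inference request with two inputs.
example_response = {
    "batch_id": "019005ba-3b18-7791-a564-1d1ffd4bc9c8",
    "inference_ids": [
        "019005ba-3b18-7791-a564-1d1ffd4bc9c9",
        "019005ba-3b18-7791-a564-1d1ffd4bc9ca",
    ],
    "episode_ids": [
        "019005ba-3b18-7791-a564-1d1ffd4bc9cb",
        "019005ba-3b18-7791-a564-1d1ffd4bc9cc",
    ],
}
```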
GET /batch_inference/:batch_id

You can use the batch_id returned by the POST /batch_inference endpoint to poll the status of the batch job. While the batch is pending, the response is {"status": "pending"}; if the batch failed, the response is {"status": "failed"}. In both of these cases, the response only includes the status field.

Once the batch has completed and been processed, the response includes both the status field and the inferences field:

- status: "completed"
- batch_id: The ID of the batch.
- inferences: A list of inference objects. Each inference object is the same as the response from a regular inference request.
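A minimal polling sketch against this endpoint, assuming the gateway at http://localhost:3000 and reusing the batch_id returned by the submission:

```python
import time

import requests

GATEWAY_URL = "http://localhost:3000"  # assumption: default gateway address


def wait_for_batch(batch_id: str, interval_s: float = 60.0) -> dict:
    """Poll GET /batch_inference/:batch_id until the batch is no longer pending."""
    while True:
        response = requests.get(f"{GATEWAY_URL}/batch_inference/{batch_id}")
        response.raise_for_status()
        body = response.json()
        if body["status"] != "pending":
            # Either {"status": "failed"} or a completed object that includes `inferences`.
            return body
        time.sleep(interval_s)
```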
GET /batch_inference/:batch_id/inference/:inference_id

This endpoint behaves like the one above. While the batch is pending, the response is {"status": "pending"}; if the batch failed, the response is {"status": "failed"}. In both of these cases, the response only includes the status field.

Once the batch has completed and been processed, the response includes both the status field and the inferences field:

- status: "completed"
- batch_id: The ID of the batch.
- inferences: Unlike above, this request will return a list containing only the requested inference.
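A sketch of retrieving a single inference from a batch, using placeholder IDs taken from an earlier submission:

```python
import requests

GATEWAY_URL = "http://localhost:3000"  # assumption: default gateway address

# Placeholder IDs from the POST /batch_inference response.
batch_id = "019005ba-3b18-7791-a564-1d1ffd4bc9c8"
inference_id = "019005ba-3b18-7791-a564-1d1ffd4bc9c9"

response = requests.get(
    f"{GATEWAY_URL}/batch_inference/{batch_id}/inference/{inference_id}"
)
response.raise_for_status()
body = response.json()
if body["status"] == "completed":
    # The list contains only the requested inference.
    [inference] = body["inferences"]
```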