Example
You can also find the runnable code for this example on GitHub.
To start a batch inference job, send a POST request to the /batch_inference endpoint. The inputs field is a list of inputs, where each input is equal to the input field in a regular inference request.
The response includes a batch_id as well as inference_ids and episode_ids for each inference in the batch.
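The snippet below is a minimal sketch of submitting a batch job over HTTP with Python and the requests library. The gateway URL, the function name, and the example inputs are illustrative assumptions; only the endpoint path and the inputs, batch_id, inference_ids, and episode_ids fields come from the description above.

```python
import requests

# Assumption: the TensorZero Gateway is running locally on this port.
TENSORZERO_URL = "http://localhost:3000"

# Assumption: a function named "my_function" exists in your configuration.
# Each entry in "inputs" mirrors the input field of a regular inference request.
payload = {
    "function_name": "my_function",
    "inputs": [
        {"messages": [{"role": "user", "content": "What is the capital of France?"}]},
        {"messages": [{"role": "user", "content": "What is the capital of Japan?"}]},
    ],
}

response = requests.post(f"{TENSORZERO_URL}/batch_inference", json=payload)
response.raise_for_status()
job = response.json()

# The response carries a batch_id plus per-inference inference_ids and episode_ids.
batch_id = job["batch_id"]
print(batch_id, job["inference_ids"], job["episode_ids"])
```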
You can use this batch_id to poll for the status of the job or retrieve the results using the GET /batch_inference/{batch_id} endpoint.
While the job is pending, the response only includes the status field.
Once the job completes, the response includes both the status field and the inferences field.
Each inference object is the same as the response from a regular inference request.
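As a rough sketch, a polling loop might look like the following. The poll interval and the assumption that any non-pending status means the job is finished are choices made here for brevity, not behavior guaranteed by the gateway.

```python
import time

import requests

TENSORZERO_URL = "http://localhost:3000"  # assumed gateway URL
batch_id = "..."  # the batch_id returned when the job was submitted

while True:
    response = requests.get(f"{TENSORZERO_URL}/batch_inference/{batch_id}")
    response.raise_for_status()
    result = response.json()

    if result["status"] == "pending":
        # While pending, the response only carries the status field.
        time.sleep(10)  # assumed poll interval
        continue

    # Once the job completes, the response also includes the inferences field.
    # Each inference object matches a regular inference response.
    for inference in result.get("inferences", []):
        print(inference)
    break
```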
Technical Notes
- Observability
  - For now, pending batch inference jobs are not shown in the TensorZero UI. You can find the relevant information in the BatchRequest and BatchModelInference tables on ClickHouse (a hedged query sketch follows this list). See Data Model for more information.
  - Inferences from completed batch inference jobs are shown in the UI alongside regular inferences.
- Experimentation
  - The gateway samples the same variant for the entire batch.
- Python Client
  - The TensorZero Python client doesn’t natively support batch inference yet. You’ll need to submit batch requests using HTTP requests, as shown above.
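Finally, here is the query sketch referenced in the Observability note above. It assumes the clickhouse-connect package, a local ClickHouse instance with a database named tensorzero, and that both tables can be filtered by a batch_id column; check the Data Model documentation for the actual schema.

```python
import clickhouse_connect

# Assumptions: local ClickHouse, default credentials, and a database named
# "tensorzero"; adjust these to match your deployment.
client = clickhouse_connect.get_client(
    host="localhost", port=8123, database="tensorzero"
)

batch_id = "00000000-0000-0000-0000-000000000000"  # your batch_id here

# Assumption: both tables expose a batch_id column to filter on.
for table in ("BatchRequest", "BatchModelInference"):
    result = client.query(
        f"SELECT * FROM {table} WHERE batch_id = %(batch_id)s",
        parameters={"batch_id": batch_id},
    )
    print(table, result.result_rows)
```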