Learn how to use retries and fallbacks to handle errors and improve reliability with TensorZero.
List a model's providers in its `routing` field.
If the list includes multiple providers, the gateway will try each one sequentially until one succeeds or all fail.
In the example below, the gateway will first try OpenAI, and if that fails, it will try Azure.
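The sequential behavior described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the gateway's implementation: `infer_with_fallback`, `AllProvidersFailed`, and `flaky_call` are hypothetical names, and the provider call is a stand-in that simulates an OpenAI outage.

```python
from typing import Callable, Dict, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the routing list has failed."""


def infer_with_fallback(
    routing: Sequence[str],
    call_provider: Callable[[str], str],
) -> str:
    """Try each provider in order and return the first successful response."""
    errors: Dict[str, Exception] = {}
    for provider in routing:
        try:
            return call_provider(provider)
        except Exception as exc:  # a real system would narrow this
            errors[provider] = exc
    raise AllProvidersFailed(f"all providers failed: {errors}")


def flaky_call(provider: str) -> str:
    # Hypothetical stand-in: OpenAI is unavailable, Azure answers.
    if provider == "openai":
        raise ConnectionError("openai unavailable")
    return f"response from {provider}"


result = infer_with_fallback(["openai", "azure"], flaky_call)
print(result)  # response from azure
```

The order of the list is the order of the attempts, so the preferred provider goes first.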
Add a `retries` field to a variant to specify the number of times to retry that variant if it fails.
The retry strategy is a truncated exponential backoff with jitter.
In the example below, the gateway will retry the variant four times (i.e. a total of five attempts), with a maximum delay of 10 seconds between retries.
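To make "truncated exponential backoff with jitter" concrete, here is a minimal Python sketch of how such a delay schedule could be computed. The base delay of one second, the doubling factor, and the jitter range are assumptions chosen for illustration, and `backoff_delays` with its parameter names is hypothetical; the gateway's actual constants may differ.

```python
import random
from typing import List, Optional


def backoff_delays(
    num_retries: int,
    max_delay_s: float,
    base_delay_s: float = 1.0,  # assumed base delay
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one delay per retry: exponential growth, jittered, then capped."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(num_retries):
        exponential = base_delay_s * (2 ** attempt)
        jittered = exponential * rng.uniform(0.5, 1.5)  # assumed jitter range
        delays.append(min(jittered, max_delay_s))  # truncation at the cap
    return delays


# Four retries (five attempts in total), never waiting more than 10 seconds.
delays = backoff_delays(num_retries=4, max_delay_s=10.0)
print(len(delays))  # 4
```

The jitter spreads out retries from concurrent requests so they do not all hit a recovering provider at the same moment, and the cap keeps the worst-case wait bounded.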
`variant_name`.
The gateway will first sample among the variants with positive weights (`gpt_4o_mini` or `claude_3_5_haiku`).
If all of those variants fail, the gateway will sample and attempt the variants with unspecified weights (`gemini_1_5_flash_8b` or `ministral_8b`).
The gateway will never sample the variants with zero weights (`ministral_8b`), unless explicitly pinned at inference time.
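The tiered behavior described above (weighted variants first, then variants with unspecified weights, never zero-weight variants) can be sketched as follows. This is an illustrative model rather than the gateway's algorithm: the function `variant_attempt_order`, the specific weights, the sampling-without-replacement strategy, and the `disabled_variant` entry are all assumptions.

```python
import random
from typing import Dict, List, Optional


def variant_attempt_order(
    weights: Dict[str, Optional[float]],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the order in which variants would be attempted."""
    rng = rng or random.Random()
    weighted = {n: w for n, w in weights.items() if w is not None and w > 0}
    unweighted = [n for n, w in weights.items() if w is None]

    order: List[str] = []
    # Tier 1: weighted sampling without replacement among positive weights.
    while weighted:
        names = list(weighted)
        pick = rng.choices(names, weights=[weighted[n] for n in names])[0]
        order.append(pick)
        del weighted[pick]
    # Tier 2: variants with unspecified weights, in random order.
    rng.shuffle(unweighted)
    order.extend(unweighted)
    # Zero-weight variants are never added to the order.
    return order


order = variant_attempt_order(
    {
        "gpt_4o_mini": 0.7,           # hypothetical weight
        "claude_3_5_haiku": 0.3,      # hypothetical weight
        "gemini_1_5_flash_8b": None,  # unspecified weight: fallback only
        "disabled_variant": 0.0,      # hypothetical zero-weight variant
    }
)
print(order[2])  # gemini_1_5_flash_8b
```

In this sketch each entry in `order` is tried only after every earlier entry has failed, so a fallback-only variant never receives traffic while a weighted variant is healthy.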
The variants (`gpt_4o_mini_api_key_A` and `gpt_4o_mini_api_key_B`) each leverage a model with providers that use different API keys (`OPENAI_API_KEY_A` and `OPENAI_API_KEY_B`).
See Credential Management for more details.
Set timeouts using the `timeouts` field in the corresponding configuration block.
You can define timeouts for non-streaming and streaming requests separately: `timeouts.non_streaming.total_ms` corresponds to the total request duration and `timeouts.streaming.ttft_ms` corresponds to the time to first token (TTFT).
For example, the following configuration sets a 15-second timeout for non-streaming requests and a 3-second timeout for streaming requests (TTFT) to a particular model provider.
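The difference between the two timeout kinds can be illustrated with a small `asyncio` sketch. It is a conceptual model only: the helper names are hypothetical, and the sketch assumes that a TTFT timeout bounds only the wait for the first chunk and places no limit on the rest of the stream.

```python
import asyncio
from typing import AsyncIterator, Awaitable, List, TypeVar

T = TypeVar("T")


async def non_streaming_with_timeout(request: Awaitable[T], total_ms: int) -> T:
    """Bound the total duration of a non-streaming request."""
    return await asyncio.wait_for(request, timeout=total_ms / 1000)


async def streaming_with_ttft_timeout(
    stream: AsyncIterator[T], ttft_ms: int
) -> List[T]:
    """Bound only the wait for the first chunk; later chunks are not timed."""
    iterator = stream.__aiter__()
    first = await asyncio.wait_for(iterator.__anext__(), timeout=ttft_ms / 1000)
    chunks = [first]
    async for chunk in iterator:
        chunks.append(chunk)
    return chunks


async def slow_first_token() -> AsyncIterator[str]:
    # Hypothetical provider stream: the first token arrives after 200 ms.
    await asyncio.sleep(0.2)
    yield "hello"
    yield "world"


async def demo():
    # A 3-second TTFT budget is enough for this stream...
    ok = await streaming_with_ttft_timeout(slow_first_token(), ttft_ms=3000)
    # ...but a 50 ms budget is not.
    try:
        await streaming_with_ttft_timeout(slow_first_token(), ttft_ms=50)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return ok, timed_out


print(asyncio.run(demo()))  # (['hello', 'world'], True)
```

A TTFT timeout suits streaming because a long generation can legitimately take much longer than any reasonable total budget, while a slow first token usually signals a stuck provider.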
`timeout` field (or simply killing the request if you’re using a different client).
`routing` fallback applies to each individual model inference separately.