Usage
The TensorZero Gateway supports the following cache modes:off(default): Disable caching completelyon: Both read from and write to cachewrite_only: Only write to cache but don’t serve cached responsesread_only: Only read from cache but don’t write new entries
Example
Cache Backend
If ClickHouse is the primary data store for the gateway, we store cache data in ClickHouse. If Postgres is configured to be the primary data store for the gateway, and Valkey is available (i.e.TENSORZERO_VALKEY_URL is set), we store cache data in Valkey.
Valkey cache entries have a configurable TTL (time-to-live) that defaults to 24 hours (86400 seconds).
You can change this in tensorzero.toml:
tensorzero.toml
volatile-ttl eviction policy will correctly evict cache entries before rate limiting keys under memory pressure.
See Deploy Valkey / Redis for more details on eviction policies.
Technical Notes
- The cache applies to individual model requests, not inference requests. This means that the following will be cached separately: multiple variants of the same function; multiple calls to the same function with different parameters; individual model requests for inference-time optimizations; and so on.
- The
max_age_sparameter applies to the retrieval of cached responses. When using ClickHouse, old entries are not automatically deleted. When using Valkey, entries expire according to the configured TTL (cache.valkey.ttl_s). The default is 24h. - When the gateway serves a cached response, the usage fields are set to zero.
- For batch inference, the gateway only writes to the cache but does not serve cached responses.
- Inference caching also works for embeddings, using the same cache modes and options as chat completion inference. Caching works for single embeddings. Batch embedding requests (multiple inputs) will write to the cache but won’t serve cached responses.