- Resources: Define what you’re limiting (like model inferences or tokens) and the time window (per second, hour, day, week, or month). For example, “1000 model inferences per day” or “500,000 tokens per hour”.
- Priority: Control which rules take precedence when multiple rules could apply to the same request. Higher priority numbers override lower ones.
- Scope: Determine which requests the rule applies to. You can set global limits for all requests, or targeted limits using custom tags like user IDs.
Learn rate limiting concepts
Let’s start with a brief tutorial on the concepts behind custom rate limits in TensorZero. You can define custom rate limiting rules in your TensorZero configuration using `[[rate_limiting.rules]]`.
Your configuration can have multiple rules.
Resources
Each rate limiting rule can have one or more resource limits. A resource limit is defined using the `RESOURCE_per_WINDOW` syntax.
For example:
tensorzero.toml
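```toml
# A minimal sketch: one rule with two resource limits.
# The resources and limit values here are illustrative, not recommendations.
[[rate_limiting.rules]]
priority = 1
model_inferences_per_day = 1000
tokens_per_hour = 500_000
```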
You must specify `max_tokens` for a request if a token limit applies to it. The gateway makes a reasonably conservative estimate of token usage and later records the actual usage.

Scope
Each rate limiting rule can optionally have a scope. The scope restricts the rule to certain requests only. If you don’t specify a scope, the rule will apply to all requests.

Tags
At the moment, only user-defined `tags` are supported. You can limit the scope to specific values, to each individual value (`tensorzero::each`), or to every value collectively (`tensorzero::total`).
For example, the following rule would only apply to inference requests with the tag `user_id` set to `intern`:
tensorzero.toml
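```toml
# A minimal sketch; the limit value is illustrative, and the scope syntax
# assumes the tag_key/tag_value fields described below.
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 10

[[rate_limiting.rules.scope]]
tag_key = "user_id"
tag_value = "intern"
```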
Similarly, the following rule would only apply to inference requests with the tag `user_id` set to `intern` and the tag `env` set to `production`:
tensorzero.toml
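```toml
# A minimal sketch with two scope conditions; a request must match both.
# The limit value is illustrative.
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 10

[[rate_limiting.rules.scope]]
tag_key = "user_id"
tag_value = "intern"

[[rate_limiting.rules.scope]]
tag_key = "env"
tag_value = "production"
```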
Tags support two special strings for `tag_value`:

- `tensorzero::each`: The rule independently applies to every `tag_key` value.
- `tensorzero::total`: The limits are summed across all values of the tag.
For example, the following rule applies to every `user_id` tag individually (i.e. each user gets their own limit):
tensorzero.toml
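```toml
# A minimal sketch using tensorzero::each: every user_id value gets its own
# independent limit. The limit value is illustrative.
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 10

[[rate_limiting.rules.scope]]
tag_key = "user_id"
tag_value = "tensorzero::each"
```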
The rule above does not apply to requests that do not specify any `user_id` value.

Priority
Each rate limiting rule must have a priority (e.g. `priority = 1`).
The gateway iterates through the rules in order of priority, starting with the highest priority, until it finds a matching rate limit; once it does, it enforces all rules with that priority number and disregards any lower-priority rules.
For example, the configuration below would enforce the first rule for requests with `user_id = "intern"` and the second rule for all other `user_id` values:
tensorzero.toml
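```toml
# A minimal sketch; limit values are illustrative.
# Higher priority numbers win, so the "intern" rule overrides the general one.
[[rate_limiting.rules]]
priority = 2
model_inferences_per_minute = 5

[[rate_limiting.rules.scope]]
tag_key = "user_id"
tag_value = "intern"

[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 100

[[rate_limiting.rules.scope]]
tag_key = "user_id"
tag_value = "tensorzero::each"
```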
You can also set `always = true` to enforce a rule regardless of other rules; rules with `always = true` do not affect the priority calculation above.
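For instance, an always-enforced global cap might look like the sketch below (whether `always = true` stands in for `priority` is an assumption here, as is the limit value):

```toml
# Sketch: a global cap enforced no matter which other rules match.
# Assumes `always = true` can replace `priority`; the limit is illustrative.
[[rate_limiting.rules]]
always = true
tokens_per_day = 10_000_000
```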
Set up rate limits
Let’s set up rate limits for an application to restrict usage depending on a user-defined tag for user IDs. You can find a complete runnable example of this guide on GitHub.
1. Set up Postgres

You must set up Postgres to use TensorZero’s rate limiting features. See the Deploy Postgres guide for instructions.
2. Configure rate limiting rules

Add the following to your TensorZero configuration, and make sure to reload your gateway afterward:
config/tensorzero.toml
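```toml
# A minimal sketch consistent with the walkthrough below; the exact
# resources, windows, and limit values are illustrative.

# [A] Fallback limit for requests that no other rule matches
# (e.g. requests without a user_id tag).
[[rate_limiting.rules]]
priority = 0
model_inferences_per_minute = 100

# [B] Each user_id is limited to 1 model inference per minute.
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 1

[[rate_limiting.rules.scope]]
tag_key = "user_id"
tag_value = "tensorzero::each"

# [C] The CEO gets a higher limit; the higher priority overrides rule [B].
[[rate_limiting.rules]]
priority = 2
model_inferences_per_minute = 100

[[rate_limiting.rules.scope]]
tag_key = "user_id"
tag_value = "ceo"
```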
3. Make inference requests

If we make two consecutive inference requests with `user_id = "intern"`, the second one should fail because of rule `[B]`. However, if we make two consecutive inference requests with `user_id = "ceo"`, the second one should succeed because rule `[C]` will override rule `[B]`.

- Python (TensorZero SDK)
- Python (OpenAI SDK)
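A sketch of the TensorZero SDK variant (the gateway URL, model name, and message content are placeholders; the complete runnable example lives in the GitHub repository linked above):

```python
# A minimal sketch using the TensorZero Python client.
# Gateway URL, model name, and messages are illustrative placeholders.
from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    for attempt in (1, 2):
        try:
            response = client.inference(
                model_name="openai::gpt-4o-mini",
                input={"messages": [{"role": "user", "content": "Hello!"}]},
                tags={"user_id": "intern"},
            )
            print(f"Request {attempt} succeeded: {response.inference_id}")
        except Exception as e:
            # With the rules above, the second request should be rejected by rule [B].
            print(f"Request {attempt} failed: {e}")
```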
Advanced
Customize capacity and refill rate
By default, rate limits use a simple bucket model where the entire capacity refills at the start of each time window. For example, `tokens_per_minute = 100_000` allows 100,000 tokens every minute, with the full allowance resetting at the top of each minute.
However, you can customize this behavior using the `capacity` and `refill_rate` parameters to create a token bucket that refills continuously:
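A sketch of what this might look like (the inline-table form for `capacity` and `refill_rate` is an assumption; the values match the description below):

```toml
# Sketch: a continuously refilling token bucket. Assumes capacity and
# refill_rate attach to the resource limit as an inline table.
[[rate_limiting.rules]]
priority = 1
tokens_per_minute = { capacity = 100_000, refill_rate = 10_000 }
```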
The `capacity` parameter sets the maximum number of tokens that can be stored in the bucket, while the `refill_rate` determines how many tokens are added to the bucket per time window (here, 10,000 per minute).
This creates smoother rate limiting behavior: instead of getting your full allowance at the start of each minute, you get 10,000 tokens added every minute, up to a maximum of 100,000 tokens stored at any time.
To achieve these benefits, you’ll typically want to use a short time window with a `capacity` much larger than the `refill_rate`.
This approach is particularly useful for burst protection (users can’t consume their entire daily allowance in the first few seconds), smoother traffic distribution (requests are naturally spread out over time rather than clustering at window boundaries), and a better user experience (users get a steady trickle of quota rather than having to wait for the next time window).