See Performance & Latency Tips for more details on maximizing performance in production settings.
TensorZero Gateway vs. LiteLLM
- TensorZero achieves sub-millisecond latency overhead even at 10,000 QPS.
- LiteLLM degrades at hundreds of QPS and fails entirely at 1,000 QPS.
On a `c7i.xlarge` instance on AWS (4 vCPUs, 8 GB RAM), LiteLLM fails when concurrency reaches 1,000 QPS, with the vast majority of requests timing out.
The TensorZero Gateway handles 10,000 QPS on the same instance with a 100% success rate and sub-millisecond latencies.
Even at low loads where LiteLLM is stable (100 QPS), TensorZero at 10,000 QPS achieves significantly lower latencies.
Building in Rust (TensorZero) leads to consistent sub-millisecond latency overhead even under extreme load, whereas Python (LiteLLM) becomes a bottleneck even at moderate loads.
Latency Comparison
| Latency | LiteLLM Proxy (100 QPS) | LiteLLM Proxy (500 QPS) | LiteLLM Proxy (1,000 QPS) | TensorZero Gateway (10,000 QPS) |
| --- | --- | --- | --- | --- |
| Mean | 4.91ms | 7.45ms | Failure | 0.37ms |
| 50% | 4.83ms | 5.81ms | Failure | 0.35ms |
| 90% | 5.26ms | 10.02ms | Failure | 0.50ms |
| 95% | 5.41ms | 13.40ms | Failure | 0.58ms |
| 99% | 5.87ms | 39.69ms | Failure | 0.94ms |
- We use a `c7i.xlarge` instance on AWS (4 vCPUs, 8 GB RAM) running Ubuntu 24.04.2 LTS.
- We use a mock OpenAI inference provider for both benchmarks.
- The load generator, both gateways, and the mock inference provider all run on the same instance.
- We configured `observability.enabled = false` (i.e. disabled logging inferences to ClickHouse) in the TensorZero Gateway to make the scenarios comparable; see the configuration sketch after this list. (Even then, the observability features run asynchronously in the background, so they wouldn't materially affect latency given a powerful enough ClickHouse deployment.)
- The most recent benchmark run was conducted on July 30, 2025. It used TensorZero `2025.5.7` and LiteLLM `1.74.9`.
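
For reference, the observability toggle mentioned above is set in the gateway's TOML configuration. The snippet below is a minimal sketch assuming the standard `tensorzero.toml` layout, and it only shows the setting relevant to this benchmark.

```toml
# tensorzero.toml — minimal sketch (assumes the standard [gateway] section layout)

[gateway]
# Disable logging inferences to ClickHouse so the comparison with LiteLLM is apples-to-apples.
# Observability writes normally run asynchronously in the background, so re-enabling this
# shouldn't materially affect latency with an adequately provisioned ClickHouse deployment.
observability.enabled = false
```

In a typical production deployment you would leave observability enabled, since inference logging happens asynchronously; it was disabled here only to keep the two gateways comparable.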