This guide shows how to set up a minimal deployment to use the TensorZero Gateway with self-hosted LLMs using SGLang.
We’re using Llama-3.1-8B-Instruct in this example, but you can use virtually any model supported by SGLang.
Setup
This guide assumes that you are running SGLang locally with this command (see SGLang’s installation guide):
# --shm-size: shared memory size, needed for loading large models and processing requests
# -v: mounts the host's ~/.cache/huggingface directory to the container's /root/.cache/huggingface directory
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
Make sure the api_base in the configuration below matches the host and port of your SGLang server (the command above publishes port 30000).
For this minimal setup, you’ll need just two files in your project directory:
- config/
  - tensorzero.toml
- docker-compose.yml
You can also find the complete code for this example on GitHub.
For production deployments, see our Deployment Guide.
Configuration
Create a minimal configuration file that defines a model and a simple chat function:
[models.llama]
routing = ["sglang"]
[models.llama.providers.sglang]
type = "sglang"
api_base = "http://host.docker.internal:30000/v1/" # for SGLang running locally on the host; the port must match your SGLang server
api_key_location = "none" # by default, SGLang requires no API key
model_name = "my-sglang-model"
[functions.my_function_name]
type = "chat"
[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "llama"
Credentials
The api_key_location field in your model provider configuration specifies how to handle API key authentication:
- If your endpoint does not require an API key (e.g. SGLang by default):
  api_key_location = "none"
- If your endpoint requires an API key, you have two options:
  - Configure it in advance through an environment variable:
    api_key_location = "env::ENVIRONMENT_VARIABLE_NAME"
    You’ll need to set the environment variable before starting the gateway.
  - Provide it at inference time:
    api_key_location = "dynamic::ARGUMENT_NAME"
    The API key can then be passed in the inference request.
See the Credential Management guide, the Configuration Reference, and the API reference for more details.
In this example, SGLang is running locally without authentication, so we use api_key_location = "none".
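To make the three forms concrete, here is a small illustrative sketch of how such a setting could be interpreted. It is not the gateway's implementation: `resolve_api_key` and its `env` and `dynamic_credentials` parameters are hypothetical names introduced only for this example.

```python
import os


def resolve_api_key(location, env=None, dynamic_credentials=None):
    """Hypothetical helper mirroring the three api_key_location forms.

    Illustration only; the TensorZero Gateway handles this internally.
    """
    env = os.environ if env is None else env
    dynamic_credentials = dynamic_credentials or {}

    # "none": the endpoint needs no API key.
    if location == "none":
        return None

    kind, separator, name = location.partition("::")
    if not separator or not name:
        raise ValueError(f"unrecognized api_key_location: {location!r}")

    # "env::NAME": read the key from an environment variable set in advance.
    if kind == "env":
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    # "dynamic::NAME": the key is supplied with the inference request.
    if kind == "dynamic":
        if name not in dynamic_credentials:
            raise KeyError(f"credential {name} was not provided in the request")
        return dynamic_credentials[name]

    raise ValueError(f"unrecognized api_key_location: {location!r}")


print(resolve_api_key("none"))
print(resolve_api_key("env::SGLANG_API_KEY", env={"SGLANG_API_KEY": "example-key"}))
```

The sketch fails loudly when a referenced variable or credential is missing, which matches the note above that the environment variable must be set before the gateway starts.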
Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/deployment/tensorzero-gateway
services:
  gateway:
    image: tensorzero/gateway
    volumes:
      - ./config:/app/config:ro
    command: --config-file /app/config/tensorzero.toml
    # environment:
    #   SGLANG_API_KEY: ${SGLANG_API_KEY:?Environment variable SGLANG_API_KEY must be set.}
    ports:
      - "3000:3000"
    extra_hosts:
      - "host.docker.internal:host-gateway"
You can start the gateway with docker compose up.
Inference
Make an inference request to the gateway:
curl -X POST http://localhost:3000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tensorzero::function_name::my_function_name",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of Japan?"
      }
    ]
  }'
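The same request can be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` are hypothetical, and reading the reply from `choices[0].message.content` assumes the endpoint returns an OpenAI-style chat completion body.

```python
import json
import urllib.request

# Same URL as the curl example above.
GATEWAY_URL = "http://localhost:3000/openai/v1/chat/completions"


def build_chat_request(user_message, function_name="my_function_name"):
    """Build the POST request for a TensorZero function (hypothetical helper)."""
    payload = {
        "model": f"tensorzero::function_name::{function_name}",
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires the gateway and SGLang to be running as described above.
    reply = send(build_chat_request("What is the capital of Japan?"))
    # Assumes an OpenAI-style response shape.
    print(reply["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running gateway.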