Getting Started with vLLM
This guide shows how to set up a minimal deployment to use the TensorZero Gateway with self-hosted LLMs using vLLM.
We’re using Llama 3.1 in this example, but you can use virtually any model supported by vLLM.
Setup
This guide assumes that you are running vLLM locally with vllm serve meta-llama/Llama-3.1-8B-Instruct.
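If you don't already have vLLM running, a minimal local setup might look like the following (this assumes a CUDA-capable GPU and access to the Llama 3.1 weights on Hugging Face):

```bash
pip install vllm
# Serves an OpenAI-compatible API on port 8000 by default.
vllm serve meta-llama/Llama-3.1-8B-Instruct
```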
Make sure to update the api_base and model_name in the configuration below to match your vLLM server and model.
For this minimal setup, you’ll need just two files in your project directory:
- config/
  - tensorzero.toml
- docker-compose.yml
For production deployments, see our Deployment Guide.
Configuration
Create a minimal configuration file that defines a model and a simple chat function:
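The exact contents depend on your setup; here is a minimal sketch, where the model, function, and variant names are placeholders and the api_base assumes vLLM's default port (8000):

```toml
# config/tensorzero.toml

[models.llama3_1_8b_instruct]
routing = ["vllm"]

[models.llama3_1_8b_instruct.providers.vllm]
type = "vllm"
model_name = "meta-llama/Llama-3.1-8B-Instruct"
# Adjust to point at your vLLM server; host.docker.internal reaches the
# host machine from inside the gateway container.
api_base = "http://host.docker.internal:8000/v1"
# vLLM does not require an API key by default (see Credentials below).
api_key_location = "none"

[functions.my_function_name]
type = "chat"

[functions.my_function_name.variants.my_variant_name]
type = "chat_completion"
model = "llama3_1_8b_instruct"
```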
Credentials
The api_key_location field in your model provider configuration specifies how to handle API key authentication:
- If your endpoint does not require an API key (e.g. vLLM by default), set api_key_location = "none".
- If your endpoint requires an API key, you have two options (both shown in the sketch after this list):
  - Configure it in advance through an environment variable. You'll need to set the environment variable before starting the gateway.
  - Provide it at inference time. The API key can then be passed in the inference request.
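As a rough sketch, the api_key_location field in the provider block can take any of the following forms; the environment variable and dynamic credential names here are placeholders:

```toml
[models.llama3_1_8b_instruct.providers.vllm]
# ... other provider fields (type, model_name, api_base) ...

# No authentication (e.g. a local vLLM server):
api_key_location = "none"

# Read the key from an environment variable when the gateway starts
# (VLLM_API_KEY is a placeholder name):
# api_key_location = "env::VLLM_API_KEY"

# Accept the key in the inference request itself
# (vllm_api_key is a placeholder credential name):
# api_key_location = "dynamic::vllm_api_key"
```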
See the Configuration Reference and the API reference for more details.
In this example, vLLM is running locally without authentication, so we use api_key_location = "none".
Deployment (Docker Compose)
Create a minimal Docker Compose configuration:
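The details depend on your environment; here is a minimal sketch assuming the standard tensorzero/gateway image, the config directory from above, and a vLLM server running on the host:

```yaml
# docker-compose.yml
services:
  gateway:
    image: tensorzero/gateway
    volumes:
      # Mount the configuration directory created above (read-only).
      - ./config:/app/config:ro
    command: --config-file /app/config/tensorzero.toml
    ports:
      - "3000:3000"
    extra_hosts:
      # Lets the container reach the vLLM server running on the host.
      - "host.docker.internal:host-gateway"
```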
You can start the gateway with docker compose up.
Inference
Make an inference request to the gateway:
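For example, with Python and the requests library (the port and function name below assume the sketches above):

```python
import requests

# The gateway listens on port 3000 in the Docker Compose sketch above;
# function_name must match a function defined in tensorzero.toml.
response = requests.post(
    "http://localhost:3000/inference",
    json={
        "function_name": "my_function_name",
        "input": {
            "messages": [
                {"role": "user", "content": "What is the capital of Japan?"}
            ]
        },
    },
)
response.raise_for_status()
print(response.json())
```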