Usage
We provide a tensorzero/evaluations Docker image for easy usage.
We strongly recommend using the TensorZero Evaluations CLI with Docker Compose to keep things simple.
docker-compose.yml
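A minimal sketch of what such a file might look like — the service name, volume mount, and environment wiring below are assumptions based on this page, not the official Compose file:

```yaml
# Sketch only: service name, volume path, and environment layout are assumptions.
services:
  evaluations:
    image: tensorzero/evaluations
    volumes:
      - ./config:/app/config:ro  # assumed location of tensorzero.toml
    environment:
      - TENSORZERO_CLICKHOUSE_URL=${TENSORZERO_CLICKHOUSE_URL:?}
      # Model provider credentials are only needed with the built-in gateway:
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
```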
Building from Source
You can build the TensorZero Evaluations CLI from source if necessary. See our GitHub repository for instructions.
Inference Caching
TensorZero Evaluations uses Inference Caching to improve inference speed and cost. By default, it will read from and write to the inference cache. Soon, you’ll be able to customize this behavior.
Environment Variables
TENSORZERO_CLICKHOUSE_URL
- Example: TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@localhost:8123/database_name
- Required: yes
Model Provider Credentials
- Example: OPENAI_API_KEY=sk-...
- Required: no
If you’re using an external gateway (see the --gateway-url flag below), you don’t need to provide these credentials to the evaluations tool.
If you’re using the built-in gateway (no --gateway-url flag), you must provide the same credentials the gateway would use.
See Integrations for more information.
CLI Flags
--adaptive-stopping-precision EVALUATOR=PRECISION[,...]
- Example: --adaptive-stopping-precision exact_match=0.13,llm_judge=0.16
- Required: no (default: none)
--config-file PATH
- Example: --config-file /path/to/tensorzero.toml
- Required: no (default: ./config/tensorzero.toml)
--concurrency N (-c)
- Example: --concurrency 5
- Required: no (default: 1)
--datapoint-ids ID[,ID,...]
- Example: --datapoint-ids 01957bbb-44a8-7490-bfe7-32f8ed2fc797,01957bbb-44a8-7490-bfe7-32f8ed2fc798
- Required: either --dataset-name or --datapoint-ids must be provided (but not both)
This flag is mutually exclusive with --dataset-name and --max-datapoints. You must provide either --dataset-name or --datapoint-ids, but not both.
--dataset-name NAME (-d)
- Example: --dataset-name my_dataset
- Required: either --dataset-name or --datapoint-ids must be provided (but not both)
This flag is mutually exclusive with --datapoint-ids. You must provide either --dataset-name or --datapoint-ids, but not both.
--evaluation-name NAME (-e)
- Example: --evaluation-name my_evaluation
- Required: yes
--format FORMAT (-f)
- Options: pretty, jsonl
- Example: --format jsonl
- Required: no (default: pretty)
Use the jsonl format if you want to programmatically process the evaluation results.
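Since each line of jsonl output is a standalone JSON object, standard line-oriented tools work on the results file. A hedged sketch — the docker compose service name and the idea of redirecting stdout to a file are assumptions, not documented behavior:

```shell
# Sketch: capture jsonl results to a file for downstream processing.
# Service name ("evaluations") and redirection approach are assumptions.
docker compose run --rm evaluations \
  --evaluation-name my_evaluation \
  --dataset-name my_dataset \
  --variant-name gpt_4o \
  --format jsonl > results.jsonl

# One JSON object per line, so line-oriented tools apply:
wc -l < results.jsonl   # number of result records
```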
--gateway-url URL
- Example: --gateway-url http://localhost:3000
- Required: no (default: none)
--inference-cache MODE
- Options: on, read_only, write_only, off
- Example: --inference-cache read_only
- Required: no (default: on)
--max-datapoints N
- Example: --max-datapoints 100
- Required: no
This flag can only be used with --dataset-name. It cannot be used with --datapoint-ids.
--variant-name NAME (-v)
- Example: --variant-name gpt_4o
- Required: yes
Exit Status
The evaluations process exits with a status code of 0 if the evaluation was successful, and a status code of 1 if the evaluation failed.
If you configure a cutoff for any of your evaluators, the evaluation will fail if the average score for any evaluator is below its cutoff.
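Because success and failure are reported through the exit status, the tool composes naturally with shell conditionals and CI pipelines. A sketch, assuming the Docker Compose service name from earlier on this page:

```shell
# Sketch: gate a CI step on the evaluation's exit status.
# The service name ("evaluations") is an assumption.
if docker compose run --rm evaluations \
    --evaluation-name my_evaluation \
    --dataset-name my_dataset \
    --variant-name gpt_4o; then
  echo "evaluation passed"
else
  echo "evaluation failed (an evaluator may have scored below its cutoff)"
fi
```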