We're building an automated AI engineer, and it works

· Viraj Mehta, Gabriel Bianconi

We’re building an automated AI engineer that dramatically improves performance of LLM agents on every single benchmark we’ve tried.

TensorZero Autopilot is an automated AI engineer that analyzes LLM observability data, sets up evals, optimizes prompts and models, and runs A/B tests.

It’s powered by our open-source LLMOps platform that unifies an LLM gateway, observability, optimization, evaluation, and experimentation. The open-source project is used by companies ranging from frontier AI startups to the Fortune 10 and fuels ~1% of global LLM API spend today.

To put it to the test, we ran TensorZero Autopilot across diverse LLM tasks (e.g. medicine, law, science). The workflow typically involved analyzing historical data, creating evaluations (e.g. LLM judges), experimenting with different prompts and models, selecting the most promising variants, and setting up an adaptive A/B test to converge on the best one.
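The "adaptive A/B test" step can be pictured as a multi-armed bandit over candidate variants. Here is a minimal Thompson-sampling sketch, not TensorZero's actual implementation; the variant names and true success rates are invented for illustration:

```python
import random

# Hypothetical variants with hidden true success rates (invented numbers).
TRUE_RATES = {"baseline": 0.40, "variant-a": 0.63, "variant-b": 0.62}

def thompson_ab_test(n_rounds: int, seed: int = 0) -> dict:
    """Adaptive A/B test: sample each variant's Beta posterior,
    route the next rollout to the argmax, update with the reward."""
    rng = random.Random(seed)
    # Beta(1, 1) prior per variant, stored as [successes + 1, failures + 1].
    posterior = {v: [1, 1] for v in TRUE_RATES}
    pulls = {v: 0 for v in TRUE_RATES}
    for _ in range(n_rounds):
        # Draw a plausible success rate for each variant from its posterior.
        samples = {v: rng.betavariate(a, b) for v, (a, b) in posterior.items()}
        chosen = max(samples, key=samples.get)
        pulls[chosen] += 1
        # Simulate a binary reward from the hidden true rate.
        reward = rng.random() < TRUE_RATES[chosen]
        posterior[chosen][0 if reward else 1] += 1
    return pulls

pulls = thompson_ab_test(2000)
# Traffic concentrates on the stronger variants as evidence accumulates.
```

The appeal of the adaptive design is that weak variants stop receiving traffic automatically, so convergence costs far fewer rollouts than a fixed uniform split.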


We evaluated TensorZero Autopilot on a range of tasks:

| Task | Environment |
| --- | --- |
| Software Engineering | terminal-bench@2.0 |
| Customer Service | tau-bench |
| Data Extraction | CoNLL++ NER |
| Medicine | MedAgentBench |
| Law (Chinese) | LawBench |
| Science (Astrophysics) | ReplicationBench |
| Interactive Reasoning | LLM Gym (21 Questions) |

For each task, we collected 100 rollouts with a GPT-5 mini baseline using TensorZero. We then ran TensorZero Autopilot with 5 independent seeds and evaluated the optimized variants it generated with 100 fresh rollouts on held-out tasks. You can find the code for this experiment on GitHub.

Figure: for each environment, 100 rollouts with the GPT-5 mini baseline feed TensorZero Autopilot (analyze historical data, create evaluations such as LLM judges, experiment with prompts & models, set up an adaptive A/B test); this runs across 5 seeds, each evaluated with 100 rollouts of the optimized variants.

For this experiment, we constrained Autopilot to a fixed set of models of comparable cost and disabled custom model training (e.g. fine-tuning):

| Model | Provider |
| --- | --- |
| GPT-5 mini | OpenAI |
| Claude Haiku 4.5 | Anthropic |
| Gemini 3 Flash Preview | Google AI Studio |
| GLM-5 | Fireworks AI |
| Kimi K2.5 | Fireworks AI |
| MiniMax M2.5 | Fireworks AI |

Here are the results:

Bar chart showing baseline vs. optimized scores across diverse LLM tasks
| Task | Baseline | TensorZero Autopilot | % Change |
| --- | --- | --- | --- |
| Software Engineering | 0.404 | 0.625 ± 0.033 | +54.7% |
| Customer Service (Airline) | 0.343 | 0.506 ± 0.124 | +47.5% |
| Customer Service (Retail) | 0.388 | 0.401 ± 0.055 | +3.4% |
| Data Extraction | 0.110 | 0.784 ± 0.041 | +612.7% |
| Medicine | 0.182 | 0.577 ± 0.059 | +217.0% |
| Law (Chinese) | 0.532 | 0.614 ± 0.053 | +15.4% |
| Science (Astrophysics) | 0.237 | 0.340 ± 0.0334 | +43.5% |
| Interactive Reasoning | 0.449 | 0.637 ± 0.053 | +41.9% |

TensorZero Autopilot dramatically improved performance across every task.
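As a sanity check on the table, the % Change column is simply the relative improvement over the baseline score:

```python
def pct_change(baseline: float, optimized: float) -> float:
    """Relative improvement over the baseline score, in percent."""
    return (optimized - baseline) / baseline * 100

# Software Engineering row from the table above.
assert round(pct_change(0.404, 0.625), 1) == 54.7
# Data Extraction row.
assert round(pct_change(0.110, 0.784), 1) == 612.7
```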


What actually happened under the hood?

Software Engineering — terminal-bench

terminal-bench presents diverse coding, debugging, and system tasks in containerized Linux environments. The baseline scored 0.404 reward.

Autopilot identified four baseline failure modes:

Observed likely failure modes: premature submit_solution, weak environment checks (python3 missing etc.), overplanning or brittle single-path execution, timeout misuse.

The best variant was glm5-agent (GLM-5, 0.637 reward) — a ~58% improvement over baseline and, notably, slightly better than the published number with the same harness and a vanilla prompt. The updated prompt provided a structured process with concrete shell patterns:

## Process
1. Explore: ls -la, read relevant files
2. Plan: Understand what needs to be done
3. Execute: Run commands, check outputs
4. Fix: Address errors by reading them carefully
5. Verify: Confirm solution is correct
6. Submit: Call submit_solution()

## Shell Patterns
# Edit file
cat > /file << 'EOF'
content
EOF

sed -i 's/old/new/g' file

## Rules
- Max command timeout: 120s
- Chain commands with &&
- Read full error messages before fixing
- Never submit without testing

kimi-debug-v1 (Kimi K2.5, 0.632 reward) was a strong second, emphasizing a debugging-oriented approach:

- Read the task carefully and identify the exact artifact or behavior required.
- Gather evidence from the filesystem and relevant source/config files.
- Test assumptions before edits.
- After a failure, explain briefly with think() only if non-obvious,
  then try a different path.
- Do not submit_solution() without final evidence.

gpt5mini-v2-detailed (GPT-5 mini, 0.552 reward) also beat the baseline, but not the open-source models: GLM-5 and Kimi K2.5 both outperformed GPT-5 mini on this task. We believe this reflects that GPT-5 mini is an older model in an extremely fast-moving field, and that the other models are likely larger than GPT-5 mini and distilled from the outputs of far stronger models.

Customer Service (Airline) — tau-bench

This environment simulates an airline customer service agent handling bookings, modifications, cancellations, and compensation — all governed by strict policy rules. The baseline scored 0.343 success rate.

The best post-autopilot variants were gemini-fastflow-v1 (Gemini 3 Flash, 0.500 success) and claude-haiku-cot (Claude Haiku 4.5, 0.500 success). The gemini-fastflow-v1 prompt preserved the original policy structure but added an explicit workflow:

Preferred workflow:
- Clarify the goal.
- Get the minimal missing info.
- Verify facts with read-only tools.
- Check restrictions before proposing action.
- Present exact action summary and ask for yes.
- Execute the confirmed action.
- State result clearly.

It also encoded detailed policy rules for each operation type — booking (max 5 passengers, payment method limits, free bag allowances by tier), modification (basic economy can’t change flights, cabin changes whole-reservation only), cancellation (24h window, insurance requirements), and compensation (eligibility by tier, certificate amounts).

Model diversity was the primary lever: keeping the well-calibrated original prompt and swapping models yielded better results than rewriting the prompt. This is a case where the task’s complexity lives in the policy rules, and Autopilot correctly identified that the baseline prompt already encoded them well.

Customer Service (Retail) — tau-bench

This environment simulates a retail customer service agent handling orders, returns, exchanges, and modifications. The baseline scored 0.388 success rate.

Autopilot identified specific failure patterns in the baseline:

The agent sometimes treats unsupported operations as supported, it can be too verbose before clarifying, item-option changes need stronger guidance to fetch valid replacement item_ids instead of relying on user-provided internal ids, duplicate-item disambiguation needs to be stricter.

The best variant was gemini-flash-concise (Gemini 3 Flash, 0.47 success rate). Its prompt addressed the failure modes directly:

## CRITICAL: Modify/Exchange Items = ONE CALL ONLY
modify_pending_order_items and exchange_delivered_order_items can only
be called once per order.
- Use get_product_details to find correct new item_ids first
- Collect ALL items to change before calling
- Explicitly ask user: "Is this the complete list of all items to modify?"
- Make one combined call with all changes

## Order Actions by Status
- pending: cancel | modify address | modify payment | modify items (once)
- delivered: return (once) | exchange (once)
- processed/cancelled: no actions

## Key Rules
- product_id ≠ item_id (never confuse them)
- Return/refund: must go to original payment OR existing gift card
- Exchange: same product type only (no product-type changes)

gemini-flash-enhanced (Gemini 3 Flash, 0.43 success rate) was also strong. Gemini 3 Flash dominated this environment. The key improvement was encoding the “one call only” constraint for modifications and exchanges — the baseline agent would often make multiple calls, causing failures.
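The "one call only" constraint amounts to batching: accumulate every requested change, confirm the full list with the user, then issue a single combined tool call. A minimal sketch with invented function signatures and IDs (the real tau-bench tools take more arguments):

```python
def modify_pending_order_items_once(order_id: str, changes: list[dict]) -> dict:
    """Stand-in for the one-shot modification tool: callable once per order."""
    return {"order_id": order_id, "modified": changes}

def collect_then_modify(order_id: str, requested_changes: list[dict]) -> dict:
    # Anti-pattern: one tool call per change (the second call would fail,
    # since the tool is callable only once per order).
    # Correct pattern: accumulate all changes, confirm the complete list
    # with the user, then make exactly one combined call.
    confirmed = list(requested_changes)  # after "is this the complete list?"
    return modify_pending_order_items_once(order_id, confirmed)

result = collect_then_modify("W123", [
    {"old_item_id": "1001", "new_item_id": "1002"},
    {"old_item_id": "2001", "new_item_id": "2002"},
])
```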

Data Extraction — CoNLL++ NER

The baseline GPT-5 mini prompt produced only 11% exact match on this NER task. Autopilot’s key diagnostic finding was a surprise: the ground-truth labels in our dataset include idiosyncratic annotation decisions, and the stricter prompts were intentionally removing exactly the quirks those labels reward:

The stored labels are noisy and include things my stricter prompts intentionally removed, which hurts exact_match. The winning strategy is therefore to better mimic the dataset’s annotation quirks, not enforce cleaner NER.

The best-performing variants focused on matching the dataset’s conventions rather than enforcing “correct” NER. Multiple variants achieved very high exact match (>0.8) across seeds: gpt5-strict-v1, kimi-conservative-v1, claude-examples-v1, gemini-sports-aware-v1, and gemini-strict-v1.

The winning gpt5-strict-v1 variant (GPT-5 mini) included rules tailored to sports-heavy CoNLL++ data:

Rules:
- Use exact surface spans from the text.
- Do not infer, normalize, abbreviate, or translate.
- Do not split a single entity into subspans.
- Do not include non-entity numbers, scores, dates, times, rankings,
  or generic temporal phrases.
- Sports teams, clubs, companies, agencies, newspapers → organization.
- Cities, countries, regions, venues → location.
- Nationalities and demonyms like British, German, French → miscellaneous,
  not location, unless the text names the country itself.
- If a token sequence could be both organization and location,
  prefer the reading supported by local context;
  in sports score lines, team names are organizations.

The gemini-sports-aware-v1 variant (Gemini 3 Flash) took an even more concise approach:

Rules optimized for exact extraction on sports/news text:
- Copy spans exactly from input.
- Team names and club names are organization.
- Do not also emit the city tokens inside a team/club name as location
  unless the place is mentioned independently.
- Do not emit scores, dates, or rankings as entities.
- If uncertain, omit the candidate.

Model choice mattered less here than prompt design — GPT-5 mini, Gemini 3 Flash, Claude Haiku 4.5, and even Kimi K2.5 all achieved very high scores with the right prompt. The key insight was that the rules needed to match annotation conventions (e.g., demonyms → miscellaneous, team-embedded city names → organization only), not impose “cleaner” NER.
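Exact match is unforgiving here: if it is scored as set equality over (span, label) pairs — an assumption on our part, with invented example entities — a single normalized demonym or split team name zeroes out the whole example:

```python
def exact_match(pred: set[tuple[str, str]], gold: set[tuple[str, str]]) -> bool:
    """Example-level exact match: every (span, label) pair must agree."""
    return pred == gold

# Gold labels follow the dataset's conventions: team name as one
# organization span, demonym tagged miscellaneous.
gold = {("Manchester United", "organization"), ("British", "miscellaneous")}

# "Cleaner" NER splits the city out of the team name and drops the
# demonym's miscellaneous tag, so exact match fails despite being
# arguably more correct.
cleaner = {("Manchester", "location"), ("United", "organization")}

assert exact_match(gold, gold)
assert not exact_match(cleaner, gold)
```

This is why mimicking annotation quirks beat enforcing textbook NER: the metric rewards agreement with the labels, not linguistic correctness.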

Medicine — MedAgentBench

This environment requires an agent to query a FHIR (healthcare interoperability) server, parse clinical data, and submit answers — all within a limited number of turns. The baseline scored 0.182 reward. Autopilot identified five concrete failure patterns:

Key failure patterns:

1. Timeout misuse — the agent used timeout=120000 (invalid, max is 120).
2. Excessive planning — multiple redundant plan() and think() calls before acting.
3. Poor FHIR query structure — missing _count=1, _sort=-date, and date window filters.
4. No file piping — results weren’t saved to /tmp/*.json for reliable parsing.
5. Missing conditional logic guidance — the agent wasn’t clear on how to handle ‘no results’ cases.

The winning variant was kimi-k2-agent (Kimi K2.5, 0.585 reward) — more than tripling baseline performance. Its prompt embedded complete FHIR query patterns and a strict execution workflow:

## FHIR Query Strategy
# Patient lookup
medagentbench_cli.py get ".../Patient?identifier=<MRN>"
# Most recent observation
medagentbench_cli.py get ".../Observation?patient=<id>&code=<code>&_sort=-date&_count=1"
# Time-windowed observation
medagentbench_cli.py get ".../Observation?patient=<id>&code=<code>&date=gt<ISO>&_sort=-date"
# Active medications
medagentbench_cli.py get ".../MedicationRequest?patient=<id>&status=active"

## Efficient Workflow
1. Issue FHIR queries immediately - no excessive planning.
2. Patient lookup: Patient?identifier=<MRN> to find patient ID.
3. Query relevant resources (Observation, MedicationRequest, etc.).
4. Parse results with python3 -c or jq.
5. POST if required, then FINISH.

## Timeout Rules
- Default: timeout=30. Slow queries: timeout=60. NEVER exceed 120.

It even included a full MedicationRequest JSON template for tasks requiring medication orders.
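The query patterns above use standard FHIR search parameters; the `_sort=-date&_count=1` pair is what turns "most recent observation" into a single-result query. A sketch of building one such URL — the base URL, patient ID, and code are invented for illustration:

```python
from urllib.parse import urlencode

def most_recent_observation_url(base: str, patient_id: str, code: str) -> str:
    """Build a FHIR search URL for the most recent observation of a code."""
    params = {
        "patient": patient_id,
        "code": code,
        "_sort": "-date",  # newest first
        "_count": "1",     # return only one result
    }
    return f"{base}/Observation?{urlencode(params)}"

url = most_recent_observation_url("http://fhir.example/r4", "pat-123", "2339-0")
# -> http://fhir.example/r4/Observation?patient=pat-123&code=2339-0&_sort=-date&_count=1
```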

claude-ops-v1 (Claude Haiku 4.5, 0.548 reward) used a more general “terminal agent” prompt emphasizing verify-before-submit behavior and failure recovery heuristics. kimi-k2-medical (Kimi K2.5, 0.571 reward) was also competitive. The pattern was clear: domain-specific FHIR knowledge in the prompt combined with enforced efficiency was the winning formula.

Law (Chinese) — LawBench

This environment tests legal understanding via Chinese legal multiple-choice questions in a containerized environment, where the agent must write results to a JSONL file. The baseline scored 0.532 reward.

Autopilot used LLM judges (task_progress, format_compliance) as fast offline proxies for variant triage, since the true reward metric is episode-level feedback.

The winning variant was glm-minimal-loop-v1 (GLM-5, 0.720 reward), which distilled the task into a tight inspect-edit-validate loop:

Guidelines:
- Be concise and action-oriented.
- Inspect first, edit second, validate third, submit last.
- Use the fewest commands consistent with correctness.
- Prefer direct shell verification over assumptions.
- Use plan for multi-step jobs, otherwise execute.
- Use think only for important uncertainty.
- Never submit before checking the final artifact or test result.

kimi-persistent-v1 (Kimi K2.5, 0.698 reward) was also strong, emphasizing persistence and format validation:

- Do not use blank-answer shortcuts unless explicitly justified.
- For dataset-style tasks, parse the prompt, inspect available files,
  and generate the requested artifact carefully.
- Use shell/Python for deterministic transformations.
- Validate counts, filenames, and formatting before submission.

gemini-flash-v2-reasoning (Gemini 3 Flash, 0.681 reward) was also competitive. The key lever was disciplined format compliance and brevity — the top variants across GLM-5, Kimi K2.5, and Gemini all emphasized action over planning. Several seeds saw GPT-5 mini variants underperform baseline, suggesting the model’s verbosity hurt on this structured-output task.

Science (Astrophysics) — ReplicationBench

ReplicationBench tasks involve replicating scientific computations from papers: reading datasets, implementing equations, and returning precise numerical results. The baseline scored 0.237 reward.

Autopilot found this was one of the hardest environments to evaluate offline:

exact_match isn’t meaningful here because the dataset has output_source = none.

It switched to LLM-judge evaluation for task completion quality, but signal remained sparse.

The best variant was kimi-k2p5-hunter-v1 (Kimi K2.5, 0.400 reward). Its prompt enforced grounded, filesystem-first execution:

Mandatory habits:
- Open with a short plan containing actual commands and expected outcomes.
- Check pwd, ls, and search for relevant files before assuming locations.
- Look for resources/, /resources, readmes, tests, scripts, and schemas.
- Adapt to command errors immediately.
- For required output formats, create the exact file and verify it.

Submission rule:
Only call submit_solution() after /app/result.json exists and has been
displayed with cat.

glm5-recon-then-act-v1 also did quite well:

- Recon first, but keep it compact.
- Then act quickly.
- Use Python for data-heavy or numerical tasks.
- Verify final outputs before submission.
- If exact JSON output is required, write it exactly and cat it.

Results were highly variable across seeds — the same prompt template sometimes scored well on one model but 0 on another, highlighting strong prompt-model interaction effects in scientific computing tasks. This environment may benefit most from fine-tuning or architectural changes rather than prompt engineering alone.

Interactive Reasoning — LLM Gym (21 Questions)

The baseline GPT-5 mini agent solved ~45% of games but asked vague, low-information questions and failed to switch to direct guessing in the endgame.

Autopilot identified the core problem:

The baseline initial variant has ~45% win rate. The main issues:

1. Asks very inefficient early questions (e.g., ‘Is it used for a specific purpose?’ — almost always yes).
2. Gets stuck in narrow question loops.
3. Doesn’t use a strong binary-search / decision-tree strategy.

Anti-patterns were called out explicitly in binary-search-gpt5mini (GPT-5 mini, 0.69 solve rate):

AVOID these common mistakes:
- Asking vague questions like "Is it used for a specific purpose?"
  (nearly everything is)
- Asking follow-up questions about a category you just ruled out
- Wasting questions on overly narrow sub-categories
  before establishing broad ones

Claude Haiku 4.5 variants like “decision-tree” and “late-guesser” also performed well. Gemini 3 Flash was competitive across seeds.
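The binary-search framing comes down to halving: a question that splits the remaining candidates evenly eliminates half of them, so 21 well-chosen yes/no questions can in principle distinguish up to 2^21 (~2M) candidates. A toy simulation of the idea:

```python
import math

def questions_needed(n_candidates: int) -> int:
    """Yes/no questions needed if each question halves the candidate set."""
    return math.ceil(math.log2(n_candidates))

def halving_game(n_candidates: int, secret: int) -> int:
    """Narrow the range [0, n) down to the secret by repeatedly asking
    'is it in the lower half?'. Returns the number of questions asked."""
    lo, hi, asked = 0, n_candidates, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        asked += 1
        if secret < mid:   # answer: "yes, lower half"
            hi = mid
        else:              # answer: "no, upper half"
            lo = mid
    return asked

assert questions_needed(2**21) == 21
assert halving_game(1_000_000, secret=314_159) <= questions_needed(1_000_000)
```

Vague questions like "Is it used for a specific purpose?" are weak precisely because they split the candidate set very unevenly, eliminating almost nothing.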


The future AI engineer is an AI engineer

These are early days for TensorZero Autopilot, but the results point to an exciting new paradigm for building LLM applications where optimization is continuous, data-driven, and largely automated.

It will unlock declarative full-stack LLM engineering, where human engineers focus on defining the behaviors to incentivize, deciding how to measure success, and curating high-quality feedback, while leaving the implementation and optimization details to an automated system.

There is a lot of work ahead. We’re excited to make Autopilot a continuous, asynchronous process: one that optimizes prompts and models as well as generates code changes to improve the agent harness itself. We’re working with a handful of companies to bring Autopilot to real-world applications and plan to launch a self-serve product powered by our open-source LLMOps platform soon. TensorZero turns your production LLM application into an RL environment, so it only takes a few lines of code to get TensorZero Autopilot running.

Join the TensorZero Autopilot waitlist →

Start building today. Check out our GitHub and Quick Start.

Subscribe to our frontier AI engineering newsletter