Why this comparison exists
Evaluating agents
In his post on agents, Simon Willison defined an agent as an LLM that runs tools in a loop to achieve a goal.
This leaderboard takes that definition literally, measuring how the choice of LLM and tool-calling framework affects task completion.
This benchmark compares two distinct approaches to tool-calling. Direct tool calling asks the model to choose one structured function call at a time, receive the result, and decide the next call. Code-mode variants expose a programming environment instead, so the model can loop, branch, inspect intermediate values, reuse variables, and compose multiple tool calls inside one execution.
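To make the contrast concrete, here is a minimal sketch with invented tools (get_order and refund_order are placeholders, not the benchmark's actual tools): direct tool calling surfaces each call as its own model turn, while code mode lets the model write one program that composes the calls.

```python
# Hypothetical tools for illustration only; the benchmark's real tools differ.
def get_order(order_id: str) -> dict:
    return {"id": order_id, "status": "delivered", "refunded": False, "total": 42.0}

def refund_order(order_id: str, amount: float) -> dict:
    return {"id": order_id, "refunded_amount": amount}

# Direct tool calling: the model emits one structured call per turn and waits
# for the result before choosing the next call, e.g.
#   turn 1 -> {"tool": "get_order",    "args": {"order_id": "A1"}}
#   turn 2 -> {"tool": "refund_order", "args": {"order_id": "A1", "amount": 42.0}}

# Code mode: the model writes a single program that composes both calls, so
# loops, branches, and intermediate values stay inside one execution.
order = get_order("A1")
if order["status"] == "delivered" and not order["refunded"]:
    refund_order("A1", order["total"])
```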
The code-mode comparison is grounded in Cloudflare's Code Mode, which gives the model a programming environment for tool use, and in Pydantic's Monty, its restricted code-mode approach. This leaderboard evaluates Monty against direct tool calling and against a Python equivalent of Code Mode: CPython running in Docker.
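The Docker variant can be sketched as a throwaway CPython container that executes model-generated code. The image, resource limits, and flags below are assumptions for illustration, and the real harness also has to expose the tools inside the sandbox, which this sketch omits.

```python
import subprocess

def run_in_sandbox(code: str, timeout: float = 30.0) -> str:
    """Run model-generated Python in a disposable container (illustrative only)."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",   # no network access inside the sandbox
            "--memory", "256m",    # cap memory
            "python:3.12-slim",
            "python", "-c", code,
        ],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr

print(run_in_sandbox("print(sum(range(10)))"))  # -> 45
```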
Everything here runs in Python using pydantic-ai. That means the results are not a pure measure of agent capability: the framework's system prompts, tool schemas, and retry behavior, along with the choice of Python as the orchestration language, can all affect outcomes.
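For orientation, a direct-tool-calling runner on pydantic-ai looks roughly like the sketch below; the model id, system prompt, and tool are placeholders rather than the benchmark's actual configuration, and the framework fills in the tool schemas, retries, and prompt scaffolding mentioned above.

```python
from pydantic_ai import Agent

agent = Agent(
    "openai:gpt-4o",  # placeholder model id
    system_prompt="Use the available tools to answer the user.",
)

@agent.tool_plain
def get_balance(account_id: str) -> float:
    """Return the current balance for an account (stubbed for illustration)."""
    return 125.50

result = agent.run_sync("What is the balance of account A1?")
print(result.output)  # recent pydantic-ai exposes .output; older releases used .data
```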
Scoreboard
Agent performance
How were attempts judged?
Completed run: the runner finished without an execution error.
Logically correct: the required business mutations appear in tool_calls with the correct key values, and no obviously harmful opposite action appears.
Production ready: logically correct, plus no tool errors and no duplicate state-changing actions.
A frontier model judged the saved traces after each run. The recorded tool calls were treated as the source of truth, while final prose was only context.
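As a rough illustration of the rubric (not the judge that was actually used, which was a frontier model reading the saved traces), the sketch below checks a recorded trace against the required mutations; the field names are invented, and the "harmful opposite action" check is omitted because it needs judgment.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)
    error: bool = False

def judge(trace: list[ToolCall], required: list[ToolCall]) -> dict:
    """Approximate the 'logically correct' and 'production ready' checks."""
    def matches(req: ToolCall, call: ToolCall) -> bool:
        return call.name == req.name and all(
            call.args.get(k) == v for k, v in req.args.items()
        )

    logically_correct = all(any(matches(req, c) for c in trace) for req in required)
    no_tool_errors = not any(c.error for c in trace)
    mutating = [
        (c.name, tuple(sorted(c.args.items())))
        for c in trace
        if c.name in {r.name for r in required}
    ]
    no_duplicates = len(mutating) == len(set(mutating))
    return {
        "logically_correct": logically_correct,
        "production_ready": logically_correct and no_tool_errors and no_duplicates,
    }
```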
Scenario × runner heatmap
Each cell is the average score over five attempts.
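In other words, each cell is a plain mean; the sketch below shows the arithmetic with made-up attempt scores.

```python
from collections import defaultdict
from statistics import mean

# (scenario, runner, score) for each attempt; values are invented for illustration.
attempts = [
    ("refund_flow", "direct", 1.0),
    ("refund_flow", "direct", 0.0),
    ("refund_flow", "monty", 1.0),
    # ... five attempts per (scenario, runner) pair in the real data
]

cells: dict[tuple[str, str], list[float]] = defaultdict(list)
for scenario, runner, score in attempts:
    cells[(scenario, runner)].append(score)

heatmap = {key: mean(scores) for key, scores in cells.items()}
```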