Tool-calling agent evaluation

Local agent benchmarks

Why this comparison exists

Evaluating agents

In his post on agents, Simon Willison defined an agent as an LLM that runs tools in a loop to achieve a goal. This leaderboard takes that definition literally, measuring how the choice of LLM and tool-calling framework affects task completion.

This benchmark compares two distinct approaches to tool-calling. Direct tool calling asks the model to choose one structured function call at a time, receive the result, and decide the next call. Code-mode variants expose a programming environment instead, so the model can loop, branch, inspect intermediate values, reuse variables, and compose multiple tool calls inside one execution.
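
To make the direct tool-calling side concrete, here is a minimal pydantic-ai sketch. The model string, tools, and task are placeholders rather than the benchmark's actual scenarios: the framework sends the tool schemas to the model, executes each structured call the model makes, feeds the result back, and repeats until the model produces a final answer.

```python
from pydantic_ai import Agent

# Hypothetical direct tool-calling agent: one structured call per step.
agent = Agent(
    "openai:gpt-4o",  # placeholder model string
    system_prompt="You are a support agent. Use the tools to resolve the ticket.",
)

@agent.tool_plain
def lookup_order(order_id: str) -> dict:
    """Return the stored record for an order (stub data for illustration)."""
    return {"order_id": order_id, "status": "shipped", "total": 42.0}

@agent.tool_plain
def refund_order(order_id: str, amount: float) -> str:
    """Issue a refund; a state-changing call a judge would look for in the trace."""
    return f"refunded {amount} for {order_id}"

result = agent.run_sync("Refund order A-17 in full.")
print(result.output)          # final answer text
print(result.all_messages())  # full trace, including the recorded tool calls
```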

The code-mode comparison is grounded in Cloudflare's Code Mode, which gives the model a programming environment for tool use, and in Pydantic's Monty, which describes Pydantic's restricted code-mode approach. This leaderboard evaluates Monty against direct tool calling and against a Python equivalent of Code Mode: CPython running in Docker.
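
The code-mode variants can be pictured as follows. This is only an illustrative sketch, not Monty's or Code Mode's actual API: the tools are exposed as plain Python functions, the model writes one script that composes them, and the harness executes that script in a sandbox (Monty's restricted interpreter or CPython inside Docker in this benchmark, not a bare exec).

```python
# Illustrative only: how a code-mode runner wires tools into one script execution.

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped", "total": 42.0}

def refund_order(order_id: str, amount: float) -> str:
    return f"refunded {amount} for {order_id}"

# Script the model wrote: it branches, inspects intermediate values, and
# composes several tool calls instead of emitting one structured call at a time.
model_script = """
order = lookup_order("A-17")
if order["status"] == "shipped":
    receipt = refund_order(order["order_id"], order["total"])
result = receipt
"""

namespace = {"lookup_order": lookup_order, "refund_order": refund_order}
exec(model_script, namespace)
print(namespace["result"])
```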

Everything here is done in Python using pydantic-ai. That means the results are not a pure measure of agent capability: pydantic-ai's system prompts, tool schemas, retry behavior, and the choice of Python as the orchestration language can all affect outcomes.

Scoreboard

Agent performance

How were attempts judged?

Completed run: the runner finished without an execution error.

Logically correct: the required business mutations appear in tool_calls with the correct key values, and no obviously harmful opposite action was taken.

Production ready: logically correct plus no tool errors and no duplicate state-changing actions.

A frontier model judged the saved traces after each run. The recorded tool calls were treated as the source of truth, while final prose was only context.
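
The structured half of that judgment can be approximated programmatically. The sketch below uses invented field names and tool names, not the actual judge prompt or trace schema: it scans the recorded tool calls for the required mutation, for tool errors, and for duplicate state-changing actions.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict
    error: bool = False  # hypothetical trace fields, not the benchmark's real schema

# Illustrative set of state-changing tools for a support scenario.
STATE_CHANGING = {"refund_order", "cancel_order"}

def judge(trace: list[ToolCall], required: ToolCall) -> dict[str, bool]:
    """Score one saved trace against the required business mutation."""
    logically_correct = any(
        call.name == required.name
        and all(call.args.get(k) == v for k, v in required.args.items())
        for call in trace
    )
    no_tool_errors = not any(call.error for call in trace)
    mutations = [
        (call.name, tuple(sorted(call.args.items())))
        for call in trace
        if call.name in STATE_CHANGING
    ]
    no_duplicates = len(mutations) == len(set(mutations))
    return {
        "logically_correct": logically_correct,
        "production_ready": logically_correct and no_tool_errors and no_duplicates,
    }
```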

Scenario × runner heatmap

Each cell is the average score over five attempts.
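
The cell computation itself is just a mean per scenario and runner. The shape of the per-attempt records below is invented for illustration; only the averaging matches the description above.

```python
from collections import defaultdict
from statistics import mean

# Invented record shape: (scenario, runner, score) per attempt.
attempts = [
    ("refund-flow", "direct", 1.0),
    ("refund-flow", "direct", 0.0),
    ("refund-flow", "monty", 1.0),
    # ... five attempts per (scenario, runner) pair in the real benchmark
]

cells: dict[tuple[str, str], list[float]] = defaultdict(list)
for scenario, runner, score in attempts:
    cells[(scenario, runner)].append(score)

heatmap = {cell: mean(scores) for cell, scores in cells.items()}
```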