GTA-2: Benchmarking General Tool Agents from Atomic Use to Open-Ended Workflows

TL;DR

GTA-2 reveals a capability cliff: frontier models score below 50% on atomic (single-step) tool use, and on open-ended multi-step workflows even the top models reach only 14.39% success. The key finding is that execution harness design matters more than the underlying model: the Manus and OpenClaw frameworks raise workflow completion substantially above what the same model achieves without them.

Key Findings

Two-tier benchmark:

GTA-Atomic: short-horizon, closed-ended, single-tool precision tasks (inherited from GTA-1)
GTA-Workflow: long-horizon, open-ended, multi-tool coordination tasks

Results:

Frontier models: below 50% on GTA-Atomic
Top models on GTA-Workflow: 14.39% success
Advanced frameworks (Manus, OpenClaw): substantially better than the same model run without a framework

Evaluation mechanism for open-ended tasks: Recursive checkpoint-based evaluation decomposes the open-ended objective into verifiable sub-goals, each of which can be checked independently. This avoids reducing a task with no single correct output to one binary success/fail judgment.
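The idea of recursive checkpoint evaluation can be illustrated with a minimal sketch. This is not the benchmark's actual scorer: the `Checkpoint` class, the `score_checkpoint` function, the example task, and the choice to average child scores with equal weight are all assumptions made for illustration. The source only states that the objective is decomposed into independently verifiable sub-goals.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    """A node in a hypothetical checkpoint tree (illustrative, not GTA-2's schema)."""
    name: str
    # Leaf nodes carry a verifier that inspects the agent's output artifacts.
    check: Optional[Callable[[dict], bool]] = None
    # Inner nodes are decomposed into smaller sub-goals.
    children: List["Checkpoint"] = field(default_factory=list)


def score_checkpoint(cp: Checkpoint, artifacts: dict) -> float:
    """Return a score in [0, 1] for this checkpoint.

    Assumption: a leaf scores 1.0 if its verifier passes and 0.0 otherwise,
    and an inner node scores the unweighted mean of its children.
    """
    if not cp.children:
        return 1.0 if cp.check is not None and cp.check(artifacts) else 0.0
    child_scores = [score_checkpoint(child, artifacts) for child in cp.children]
    return sum(child_scores) / len(child_scores)


# Hypothetical open-ended task: "produce a sales report".
report = Checkpoint("produce sales report", children=[
    Checkpoint("data gathered", children=[
        Checkpoint("rows loaded", check=lambda a: "rows" in a),
        Checkpoint("at least 10 rows", check=lambda a: len(a.get("rows", [])) >= 10),
    ]),
    Checkpoint("chart saved", check=lambda a: a.get("chart_path", "").endswith(".png")),
])

# Artifacts the agent left behind: data was loaded but is too short; chart exists.
artifacts = {"rows": [1, 2, 3, 4], "chart_path": "out/chart.png"}
result = score_checkpoint(report, artifacts)
print(result)  # 0.75
```

In this sketch the agent earns 0.5 for "data gathered" (one of two leaves passes) and 1.0 for "chart saved", so the root scores 0.75 rather than a flat failure. A real scorer might weight sub-goals differently or require all checkpoints to pass before counting a task as successful.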

Harness design matters: The finding that the execution framework matters more than the model is the most consequential result. It shifts focus from model capability to scaffolding architecture, consistent with the Claude Code analysis (04-17, 04-19), which showed that the while-loop is trivial and that the surrounding systems determine real-world performance.

Connection to Prior Agent Benchmarks

This is the third agent benchmark in the wiki this week:

OccuBench (04-16): 100 professional task scenarios, Language World Models
DR3-Eval (04-18): deep research benchmark with static corpus sandboxes
GTA-2 (04-20): tool use from atomic to workflow, real queries + deployed tools

Together they converge on the same finding from different angles: models cannot yet complete realistic multi-step tasks reliably. OccuBench found this in professional tasks, DR3-Eval in research tasks, and GTA-2 in tool-use workflows. Three papers, same finding, one week.

Relations to Prior Wiki Pages

Claude Code architecture (04-17, 04-19): GTA-2 validates the architecture observation — harness design is the differentiator. Claude Code's ML-based permission system, 5-layer compaction, and extensibility mechanisms are exactly the kinds of harness features GTA-2 shows matter.
VAKRA (04-16): VAKRA documented tool-use failure modes. GTA-2 quantifies the performance gap those failure modes produce at scale.
Exploration/Exploitation (04-16): GTA-2's workflow failures map to the exploit-too-early failure mode, in which agents commit to an approach before exploring sufficiently.

Raw Source

→ raw/huggingface/2026-04-20-gta-2-benchmarking-general-tool-agents-from-atomic-tool-use.md