agentic-systems · Tier 2

Agent Evaluation & Benchmarks


A growing ecosystem of benchmarks specifically designed for agentic AI — measuring not just accuracy but exploration/exploitation, long-horizon task completion, tool use, robustness, and professional domain coverage.

Current State (as of 2026-05-14)

Latest additions (2026-05-14): Two papers extend the benchmark-skepticism thread that has been building since Soohak (05-12).

AgentLens (summary) shows that 10.7% of passing SWE-bench Verified trajectories are Lucky Passes: runs that pass despite regression cycles, blind retries, missing verification, or temporally disordered work. The framework merges per-task passing trajectories into a Prefix Tree Acceptor reference space and uses a context-sensitive intent-stage labeler (Exploration / Implementation / Verification / Orchestration). Some model backends drop 5 ranking positions when scored by quality instead of pass rate, so pass rate alone is misleading for between-model comparison. AgentLens-Bench comprises 1,815 trajectories from 47 tasks across 8 model backends.

AssetOpsBench retrospective (raw) reports a public-to-hidden score correlation of −0.13 on 234 submissions to the CODS 2025 challenge: public standing does not predict hidden robustness. Three papers in three days (Soohak, AgentLens, AssetOps) make the same point from different angles: aggregate leaderboard metrics over-aggregate.

MAP (Map-then-Act) (summary) is the architectural complement to AgentLens: frontier models surpass near-zero ARC-AGI-3 baselines in 22 of 25 game environments when they build the environment prior before acting. Training on map-then-act trajectories beats training on expert execution traces, which reframes what good demonstration data looks like for long-horizon agents.
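The Prefix Tree Acceptor idea in the AgentLens note can be illustrated with a small sketch. It assumes each passing trajectory has already been reduced to a sequence of intent-stage labels; the class and function names (`PTANode`, `build_pta`, `match_prefix`) and the single-letter label encoding are hypothetical, and this is not AgentLens's actual scoring procedure, only the underlying data structure.

```python
from dataclasses import dataclass, field


@dataclass
class PTANode:
    """One state of the prefix tree; children are keyed by stage label."""
    children: dict = field(default_factory=dict)
    count: int = 0           # passing trajectories that pass through this state
    accepting: bool = False  # True if some passing trajectory ends here


def build_pta(trajectories):
    """Merge stage-label sequences from passing runs into one prefix tree."""
    root = PTANode()
    for stages in trajectories:
        node = root
        node.count += 1
        for stage in stages:
            node = node.children.setdefault(stage, PTANode())
            node.count += 1
        node.accepting = True
    return root


def match_prefix(root, stages):
    """Return (length of the longest prefix found in the tree, accepted?)."""
    node, depth = root, 0
    for stage in stages:
        nxt = node.children.get(stage)
        if nxt is None:
            break
        node, depth = nxt, depth + 1
    accepted = depth == len(stages) and node.accepting
    return depth, accepted


# Hypothetical labels: E=Exploration, I=Implementation, V=Verification.
passing_runs = [["E", "I", "V"], ["E", "E", "I", "V"], ["E", "I", "V"]]
pta = build_pta(passing_runs)

print(match_prefix(pta, ["E", "I", "V"]))  # follows a known passing path
print(match_prefix(pta, ["I", "I", "I"]))  # diverges at the first step
```

A trajectory that leaves the reference space early, for example by implementing before any exploration, shares only a short prefix with the tree. How that divergence is turned into a quality score is left open here.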

Prior State (as of 2026-05-07)

Standard LLM benchmarks underserve agents. The field has been building agent-specific eval frameworks across several dimensions: decision-making quality, professional domain coverage, multimodal grounding, and robustness under fault injection. Nine benchmarks (OccuBench, GTA-2, DR3-Eval, PRL-Bench, Claw-Eval-Live, InteractWeb-Bench, AcademiClaw, PhysicianBench, ProgramBench) now report frontier-agent pass rates of 0–55% on realistic multi-step tasks. ProgramBench (05-06), at 0% on every model, is the floor; PhysicianBench (46%) and AcademiClaw (55%) are the realistic ceiling. 2026-05-07 adds a tenth benchmark: BRIGHT-Pro, the first benchmark for evidence-portfolio retrieval rather than top-1 relevance, evaluating retrievers under both static and agentic protocols. MedSkillAudit (also 05-07) shifts the evaluation surface from agent capability to agent skill release readiness: 75 medical research skills were audited, 57.3% fell below the Limited Release threshold, and system-expert agreement (ICC = 0.449) exceeded the human inter-rater baseline (0.300). The Marcus production-agent security paper (05-06) frames the limit case: 91% of 847 deployed agents are vulnerable to tool-chaining attacks, and 89.4% drift after 30 turns. Capability ceiling, evaluation methodology, skill audit, deployment security: the agent benchmarks cluster is now four-dimensional.

Key Benchmarks

OccuBench (2026-04-16) — 100 tasks across 65 professional domains using Language World Models (LWMs) to simulate environments. Key finding: no single model dominates all industries; implicit faults are hardest. → summary

Exploration/Exploitation Measurement (2026-04-16) — Policy-agnostic metric for explore/exploit errors in LM agents on 2D grid environments. Reasoning models perform best; harness engineering meaningfully improves both dimensions. → summary

GameWorld (2026-04-16) — 34 browser games, 170 tasks, state-verifiable outcomes for MLLM game agents. Best models still far below human. → summary

MERRIN (2026-04-16) — Search-augmented agent benchmark with noisy multimodal web evidence. Average accuracy 22.3%; agents over-rely on text modalities. → summary

InfiniteScienceGym (2026-04-16) — Procedurally generated scientific analysis benchmark. No model exceeds 45%; abstention on unanswerable questions is a key weakness. → summary

DR3-Eval (2026-04-18) — Deep Research Agent benchmark. Static per-task corpus sandboxes with evidential sources, confounding documents, and noise. Reverse-constructed questions (derived from verified evidential docs) ensure every task is answerable. Multi-dimensional scoring: recall, factual accuracy, citation coverage, instruction following, depth. State-of-the-art models still struggle. → summary

GTA-2 (2026-04-20) — Two-tier benchmark: GTA-Atomic (single-step tool precision) and GTA-Workflow (long-horizon, open-ended multi-tool coordination). Key results: frontier models below 50% on atomic tasks; top models at 14.39% on workflows. Critical finding: execution harness design (Manus, OpenClaw) matters more than underlying model capability. Uses real user queries and deployed tools — not synthetic evals. Recursive checkpoint-based evaluation for open-ended tasks. → summary
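One way to picture recursive checkpoint-based evaluation is as a weighted tree in which leaf checkpoints are marked pass or fail and each parent's score is the weighted mean of its children. The sketch below assumes that structure; the `Checkpoint` class, the weights, and the example workflow are hypothetical and are not taken from the GTA-2 grader.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    name: str
    weight: float = 1.0   # relative importance among siblings (assumed > 0)
    passed: bool = False  # only read on leaf checkpoints
    children: list = field(default_factory=list)


def score(cp):
    """Recursively score a checkpoint in [0, 1].

    Leaves score 1.0 or 0.0; inner nodes take the weighted mean of
    their children, so partial progress on a workflow earns partial credit.
    """
    if not cp.children:
        return 1.0 if cp.passed else 0.0
    total_weight = sum(child.weight for child in cp.children)
    return sum(child.weight * score(child) for child in cp.children) / total_weight


# Hypothetical open-ended workflow with one nested sub-goal.
workflow = Checkpoint("quarterly report", children=[
    Checkpoint("collect data", weight=1.0, passed=True),
    Checkpoint("analysis", weight=2.0, children=[
        Checkpoint("load tables", passed=True),
        Checkpoint("produce chart", passed=False),
    ]),
    Checkpoint("write summary", weight=1.0, passed=False),
])

print(score(workflow))
```

Scoring this way separates an agent that completed half the sub-goals from one that did nothing, a distinction a single end-state check would lose.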

PRL-Bench (2026-04-20) — Physics Research by LLMs benchmark. 100 tasks from Physical Review Letters papers (Aug 2025+, post-training cutoff for most models). Covers 5 subfields: astrophysics, condensed matter, high-energy, quantum information, statistical physics. Tasks replicate authentic research: exploration-oriented formulation, long-horizon workflows, verifiable outcomes. All frontier models score below 50%. Expert-validated. → summary

Claw-Eval-Live (2026-05-01) — First live workflow-agent benchmark. Refreshable signal layer (ClawHub Top-500 skills, updated each release) + reproducible release snapshot (frozen fixtures, services, graders). 105 tasks, 13 frontier models, deterministic + structured-LLM grading on execution traces, audit logs, service state, post-run artifacts. Best model: 66.7%; no model reaches 70%. HR / management / multi-system business workflows persistently fail. → summary

InteractWeb-Bench (2026-05-01) — First benchmark to grade clarifying behavior explicitly. Four user-agent personas + persona-driven instruction perturbations from RE defect taxonomies. Unified agent action space: Clarify / Implement / Verify / Submit. Frontier MLLM agents remain trapped in blind execution — generating code that satisfies their misreading of the instruction without ever asking. → summary

AcademiClaw (2026-05-05) — Bilingual academic-level benchmark, 80 multi-step tasks curated from 230 real student submissions across 25+ professional domains (olympiad math, linguistics, GPU-intensive RL, full-stack debugging). Docker sandbox per task; six-technique multi-dimensional rubric scoring + five-category safety audit. Best of six advanced models: 55%. Capability varies sharply across domains; compute does not predict output quality — argues against current "more thinking tokens equal better results" defaults. → summary

PhysicianBench (2026-05-05) — 100 long-horizon physician tasks from real consultation cases inside an EHR environment with vendor APIs. 21 specialties; ~27 tool calls per task. Best closed-source model: 46% pass@1. Best open-source: 19%. Highest tool-call horizon in any of the eight benchmarks; the gap between knowledge tests (where LLMs match physicians) and EHR-mediated workflows (where they do not) is the load-bearing finding. → summary
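For reference, pass@1 is the k = 1 case of the pass@k metric. The sketch below uses the unbiased estimator that is standard in the code-generation literature, pass@k = 1 − C(n−c, k) / C(n, k) for n sampled attempts of which c pass. Whether PhysicianBench samples several attempts per task or runs each task once is not stated here, so the sample counts in the example are purely illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task.

    n: attempts sampled, c: attempts that passed, k: attempt budget.
    Returns the probability that at least one of k attempts drawn
    without replacement from the n samples is a pass.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so a pass is certain
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 10 attempts on a task, 3 of them pass.
print(pass_at_k(10, 3, 1))  # k = 1 reduces to the plain pass fraction c / n
print(pass_at_k(10, 3, 5))
```

A benchmark-level pass@1 is then the mean of the per-task values.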

BRIGHT-Pro and RTriever-4B (2026-05-07) — first benchmark for evidence-portfolio retrieval rather than top-1 relevance. Each query is expanded with multi-aspect gold evidence; retrievers are graded under both static and agentic protocols. RTriever-Synth, an aspect-decomposed synthetic corpus, generates complementary positives and positive-conditioned hard negatives. RTriever-4B (LoRA on Qwen3-Embedding-4B) substantially improves over its base. Aspect-aware and agentic evaluation expose behaviors hidden by standard top-k metrics. → summary
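The difference between top-1 relevance and evidence-portfolio retrieval can be shown with a toy metric. The sketch below assumes each query comes with gold documents grouped by aspect and measures the fraction of aspects covered in the top k results. The function names, aspect labels, and document IDs are hypothetical, and this is not BRIGHT-Pro's official metric.

```python
def top1_hit(ranked, gold_by_aspect):
    """True if the first retrieved document is gold for any aspect."""
    all_gold = set().union(*gold_by_aspect.values())
    return bool(ranked) and ranked[0] in all_gold


def aspect_coverage_at_k(ranked, gold_by_aspect, k):
    """Fraction of aspects with at least one gold document in the top k."""
    top_k = set(ranked[:k])
    covered = sum(1 for docs in gold_by_aspect.values() if top_k & set(docs))
    return covered / len(gold_by_aspect)


# Hypothetical query with three aspects of gold evidence.
gold = {
    "mechanism": {"d1", "d2"},
    "evidence": {"d5"},
    "limitations": {"d9"},
}
ranked = ["d1", "d2", "d3", "d5", "d4"]

print(top1_hit(ranked, gold))                 # top-1 relevance is satisfied
print(aspect_coverage_at_k(ranked, gold, 3))  # only one of three aspects covered
print(aspect_coverage_at_k(ranked, gold, 5))
```

In this example the retriever looks perfect under top-1 relevance while returning redundant evidence for a single aspect, which is the kind of behavior standard top-k metrics hide.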

MedSkillAudit (2026-05-07) — first skill-release-readiness audit framework. 75 medical research skills, two human experts, ordinal release disposition (Production / Limited / Beta / Reject). System-expert ICC = 0.449 vs human inter-rater 0.300. 57.3% of skills below Limited Release threshold. Negative ICC on Academic Writing (-0.567) reveals structural rubric-expert mismatch on open-ended generative tasks. Pre-deployment audit complement to the Marcus post-deployment security study. → summary
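To make the ICC figures concrete, here is a minimal sketch of a single-rater, two-way random-effects, absolute-agreement ICC, often written ICC(2,1). Which ICC variant MedSkillAudit reports is not stated here, so the choice of variant is an assumption, and the ratings are invented for illustration.

```python
def icc_2_1(ratings):
    """ICC(2,1) from an n-subjects x k-raters table of numeric ratings.

    Uses the two-way ANOVA mean squares:
    (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n).
    Raises ZeroDivisionError if the denominator is zero, for example
    when every rating in the table is identical.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Invented example: each row is one skill, columns are (system, expert),
# with release dispositions coded 1 (Reject) to 4 (Production).
ratings = [[4, 3], [3, 3], [2, 1], [1, 2], [3, 4]]
print(round(icc_2_1(ratings), 2))  # about 0.67 for this table
```

The estimate turns negative whenever the residual mean square exceeds the between-subject mean square, that is, when the two raters disagree on individual items by more than the items differ from one another. A negative value such as the reported -0.567 points to that kind of pattern.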

OpenSearch-VL (2026-05-07) — open recipe for frontier multimodal search agents. Wikipedia path sampling with fuzzy entity rewriting, source-anchor visual grounding, unified text+image+OCR+image-manipulation tool environment, and multi-turn fatal-aware GRPO (mask post-failure tokens, preserve pre-failure reasoning via one-sided advantage clamping). 10-point average gain across seven benchmarks; matches proprietary commercial models on several. The training-time intervention at the same multi-turn surface where the Marcus security paper measures failures. → summary
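The fatal-aware GRPO description admits more than one reading. The sketch below shows one possible interpretation, stated as an assumption: tokens at or after the fatal step are masked out of the loss, and for tokens before it the trajectory's group-relative advantage is clamped at zero from below so that pre-failure reasoning is never pushed down. The function names and the token-index representation are hypothetical, and the actual OpenSearch-VL recipe may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fatal_aware_token_advantages(advantage, num_tokens, fatal_index=None):
    """Return (per-token advantages, loss mask) for one trajectory.

    fatal_index is the first token of the fatal failure, or None if the
    trajectory had no fatal failure. ASSUMED behavior, not the paper's spec:
      * tokens >= fatal_index get mask 0 and contribute nothing to the loss;
      * tokens <  fatal_index keep max(advantage, 0), a one-sided clamp.
    """
    if fatal_index is None:
        return [advantage] * num_tokens, [1] * num_tokens
    clamped = max(advantage, 0.0)
    advantages = [clamped if t < fatal_index else 0.0 for t in range(num_tokens)]
    mask = [1 if t < fatal_index else 0 for t in range(num_tokens)]
    return advantages, mask


# Hypothetical group of four rollouts; the second fails fatally at token 3 of 5.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
token_advs, mask = fatal_aware_token_advantages(advs[1], num_tokens=5, fatal_index=3)
print(token_advs, mask)
```

Under this reading a failed rollout with a negative advantage produces no gradient at all on its pre-failure tokens, so the clamp only lets pre-failure reasoning be reinforced, never penalized.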

Patterns Across Benchmarks

  • Reasoning models consistently outperform base models on agentic tasks
  • Over-exploration is a common failure mode in strong models
  • Professional/domain-specific tasks expose different weaknesses than general benchmarks
  • Deterministic environment generation (OccuBench, InfiniteScienceGym) removes publication bias
  • Execution harness dominates model capability (GTA-2): the scaffold around the model determines workflow completion more than model capability itself. Confirmed empirically by the Ridge Security pentester benchmark (2026-05-04) at constant model: belief state, evidence-as-invariant, and trust propagation account for >5x finding gaps between architectures using the same Gemini 3 Flash backbone.
  • Eight benchmarks now converge on the same finding: frontier models fail realistic multi-step tasks reliably — this is a consistent, cross-domain measurement
  • Middle-band discrimination (Claw-Eval-Live, 05-01): models with similar pass rates diverge in overall completion, suggesting per-task-family routing could outperform any single model
  • Blind execution (InteractWeb-Bench, 05-01): a distinct, named failure mode where agents guess rather than clarify under ambiguous instructions — the first benchmark to grade this dimension explicitly
  • Compute-quality decoupling (AcademiClaw, 2026-05-05): computational resource consumption does not predict output quality across 80 academic-level tasks. The compute-as-proxy default is empirically broken
  • Long-horizon tool-call gap (PhysicianBench, 2026-05-05): 27-call average is the highest horizon in the cluster; the open-source vs closed-source gap (19% vs 46%) is largest at this horizon, suggesting tool-use trace data, not raw capability, is the bottleneck
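The middle-band discrimination point above, that per-task-family routing could outperform any single model, can be checked with simple arithmetic. The sketch below uses invented per-family pass rates and weights families equally; the model names, family names, and numbers are all hypothetical. Because the router is chosen on the same numbers it is scored on, the result is an upper bound, not a held-out estimate.

```python
from statistics import mean


def best_single_model(pass_rates):
    """Return (model, mean pass rate) for the best model across all families."""
    averages = {model: mean(rates.values()) for model, rates in pass_rates.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]


def per_family_routing(pass_rates):
    """Route each task family to its best model; return (routing, mean pass rate)."""
    families = next(iter(pass_rates.values())).keys()
    routing = {
        family: max(pass_rates, key=lambda model: pass_rates[model][family])
        for family in families
    }
    routed_score = mean(pass_rates[routing[f]][f] for f in families)
    return routing, routed_score


# Invented pass rates: the two models are close overall but differ by family.
pass_rates = {
    "model_a": {"hr": 0.40, "devops": 0.80, "finance": 0.70},
    "model_b": {"hr": 0.60, "devops": 0.60, "finance": 0.65},
}

print(best_single_model(pass_rates))
print(per_family_routing(pass_rates))
```

With these numbers the best single model averages about 0.63 while the routed combination averages 0.70. The gap comes entirely from the family where the weaker overall model is stronger.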

Related Pages