AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents
arXiv: 2605.16819 · HF: paper page · Tier: 1 (GPU kernels, agentic systems, benchmarks)
TL;DR
AgentKernelArena is the first GPU-kernel-optimization benchmark designed for full agent workflows (read code, invoke compilers, run profilers, iterate) rather than single-shot LLM calls. 196 tasks across HIP-to-HIP, Triton-to-Triton, and PyTorch-to-HIP translation, evaluated in isolated workspaces with gated compilation / correctness / performance checks, plus an unseen-configuration generalisation protocol. Across Cursor Agent, Claude Code, and Codex Agent, best configurations reach mean speedups of 6.89x (PyTorch-to-HIP), 6.69x (HIP-to-HIP), and 2.13x (Triton-to-Triton). HIP-to-HIP and Triton-to-Triton optimisations largely transfer to unseen input shapes; PyTorch-to-HIP exhibits substantial correctness drops, meaning agents generating kernels from scratch frequently hardcode shape-specific assumptions.
Key findings
- Existing GPU kernel benchmarks evaluate single LLM calls. None included kernel-to-kernel optimisation, none tested unseen-configuration generalisation, and none evaluated full agent workflows where the agent reads code, runs compilers and profilers, and iterates.
- AgentKernelArena's 196 tasks span three workflows: HIP-to-HIP optimisation (same language, optimise existing kernel), Triton-to-Triton optimisation (same), and PyTorch-to-HIP translation (cross-language, generate a kernel from a PyTorch reference).
- Evaluation is in isolated workspaces with gated compilation, correctness, and performance checks plus centralised scoring. The unseen-configuration protocol tests whether the agent's optimised kernel still runs correctly on input shapes the agent never observed during the optimisation loop.
- Best configurations of Cursor Agent, Claude Code, and Codex Agent reach mean speedups of 6.89x (PyTorch-to-HIP), 6.69x (HIP-to-HIP), and 2.13x (Triton-to-Triton). Correctness rates are near-perfect on most task categories.
- Unseen-configuration evaluation: HIP-to-HIP and Triton-to-Triton optimisations largely transfer to unseen shapes. PyTorch-to-HIP exhibits substantial correctness drops, meaning agents that generate kernels from scratch tend to hardcode shape-specific assumptions.
- Framework is modular and extensible across agents, tasks, and hardware targets.
Relationship to prior wiki entries
The wiki's GPU-kernel agent thread now has three stable entries. KernelBench-X (2026-05-09, the LLM GPU-kernel benchmark covering 16 frontier models on 250 PyTorch operations measuring correctness, speedup, hardware utilisation, and cross-GPU consistency) was the single-call benchmark. Cutile-rs (2026-05-17, the Rust-based CUDA tile programming substrate flagged in last week's social-stream) is on the authoring side. Fournex GPU bottleneck analyser (2026-05-18 Industry Pulse, the open-source tool that turns Nsight Compute output into evidence-backed CUDA optimisation recommendations classified by bottleneck type) is on the diagnostic side.
AgentKernelArena fills the third leg: the agentic-workflow benchmark. The 6.89x mean speedup on PyTorch-to-HIP and 6.69x on HIP-to-HIP is the first time the wiki has tracked agentic kernel optimisation hitting the 6x range on a curated benchmark; it indicates that the agent-harness improvements the field has been making in coding agents (Composer 2.5 today, Claude Code at large-codebase scale, Codex CLI's RL with textual feedback) translate to GPU-kernel-specific tasks.
The PyTorch-to-HIP correctness drop on unseen shapes is the load-bearing failure mode finding. It echoes the wiki's broader thread on deployment-calibration brittleness: WildClawBench (2026-05-15, the 18-point harness spread), CurveBench (2026-05-17, the visual-reasoning RLVR-recoverable gap), PAGER and DiagnosticIQ (2026-05-18, the GUI-and-industrial-rule calibration benchmarks). All five say: agents look strong on the configurations they saw, fragile on configurations they did not. AgentKernelArena now extends that pattern from text-domain tasks to GPU-kernel generation.
Why it matters
GPU kernels are the substrate of inference and training cost. The cost of human kernel-authoring time on H200, B200, and now Blackwell B300 is the primary bottleneck for porting research code to production. A benchmark that measures agentic kernel optimisation under realistic workflow conditions, with generalisation testing, is the missing measurement infrastructure for that bottleneck. The 6x speedups on HIP-to-HIP and PyTorch-to-HIP at near-perfect correctness on seen configurations are large enough to justify the human-time investment in setting up AgentKernelArena harnesses inside production teams.
The shape-hardcoding failure on PyTorch-to-HIP is the deployment caveat. Frontier serving stacks have heterogeneous shapes across workloads. A kernel that runs 6x faster at the optimised shape but breaks at adjacent shapes is not deployable. The benchmark is the diagnostic; the open question is which agent-harness recipe (parameterised generation, shape-aware prompting, explicit unseen-configuration training) closes the gap.
Research angle
- What harness-level change closes the shape-hardcoding gap? The current failure is in the agent's coding pattern, not the underlying LLM. Whether explicit shape-parameterisation prompts, dimensional-symbol training, or shape-aware autotuning solves it is the natural follow-up.
- Cross-hardware generalisation. The benchmark targets HIP (AMD) and Triton (cross-vendor). Whether the same agents transfer to CUDA-native kernels with H200 / B200 / Blackwell-specific intrinsics is the next question. AgentKernelArena is extensible across hardware targets; the wiki should track the first CUDA-on-Blackwell results.
- Composing with the kernel-diagnostic stack. The 2026-05-18 Fournex bottleneck analyser produces evidence-backed remediations. The natural recipe is to give the agent that analyser's output as a tool call. Whether the agent that has tooling beats the agent that does not, and by how much, is the deployment-relevant compose.
Source
raw/huggingface/2026-05-19-agentkernelarena-generalization-aware-benchmarking-of-gpu-ke.md