inference-efficiency · 2026-04-20

AccelOpt: Self-Improving LLM Agent for AI Accelerator Kernel Optimization

TL;DR

An LLM agent that autonomously optimizes kernels for AWS Trainium by accumulating a memory of slow-fast kernel pairs. On NKIBench kernels, it raises the average percentage of peak throughput from 49% to 61% (Trainium 1) and from 45% to 59% (Trainium 2). Using open-source models, it matches Claude Sonnet 4's kernel improvements at 26x lower cost. No expert-provided hardware knowledge required.

Key Findings

The core loop:

Generate kernel variant
       │
  Benchmark on Trainium
       │
  Record (slow_kernel, fast_kernel) pair → optimization memory
       │
  Next iteration: consult memory → avoid past mistakes → build on past wins
       │
  Iterate until convergence

The intelligence is in the memory, not the model. The LLM is used as a code generator; the accumulated memory of what worked and what didn't provides the domain-specific guidance that would otherwise require an expert. This is the same pattern as TRACER (04-17): convert a hard expert-knowledge problem into a supervised-from-examples problem using accumulated traces.
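The loop above can be sketched in a few lines. This is a toy illustration under loud assumptions, not AccelOpt's implementation: the "kernel" is reduced to a single integer tile size, `benchmark` is a made-up latency function standing in for a Trainium profiling run, and `propose` is a hypothetical stand-in for the LLM call. All names here are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pair:
    slow: int       # kernel config that measured slower
    fast: int       # kernel config that measured faster
    speedup: float  # latency(slow) / latency(fast)


@dataclass
class Memory:
    """Optimization memory: a growing list of (slow, fast) pairs."""
    pairs: list = field(default_factory=list)

    def record(self, a, lat_a, b, lat_b):
        # Store the comparison oriented as slow -> fast.
        slow, fast = (a, b) if lat_a >= lat_b else (b, a)
        ratio = max(lat_a, lat_b) / min(lat_a, lat_b)
        self.pairs.append(Pair(slow, fast, ratio))

    def known_slow(self):
        return {p.slow for p in self.pairs}


def benchmark(tile):
    # ASSUMPTION: toy latency model, minimized at tile == 64.
    # A real system would compile and profile the kernel on hardware.
    return 10.0 + abs(tile - 64)


def propose(tile, memory):
    # ASSUMPTION: stand-in for the LLM. It suggests doubling or halving
    # the tile size and consults memory to skip configs already seen to lose.
    candidates = [tile * 2, max(1, tile // 2)]
    return [c for c in candidates if c not in memory.known_slow()]


def optimize(start, max_iters=10):
    memory = Memory()
    best, best_lat = start, benchmark(start)
    for _ in range(max_iters):
        improved = False
        for cand in propose(best, memory):
            lat = benchmark(cand)
            memory.record(best, best_lat, cand, lat)
            if lat < best_lat:
                best, best_lat, improved = cand, lat, True
                break
        if not improved:  # converged: no candidate beat the current best
            break
    return best, best_lat, memory


if __name__ == "__main__":
    best, latency, memory = optimize(start=8)
    print(best, latency, len(memory.pairs))
```

The point of the sketch is the division of labor: `propose` carries no hardware knowledge of its own, and the search only stays on track because each benchmarked comparison is written back to memory and filtered against on the next iteration.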

NKIBench: a new benchmark of AWS Trainium kernels of varying complexity, extracted from real LLM workloads. This fills a gap: prior kernel benchmarks (CUDA-focused) don't transfer to Neuron/NKI programming models.

Cost asymmetry: 26x cost reduction vs. Claude Sonnet 4 while matching quality. The task (pattern-match from slow-fast examples → generate improved variant) turns out to be tractable for open-source models once the memory scaffold provides the right context. This suggests kernel optimization is a well-structured enough domain that task framing + memory can compensate for model capability gaps.

Improvement plateau: going from 49% to 61% on Trainium 1 still leaves 39 percentage points of peak throughput on the table (41 points on Trainium 2, at 59%). The memory-based approach helps but doesn't close the full gap, likely because some optimizations require genuinely novel reasoning not captured in the slow-fast memory.
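The headroom arithmetic can be made explicit with the four reported numbers. The helper below is a small sketch; the function name and the returned keys are made up for this note, and "relative gain" is simply the gain divided by the starting percentage.

```python
def headroom(before_pct, after_pct):
    """Summarize a change in percentage-of-peak throughput."""
    gain = after_pct - before_pct
    return {
        "absolute_gain_pts": gain,            # percentage points gained
        "relative_gain": gain / before_pct,   # fraction of the starting level
        "remaining_pts": 100 - after_pct,     # points still below peak
    }


# Figures reported for AccelOpt on the two Trainium generations.
trainium1 = headroom(49, 61)
trainium2 = headroom(45, 59)

if __name__ == "__main__":
    for name, stats in (("Trainium 1", trainium1), ("Trainium 2", trainium2)):
        print(name, stats)
```

In relative terms the gains are roughly 24% (Trainium 1) and 31% (Trainium 2) over the baseline, while about 40 points of peak remain unclaimed on both chips.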

Relations to Prior Wiki Pages

  • TRACER (04-17): Same pattern — accumulate the LLM's own outputs as training signal for a cheaper replacement. TRACER does this for classification (production logs → surrogate). AccelOpt does this for generation (benchmark runs → memory). Both exploit the insight that a capable model's traces are better training data than hand-curated examples.
  • Hardware gap: The wiki has no existing hardware/GPU kernel concept page. AccelOpt is the first hardware-tier paper. Creating wiki/hardware/gpu-kernels.md.

Raw Source

raw/huggingface/2026-04-20-accelopt-a-self-improving-llm-agentic-system-for-ai-accelera.md