AccelOpt: Self-Improving LLM Agent for AI Accelerator Kernel Optimization
TL;DR
An LLM agent that autonomously optimizes kernels for AWS Trainium by accumulating a memory of slow-fast kernel pairs. On NKIBench kernels, it raises the average percentage of peak throughput from 49% → 61% (Trainium 1) and 45% → 59% (Trainium 2). Using open-source models, it matches the kernel improvements of Claude Sonnet 4 at 26x lower cost. No expert-provided hardware knowledge is required.
Key Findings
The core loop:
Generate kernel variant
│
Benchmark on Trainium
│
Record (slow_kernel, fast_kernel) pair → optimization memory
│
Next iteration: consult memory → avoid past mistakes → build on past wins
│
Iterate until convergence
The intelligence is in the memory, not the model. The LLM is used as a code generator; the accumulated memory of what worked and didn't provides the domain-specific guidance that would otherwise require an expert. This is the same pattern as TRACER (04-17): convert a hard expert-knowledge problem into a supervised-from-examples problem using accumulated traces.
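The loop above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the paper's implementation: a "kernel" is reduced to a single integer tile size, `benchmark` is a hypothetical stand-in for a Trainium run, and `propose` is a hypothetical stand-in for the LLM call. Only the control flow (generate, benchmark, record the pair, consult memory) mirrors the description.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Optimization memory of (slow_kernel, fast_kernel, speedup) records."""
    pairs: list = field(default_factory=list)

    def record(self, slow, fast, speedup):
        self.pairs.append((slow, fast, speedup))

    def top(self, k=3):
        # The highest-speedup pairs are what the generator gets to see.
        return sorted(self.pairs, key=lambda p: p[2], reverse=True)[:k]


def benchmark(kernel):
    """Hypothetical stand-in for a hardware run: toy latency, lower is better."""
    return abs(kernel - 128) + 10


def propose(kernel, examples, attempt):
    """Hypothetical stand-in for the LLM: reuse the step of the best past win."""
    if examples:
        slow, fast, _ = examples[0]
        step = fast - slow
    else:
        # No memory yet: explore blindly, alternating direction.
        step = -32 if attempt % 2 == 0 else 32
    return max(1, kernel + step)


def optimize(baseline, iterations=20):
    memory = Memory()
    best, best_latency = baseline, benchmark(baseline)
    for attempt in range(iterations):
        candidate = propose(best, memory.top(), attempt)
        latency = benchmark(candidate)
        if latency < best_latency:
            # Record the slow-fast pair, then build on the win.
            memory.record(best, candidate, best_latency / latency)
            best, best_latency = candidate, latency
    return best, best_latency, memory


best, latency, memory = optimize(baseline=32)
print(best, latency, len(memory.pairs))
```

In this toy, the first blind guess fails, the second succeeds, and every later proposal reuses the recorded win until the step stops paying off. The sketch records only wins; a fuller version would also store failed variants so the generator can avoid them.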
NKIBench: a new benchmark of AWS Trainium kernels of varying complexity, extracted from real LLM workloads. This fills a gap: prior kernel benchmarks are CUDA-focused and do not transfer to the Neuron/NKI programming model.
Cost asymmetry: 26x cost reduction vs. Claude Sonnet 4 while matching its kernel improvements. The task (pattern-match from slow-fast examples → generate improved variant) turns out to be tractable for open-source models once the memory scaffold provides the right context. This suggests kernel optimization is a sufficiently well-structured domain that task framing + memory can compensate for model capability gaps.
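To make "the memory scaffold provides the right context" concrete, here is a minimal sketch of how slow-fast pairs might be turned into a few-shot prompt. The function name `build_prompt`, the prompt layout, and the ranking by speedup are all assumptions for illustration; the source does not describe the actual prompt format.

```python
def build_prompt(target_kernel, pairs, max_examples=2):
    """Assemble a few-shot prompt from (slow, fast, speedup) records.

    Hypothetical format: the best-performing pairs come first, followed by
    the kernel the model is asked to improve.
    """
    best = sorted(pairs, key=lambda p: p[2], reverse=True)[:max_examples]
    parts = []
    for i, (slow, fast, speedup) in enumerate(best, start=1):
        parts.append(
            f"### Example {i} (speedup {speedup:.2f}x)\n"
            f"Slow:\n{slow}\nFast:\n{fast}"
        )
    parts.append(f"### Task\nRewrite this kernel to run faster:\n{target_kernel}")
    return "\n\n".join(parts)


# Placeholder strings stand in for real kernel source code.
pairs = [("slow_a", "fast_a", 1.5), ("slow_b", "fast_b", 2.0), ("slow_c", "fast_c", 1.1)]
prompt = build_prompt("target_kernel_source", pairs)
print(prompt)
```

The point of the sketch is that the model only needs to generalize from concrete before/after examples, which is a much narrower task than reasoning about the hardware from scratch.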
Improvement plateau: on Trainium 1, 49% → 61% still leaves 39% of peak throughput on the table (on Trainium 2, 45% → 59% leaves 41%). The memory-based approach helps but does not close the full gap, likely because some optimizations require genuinely novel reasoning not captured in the slow-fast memory.
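The headroom arithmetic can be checked directly from the TL;DR figures. The snippet below uses only the four reported percentages; the field names are illustrative.

```python
# (before, after) percent of peak throughput, taken from the TL;DR.
results = {"Trainium 1": (49, 61), "Trainium 2": (45, 59)}


def summarize(before, after):
    return {
        "gain_points": after - before,  # absolute gain in percentage points
        "relative_gain_pct": round(100 * (after - before) / before, 1),
        "headroom_points": 100 - after,  # distance left to peak
    }


summary = {chip: summarize(*pair) for chip, pair in results.items()}
for chip, stats in summary.items():
    print(chip, stats)
```

This gives a 12-point gain (about 24.5% relative) with 39 points of headroom on Trainium 1, and a 14-point gain (about 31.1% relative) with 41 points of headroom on Trainium 2.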
Relations to Prior Wiki Pages
- TRACER (04-17): Same pattern — accumulate the LLM's own outputs as training signal for a cheaper replacement. TRACER does this for classification (production logs → surrogate). AccelOpt does this for generation (benchmark runs → memory). Both exploit the insight that a capable model's traces are better training data than hand-curated examples.
- Hardware gap: The wiki has no existing hardware/GPU kernel concept page. AccelOpt is the first hardware-tier paper. Creating wiki/hardware/gpu-kernels.md.
Raw Source
→ raw/huggingface/2026-04-20-accelopt-a-self-improving-llm-agentic-system-for-ai-accelera.md