KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

TL;DR

LLM-generated Triton kernels are the most-hyped subfield of automated systems work. KernelBench-X actually measures it: 176 tasks across 15 categories, evaluated for both correctness and hardware efficiency. The result is a clear capability map. Task structure dominates method choice: category explains nearly 3x as much variance in correctness as method (9.4% vs 3.3%). Fusion is broken (72% failure across all five tested methods). Math is essentially solved. Quantization is completely unsolved, with 0/30 successes. Iterative refinement raises correctness but not performance. From Tsinghua.

Why this matters

The headline number, 0/30 on quantization, is the deployment story. Quantized inference is the largest single source of frontier-lab cost savings (TurboQuant on 04-22 was the cleanest example). The kernels that make quantization viable on production hardware are exactly the kernels LLMs cannot generate. So the bottleneck on automated kernel generation is not "make the model better at code." It's "make the model understand low-level numerics under hardware constraints." Those are different problems.

Mechanism

KernelBench-X structure:
  176 tasks                                Category-level results:
  ──────────                                ────────────────────
  15 categories                              Math:        ~all solved
   - Math                                    Fusion:      28% solved
   - Fusion           ┐                      Quantization: 0/30
   - Quantization     ├─ tested across       Reduction:   intermediate
   - Reduction        │  5 different
   - Memory           │  generation
   - ...              ┘  methods             Method explains 3.3% var
                                             Category explains 9.4% var
  Each task evaluated on:                    → category dominates
   - Semantic correctness
   - Hardware efficiency (vs hand-tuned)
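
The two evaluation axes can be made concrete with a small scoring function. This is a minimal sketch, not the benchmark's actual harness: `score_task`, the tolerance values, and the rule that speedup is only reported for correct kernels are all assumptions made for illustration.

```python
import math

def score_task(candidate_out, reference_out, candidate_ms, baseline_ms,
               rel_tol=1e-3, abs_tol=1e-5):
    """Score one kernel on correctness and efficiency (hypothetical helper)."""
    same_shape = len(candidate_out) == len(reference_out)
    # Correct only if every element matches the reference within tolerance.
    correct = same_shape and all(
        math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, r in zip(candidate_out, reference_out)
    )
    # Speedup > 1.0 means the candidate beats the hand-tuned baseline.
    # Assumption: a wrong kernel gets no speedup credit at all.
    speedup = baseline_ms / candidate_ms if correct else None
    return {"correct": correct, "speedup": speedup}

# Illustrative inputs only; these are not results from the paper.
fast_but_wrong = score_task([1.0, 2.5], [1.0, 2.0], candidate_ms=0.5, baseline_ms=1.0)
right_but_slow = score_task([1.0, 2.0], [1.0, 2.0], candidate_ms=4.0, baseline_ms=1.0)
```

Keeping the two axes separate is what lets a benchmark distinguish a kernel that is correct but slow from one that is fast but wrong.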

The category-vs-method variance decomposition is the methodological contribution. Prior benchmarks (KernelBench, TritonBench, MultiKernelBench, Robust-KBench) measured aggregate pass rates and ranked methods. KernelBench-X measures where methods break, and shows that the failure-mode signature is mostly task-type, not method.
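
One common way to compute "variance explained" by a grouping factor is eta-squared: the between-group sum of squares divided by the total sum of squares. The sketch below applies it to binary pass/fail outcomes. It is an assumption that the paper uses this or a similar statistic, and the records are synthetic, so the output does not reproduce the 9.4% vs 3.3% figures.

```python
from collections import defaultdict

def eta_squared(outcomes, labels):
    """Fraction of total variance in `outcomes` explained by grouping on `labels`."""
    n = len(outcomes)
    grand_mean = sum(outcomes) / n
    ss_total = sum((y - grand_mean) ** 2 for y in outcomes)
    if ss_total == 0:
        return 0.0  # no variance to explain
    groups = defaultdict(list)
    for y, g in zip(outcomes, labels):
        groups[g].append(y)
    # Between-group sum of squares: group size times squared mean offset.
    ss_between = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
        for ys in groups.values()
    )
    return ss_between / ss_total

# Synthetic records: (category, method, passed). Not data from the paper.
records = [
    ("math", "A", 1), ("math", "B", 1),
    ("fusion", "A", 0), ("fusion", "B", 1),
    ("quant", "A", 0), ("quant", "B", 0),
]
passed = [r[2] for r in records]
category_eta = eta_squared(passed, [r[0] for r in records])
method_eta = eta_squared(passed, [r[1] for r in records])
print(f"category: {category_eta:.3f}, method: {method_eta:.3f}")
```

In this toy data the category grouping explains far more of the pass/fail variance than the method grouping, which is the shape of the claim above.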

Connections to prior wiki

Confirms and refines AccelOpt (04-20). AccelOpt was the first wiki-tracked LLM-driven GPU kernel optimization system. It demonstrated wins on a narrow category (matmul-shaped tasks). KernelBench-X explains why that win didn't generalize: AccelOpt's category was favorable. Quantization, fusion, and memory-bound categories remain much harder.

The iterative refinement finding cuts against a prior assumption. Auto Research with Specialist Agents (2605.05724, also today) and ML Intern (04-22) both argue that iterative refinement loops produce reliable empirical wins on training-recipe optimization. KernelBench-X says that does not carry over to kernel generation. Iteration improves correctness (the kernel compiles and produces the right output) but not performance (the kernel is still slow). That asymmetry matters for system design: refinement loops fix correctness, but they do not surface optimization tricks the model didn't already know.

This is the third Tier 1 GPU paper in three weeks (AccelOpt 04-20, Stream-CQSA in this week's Kurate cs.LG #19, KernelBench-X today). The thread: automated kernel work is moving from speculative demo to systematic benchmark, and the benchmark is producing falsifiable failure-mode claims.

Research angle

Why is quantization 0/30? This is the single most important open question for production deployment. Is it because quantization kernels require numerical reasoning (rounding, dequant scaling, error compounding) that pretraining doesn't cover? Is it because the test cases have hardware-specific edge conditions? Is it that LLMs lack the bit-level mental model? A targeted training/eval paper on quantization-only kernel generation would be high-value.
What's the minimal architectural primitive that fixes Fusion? A 72% failure rate across all five methods suggests a missing primitive rather than a method-quality problem. Fusion requires reasoning about dataflow across multiple ops, which is structurally similar to multi-step reasoning. Possibly the right approach is RL on dataflow graphs, not next-token kernel generation.
Is iterative refinement the wrong scaffold for performance? KernelBench-X shows iteration fixes correctness, not performance, so the right scaffold for performance might not be iterative refinement at all. It might be search over kernel templates plus a learned cost model, similar to TVM's learned-cost-model tuners or Triton's autotuner (which benchmarks candidate configs directly), but driven by an LLM-generated proposal stream.
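
The search-plus-cost-model idea in the last question can be sketched in a few lines. Everything here is hypothetical: `pick_config`, the `block` parameter, the toy cost model, and the lookup table standing in for real hardware timings. The proposal list stands in for an LLM-generated stream of template parameters.

```python
def pick_config(proposals, predict_ms, measure_ms, top_k=2):
    """Rank proposals with a cheap cost model, then benchmark only the top_k."""
    shortlist = sorted(proposals, key=predict_ms)[:top_k]
    timings = [(measure_ms(cfg), cfg) for cfg in shortlist]
    best_ms, best_cfg = min(timings, key=lambda t: t[0])
    return best_cfg, best_ms, len(timings)

# Stand-in for an LLM proposal stream: tile sizes for a hypothetical template.
proposals = [{"block": 32}, {"block": 64}, {"block": 128}, {"block": 256}]

def predict_ms(cfg):
    # Toy cost model (assumption): believes block=128 is ideal. Deliberately imperfect.
    return abs(cfg["block"] - 128) / 128

def measure_ms(cfg):
    # Toy "hardware" timing table (assumption): block=64 is actually fastest.
    return {32: 3.0, 64: 1.0, 128: 1.5, 256: 4.0}[cfg["block"]]

best_cfg, best_ms, n_measured = pick_config(proposals, predict_ms, measure_ms)
print(best_cfg, best_ms, n_measured)
```

The cost model's top pick is wrong here, but measuring a short list corrects it while benchmarking only two of the four candidates. That is the trade this kind of scaffold makes: the proposal stream supplies candidates, the cost model prunes them, and real measurement rather than further model iteration decides on performance.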

Source

Paper: https://arxiv.org/abs/2605.04956
HuggingFace: https://huggingface.co/papers/2605.04956
Raw: raw/huggingface/2026-05-09-kernelbenchx-a-comprehensive-benchmark-for-evaluating-llmgen.md