Precise Debugging Benchmark: Models Regenerate, They Don't Debug

Date: 2026-04-21
Source: HuggingFace Daily Papers
Paper: arxiv 2604.17338
Raw: raw/huggingface/2026-04-21-precise-debugging-benchmark-is-your-model-debugging-or-regene.md

TL;DR

PDB (Precise Debugging Benchmark) measures not just whether bugs are fixed but how precisely: edit-level precision (necessary edits only) and bug-level recall (all bugs resolved). Frontier models (GPT-5.1-Codex, DeepSeek-V3.2-Thinking) achieve >76% unit test pass rates but precision below 45% — they regenerate correct but over-edited solutions rather than making targeted fixes. Iterative and agentic debugging strategies don't improve precision or recall.

Key Findings

Two new metrics: edit-level precision (fraction of edits that are necessary), bug-level recall (fraction of bugs actually resolved, not just masked)
Unit test pass rate is misleading: models that pass 76%+ of tests have precision below 45% — they're rewriting code, not debugging it
Over-editing is the dominant failure mode: models replace large code sections that don't need to change
Agentic strategies don't help: iterative debugging with tool use doesn't improve precision or recall — the fundamental failure is in the post-training objective, not the inference strategy
Two benchmarks released: PDB-Single-Hard (single-line bugs) and PDB-Multi (multi-line bugs)

Connection to Prior Wiki Work

This is the fourth consecutive benchmark in the wiki measuring the same meta-failure: frontier models appear to solve a task by conventional metrics but avoid the hard, targeted reasoning the task actually requires.

Benchmark	What models avoid	Surface metric	Real metric
AIMO 3 (04-17)	Hard reasoning under fixed compute	Majority vote score	pass@20 gap
DR3-Eval (04-18)	Genuine multi-source synthesis	Recall	Citation accuracy
GTA-2 (04-20)	Multi-step tool coordination	Single-step pass rate	Workflow completion
PDB (04-21)	Targeted fault localization	Unit test pass	Edit precision

The pattern: passing the benchmark ≠ doing the thing the benchmark is meant to test. In every case, models find a shortcut that satisfies the surface metric without solving the underlying cognitive task.

Implications for Coding Agent Post-Training

The PDB result is a direct indictment of coding model post-training pipelines. If models are trained to maximize unit test pass rate, they learn to regenerate correct solutions — which is correct behavior under that reward signal but wrong behavior for a debugging assistant. The fix requires a post-training objective that explicitly rewards surgical editing, not just correctness.

Precise Debugging Benchmark: Models Regenerate, They Don't Debug

Precise Debugging Benchmark: Models Regenerate, They Don't Debug

TL;DR

Key Findings

Connection to Prior Wiki Work

Implications for Coding Agent Post-Training

Related Pages