The First Token Knows: Single-Decode Confidence for Hallucination Detection
arXiv: 2605.05166 Tier: 2 — responsible-ai / inference efficiency
TL;DR
The normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode (call it phi_first) matches or modestly exceeds semantic self-consistency on closed-book short-answer factual QA: mean AUROC 0.820 vs 0.793 across three 7-8B models (Llama-3.1-8B, Mistral-7B-v0.3, Qwen2.5-7B) on PopQA and TriviaQA (n=1000 each). The compute saving is structural: one greedy decode versus one greedy decode plus ten samples plus NLI clustering, or roughly 1/11 of the generation cost.
What the metric does
phi_first is the normalized entropy of the top-K logits at the model's first content-bearing answer token, read from a single greedy decode. Low entropy means the model is confident in the answer's first content word; high entropy means it is not, and that uncertainty correlates strongly with the eventual full answer being hallucinated.
Question: "In what year was the Treaty of Westphalia signed?"
│
▼
greedy decode begins
│
▼
first content-bearing token: "1"
top-K logits at this position, for candidate tokens: ["1", "6", "7", ...]
│
▼
normalized entropy of these logits
│
▼
phi_first = uncertainty score
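The last two steps of the diagram above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: it assumes the score is the Shannon entropy of a softmax taken over the top-K logits only, divided by ln(K) so that it lies in [0, 1]. The function name `phi_first` used as a Python identifier, the choice of K = 5, and the logit values are illustrative assumptions; locating the first content-bearing token is left out.

```python
import math

def phi_first(top_k_logits):
    """Normalized entropy of a softmax over the top-K logits, in [0, 1].

    Assumption: the softmax is renormalized over the top-K entries only,
    and the entropy is divided by ln(K). The paper may normalize differently.
    """
    k = len(top_k_logits)
    if k < 2:
        return 0.0  # a single candidate carries no uncertainty
    # Numerically stable softmax restricted to the top-K logits.
    m = max(top_k_logits)
    exps = [math.exp(x - m) for x in top_k_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Shannon entropy in nats, skipping zero-probability entries.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # ln(K) is the entropy of a uniform distribution over K candidates.
    return entropy / math.log(k)

# Made-up logits: one dominant candidate versus five near-ties.
confident = phi_first([12.0, 3.0, 2.5, 1.0, 0.5])
uncertain = phi_first([4.0, 3.9, 3.8, 3.7, 3.6])
print(f"confident={confident:.3f} uncertain={uncertain:.3f}")
```

A peaked distribution gives a score near 0 and a flat one gives a score near 1, so a higher value reads as a higher hallucination risk.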
Compare to the standard alternative (semantic self-consistency):
greedy decode (1×) + 10 sampled decodes + NLI model clusters answers by meaning
→ uncertainty = inverse of cluster agreement
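For contrast, the self-consistency pipeline above might look like the sketch below. Several pieces are assumptions made for illustration: `same_meaning` is a hypothetical stand-in for the NLI equivalence check (here just normalized string equality, which is closer to the surface-form variant), agreement is taken to be the share of all eleven answers in the largest cluster, and "inverse" is read as 1 minus that share. The sample answers are invented.

```python
def normalize(answer):
    """Crude surface normalization: lowercase, drop punctuation, trim."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return kept.strip()

def same_meaning(a, b):
    """Hypothetical placeholder for an NLI-based equivalence check."""
    return normalize(a) == normalize(b)

def cluster_answers(answers, equivalent=same_meaning):
    """Greedy clustering: join the first cluster whose representative matches."""
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters

def self_consistency_uncertainty(greedy, samples, equivalent=same_meaning):
    """1 - (size of largest cluster / number of answers). Assumed definition."""
    answers = [greedy] + list(samples)
    clusters = cluster_answers(answers, equivalent)
    largest = max(len(c) for c in clusters)
    return 1.0 - largest / len(answers)

# Invented outputs for one question: 1 greedy decode plus 10 samples.
greedy = "1648"
samples = ["1648", "1648.", "1618", "1648", "1650",
           "1648", "1648", "1648 ", "1659", "1648"]
uncertainty = self_consistency_uncertainty(greedy, samples)

# Generation cost in decodes per question, ignoring the NLI model itself.
decodes_phi_first = 1
decodes_self_consistency = 1 + len(samples)
cost_ratio = decodes_phi_first / decodes_self_consistency
```

The `cost_ratio` line is where the 1/11 figure comes from: eleven decodes per question against one, with the NLI clustering cost on top.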
Result summary
| Method | AUROC (mean across 3 models × 2 benchmarks) |
|---|---|
| phi_first (single decode) | 0.820 |
| Semantic self-consistency (1 + 10 + NLI) | 0.793 |
| Surface-form self-consistency | 0.791 |
phi_first is moderately to strongly correlated with semantic agreement (Pearson 0.54-0.76). A logistic ensemble of phi_first + semantic agreement yields only +0.02 AUROC over phi_first alone, evidence that phi_first captures most of semantic agreement's discriminative power.
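As a reminder of what the AUROC column measures, here is a small sketch of the pairwise (Mann-Whitney) formulation: the probability that a randomly chosen hallucinated answer receives a higher uncertainty score than a randomly chosen correct one, with ties counted as half. The scores and labels below are made up for illustration and are not taken from the paper; the label convention (1 = hallucinated) is also an assumption.

```python
def auroc(scores, labels):
    """AUROC as P(score of a positive > score of a negative), ties = 0.5.

    labels: 1 for hallucinated answers, 0 for correct ones (assumed convention).
    Runs in O(n_pos * n_neg), which is fine for a few thousand questions.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy uncertainty scores: three hallucinated answers, three correct ones.
toy_scores = [0.90, 0.80, 0.35, 0.70, 0.20, 0.10]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

On this toy data eight of the nine positive-negative pairs are ordered correctly, so `toy_auroc` is 8/9. A score of 0.5 would mean the detector ranks no better than chance.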
The partial-correlation analysis on answer length is the methodological touch worth noting. The apparent association between phi_first and answer length largely disappears after controlling for correctness. The signal is real, not a length artifact.
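The length check can be illustrated with the standard first-order partial correlation, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The sketch below is an assumption about the form of the analysis, not the paper's code, and the eight data points are constructed by hand so that phi_first and answer length are related only through correctness.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y after controlling for z (first-order formula)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hand-built toy data, not from the paper. Incorrect answers (correct = 0)
# have both higher phi_first and longer answers; within each group the two
# are unrelated.
correct = [0, 0, 0, 0, 1, 1, 1, 1]
phi     = [0.8, 0.9, 0.8, 0.9, 0.2, 0.3, 0.2, 0.3]
length  = [5, 5, 7, 7, 1, 1, 3, 3]

raw = pearson(phi, length)                        # about 0.88
controlled = partial_corr(phi, length, correct)   # about 0
```

The raw correlation looks strong, but it vanishes once correctness is held fixed, which is the pattern the note describes: the length association is carried by correctness rather than being an independent confound.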
How this relates to prior wiki work
- Net new for the responsible-ai topic. The wiki has been tracking agent security (05-04) and agentic posttraining alignment (04-22), but no prior page has covered hallucination-detection efficiency.
- Cross-source intersection with the wiki's general inference-efficiency thread. phi_first reduces hallucination guardrail compute by an order of magnitude. If this generalizes beyond closed-book short-answer QA, it's deployment-changing.
- Connection to today's SxS Disclosure Policies paper. Both papers operate on the question of when an LLM commits to an answer. phi_first reads commitment from the model's logits. SxS makes commitment a learned action. Two complementary framings of the same problem.
What's surprising
The recommendation in the abstract is the kind of falsifiable claim that's rare in this literature: "First-token confidence should be reported as a default, low-cost baseline before invoking sampling-based uncertainty estimation." This is a methodological intervention as much as a result. If the broader literature adopts phi_first as a baseline, it will reset the bar that subsequent uncertainty-estimation methods need to clear.
Open questions
- Long-form generation. All experiments are short-answer factoid QA. Whether the first-token signal survives in long-form structured generation is open.
- Tool-grounded outputs. The wiki's main interest is agentic systems where the "answer" is a tool call. phi_first as a deployment-time gate at the first action token is the obvious next experiment.
- Multi-stage reasoning. Chain-of-thought prefixes the answer with reasoning. Where exactly is the "first content-bearing answer token" in a CoT trajectory? The paper's setup is direct-answer; CoT is where the practical questions live.
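One way to picture the deployment-time gate raised in the open questions above is a simple threshold on phi_first that decides whether to accept an output or escalate to a more expensive check. Everything here is hypothetical: the `GateDecision` type, the `gate` function, and the 0.5 threshold are placeholders, and a real threshold would have to be calibrated on held-out data for the model and task in question. Whether the signal transfers to tool calls or CoT trajectories at all is exactly what remains untested.

```python
from dataclasses import dataclass

@dataclass
class GateDecision:
    action: str        # "accept" or "escalate"
    phi_first: float   # the score the decision was based on

def gate(phi_first_score, threshold=0.5):
    """Threshold gate on a normalized-entropy score in [0, 1].

    threshold=0.5 is an arbitrary placeholder, not a recommended value.
    """
    if not 0.0 <= phi_first_score <= 1.0:
        raise ValueError("phi_first is expected to lie in [0, 1]")
    # High entropy at the first content token: hand off to a costlier check,
    # for example sampling-based self-consistency or human review.
    action = "escalate" if phi_first_score > threshold else "accept"
    return GateDecision(action=action, phi_first=phi_first_score)

low_risk = gate(0.08)
high_risk = gate(0.91)
```

Under this design only the escalated fraction of requests pays the eleven-decode cost, so the average overhead depends on how often the first-token score crosses the threshold.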
Why it matters
Production-scale hallucination detection has been blocked on cost. Ten-sample generation per question is not deployable for most agent systems. phi_first is one decode. The cost structure of hallucination guardrails changes by an order of magnitude if this generalizes. For the wiki's agent-security thread, this is the deployment-time piece that production agents need.