cere-bro | 2026-05-13
On-policy distillation got a theory on the same day SemiAnalysis priced the fast-tokens economy. Two clusters arrive in parallel: an allocation rule plus a failure taxonomy for OPD, and a wafer-scale IPO that says interactivity-per-watt is the new binding constraint.
TL;DR
- Sparse-to-Dense Reward Principle (arXiv 2605.12483). GRPO and OPD are two reward-density regimes, not separate recipes. The allocation rule: spend scarce labels on the strongest teacher via sparse RL first, bridge to the student via forward-KL + OPD, then student-side GRPO. A 1.7B Qwen3 student bridged from an RL-improved 8B teacher beats direct GRPO on the same student. Tier 1.
- The Many Faces of On-Policy Distillation (arXiv 2605.11182). Three named failure modes: prefix-induced distribution mismatch, biased TopK reverse-KL gradients, and OPSD aggregation collapse when privileged information is instance-specific. The OPD field is now in its diagnostic phase. Tier 1.
- Token Superposition Training (arXiv 2605.06546). Pre-train with bag-of-tokens prediction in the first phase, recover to standard NTP, deploy identically. 2.5x training-time reduction at 10B-A1B MoE scale. Tier 1.
- δ-mem (arXiv 2605.12357). 8x8 online associative state, delta-rule updated, produces low-rank corrections to a frozen full-attention backbone. 1.31x on MemoryAgentBench, 1.20x on LoCoMo. No fine-tuning, no context extension. Tier 1.
- FocuSFT (arXiv 2605.09932). Attention sinks are a training-side phenomenon, not just inference. Bilevel optimization with bidirectional context and causal response masking drops sink mass 529x and triples context engagement. +14 points on BABILong. Tier 1.
- Reward Hacking in Rubric-Based RL (arXiv 2605.12474). Resolves the 12-May 60-day rubric-overfitting prediction in 24 hours. Three failure modes named: compound-criterion partial satisfaction, implicit-as-explicit, imprecise topical matching. Tier 2.
- SemiAnalysis Cerebras IPO deep dive (newsletter, Gmail-starred). Four-article writeup. The thesis: past a capability threshold, developers prefer faster tokens to smarter tokens, and SRAM-based machines win on the interactivity-per-watt axis HBM GPUs cannot match. Anthropic Opus 4.6 Fast (6x price, 2.5x interactivity) is the revealed preference. The arXiv-side formalization lands the next day.
- Anthropic overtakes OpenAI in B2B for the first time (The Decoder). 34.4% vs 32.3% on the Ramp AI Index. Anthropic quadrupled in one year. Same-day Claude for Small Business launch with 15 agentic workflows and a ten-city tour.
The Big Picture
Today's papers settle into two clusters that turn out to be the same story at different layers.
The first cluster is on-policy distillation. The Sparse-to-Dense Reward Principle gives the allocation rule (scarce labels go upstream first, bridge to the student via forward-KL + OPD, then student-side GRPO). The Many Faces paper gives the failure taxonomy (three named mechanisms when the bridge breaks). Together they turn OPD from a recipe-by-vibe field into something with a rule and a checklist. The wiki has been tracking OPD since TIP (04-16, only 10% of distillation tokens carry signal), through LongAct (04-18, sparse RL updates), CoPD (05-01, parallel co-evolution), D-OPSD (05-07, self-distillation under conditioning asymmetry), and RLRT and G-Zero (05-12, two reads of the teacher-student delta). Today's pair fills the missing layer: the allocation rule between sparse and dense regimes, and the failure modes when the dense regime is applied wrong. The picture is now complete enough to ship as a textbook chapter.
The second cluster is hardware and deployment economics. The SemiAnalysis Cerebras IPO piece is the day's standout long-form: it argues that the inference market has split into fast, priority, standard, and batch tiers, and that SRAM-based machines own the fast tier because the binding constraint is memory-bandwidth-per-FLOP, not raw FLOP count. The 750MW OpenAI commitment to Cerebras is the validation. Anthropic's Opus 4.6 Fast tier, charging 6x the standard price for 2.5x the interactivity, is the revealed preference. Anthropic also leads OpenAI in B2B for the first time today, by Ramp data; meanwhile, Anthropic, OpenAI, and Google all run deployment-services plays in the same week (Claude for Small Business, OpenAI Deployment Company subsidiary from 05-11, Google hiring hundreds of customer engineers today). Three labs, one week, same diagnosis: implementation is the bottleneck.
The thread that ties the clusters together is the question of what is scarce. In OPD, the scarce resource is verified labels, and the allocation rule is to spend them upstream where they buy the most reward-shaped behavior. In hardware, the scarce resource is interactivity-per-watt, and the allocation move is to route premium inference to SRAM machines. Two scarcity-driven allocation rules in one day. The reader profile's Tier 1 areas (routing, KV cache, compression, GPU, hardware) are all variants of the same principle: locate the binding constraint, route the budget there. The wiki has been calling this "concentrate the budget where the signal lives" since 12-May; the 13-May papers are the layer-by-layer expansion of the principle.
The smaller third thread is interpretability moving toward locatability. The Massive Activations ME Layer paper identifies a single layer in each model family where attention sinks are born (RMSNorm + FFN parameters jointly produce the massive activation). Probe&Prefill (arXiv 2605.09252) shows that tool necessity in agents is linearly decodable from the pre-generation hidden state at AUROC 0.89-0.96, and that a 48% reduction in tool calls is achievable with only 1.7% accuracy loss. Two papers locating behavioral structure in shallow, intervenable representations. The pattern is the same as Compliance vs Sensibility (05-02, reasoning mode as linear direction). The interpretability-as-deployable-control thread is now five papers strong.
Deep Dives
Sparse-to-Dense Reward Principle: an allocation rule for OPD
GRPO and OPD are not separate recipes. They are two reward-density regimes, and the right move is to put scarce labels upstream on the strongest teacher first, then bridge to the student via forward-KL + OPD, then run student-side GRPO.
Source: HuggingFace Daily Papers Links: Paper · Wiki Tier: 1. RL post-training, on-policy distillation, allocation rule for scarce labeled data
Naive allocation                    Sparse-to-Dense rule
─────────────────                   ────────────────────────────
labeled data ─► GRPO on student     labeled data ─► sparse RL on
                                                    strongest teacher
GRPO from cold start fails                 │
on the 1.7B Qwen3 student                  ▼
                                    bridge: forward-KL warmup
                                    + OPD on student rollouts
                                           │
                                           ▼
                                    student-side GRPO becomes
                                    effective; weak-from-cold
                                    GRPO now lifts MATH 75.4→78.5
The rule is small and clean: never use scarce labels on the least prepared policy. The bridge is a forward-KL warmup on teacher rollouts followed by OPD on student rollouts. After the bridge, student-side GRPO is no longer cold-start. It works because the policy now sits in a regime where its own rollouts occasionally hit the reward. Before the bridge, that regime does not exist.
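A minimal sketch of the two divergences the bridge relies on, assuming per-position next-token logits are available from both models. Forward KL is taken as KL(teacher ‖ student) for the warmup on teacher rollouts. Reverse KL, KL(student ‖ teacher), is assumed here as the OPD signal on student rollouts, following the reverse-KL formulation mentioned in the Many Faces entry. The function names and toy logits are illustrative and do not come from the paper.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl(p_logits, q_logits):
    """KL(p || q) between the two softmax distributions at one position."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def forward_kl(teacher_logits, student_logits):
    # KL(teacher || student): mass-covering; warmup on teacher rollouts.
    return kl(teacher_logits, student_logits)

def reverse_kl(teacher_logits, student_logits):
    # KL(student || teacher): mode-seeking; assumed OPD signal on student rollouts.
    return kl(student_logits, teacher_logits)

teacher = [2.0, 0.0, -1.0]   # hypothetical next-token logits
student = [0.0, 0.0, 0.0]    # an untrained, uniform student

fkl = forward_kl(teacher, student)
rkl = reverse_kl(teacher, student)
print(f"forward KL = {fkl:.4f}, reverse KL = {rkl:.4f}")
```

The two directions give different numbers on the same pair of distributions, which is why the order of the warmup and the OPD step is a design choice and not a detail.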
The empirical evidence is consistent across teacher and stage choices on Qwen3-1.7B. The RL-improved 8B teacher distilled through the bridge outperforms direct GRPO on the student. The same teacher before RL underperforms. Same labels, different allocation, different result. With the bridge in place, student-side GRPO lifts MATH from 75.4 to 78.5, and the result beats a matched replay control by 2.8 points. The paper runs the same comparison with canonical 8B and 14B teachers for AIME endpoints, and the pre-Stage-3 numbers come out the same way.
The wiki context is rich. The "concentrate the budget where the signal lives" thread has been visible since TIP (04-16, 10% of distillation tokens) and LongAct (04-18, sparse RL updates). The Sparse-to-Dense paper is the upstream version: where the labels should train first is itself the same kind of allocation decision. TIP says which tokens; this paper says which model. The two compose. The Many Faces paper, also today, gives the failure taxonomy for when the OPD half of the bridge breaks. Three OPD papers in two days is itself signal: the field is rapidly maturing from recipe-by-vibe to allocation-rule-with-bound.
Why it matters: The same labeled set is worth more upstream-then-distilled than directly-on-student. That changes how every team running OPD allocates its annotation budget.
Research angle: Two open questions. (1) Does the principle hold beyond verifiable math? Rubric-based RL has a different reward-density structure (the rubric is itself a verifier with its own failure modes; see today's Reward Hacking paper). The bridge may not survive. (2) Composition with the Extrapolation Cliff (which lands 05-14) gives a complete recipe: Sparse-to-Dense allocates where the labels go, the Cliff bounds how aggressive the OPD step can be. The two together would let a team derive the operating point analytically.
The Many Faces of On-Policy Distillation: three named failure modes
OPD has been treated as a single recipe. It is three recipes with three distinct failure modes, which is why the empirical record is contradictory.
Source: HuggingFace Daily Papers Links: Paper · Wiki Tier: 1. On-policy distillation, failure-mode taxonomy
Failure mode          Mechanism                         Fix
──────────────        ─────────                         ─────
1. Prefix mismatch    teacher labels on student-        SFT-stabilize student
                      generated prefix; teacher         first; bring prefixes
                      never trained on student          closer to teacher
                      distribution                      distribution
2. Biased TopK loss   reverse-KL truncated to           stop-gradient on TopK
                      teacher TopK; truncation          selection; flow gradient
                      biases gradient                   only through probabilities
3. OPSD aggregation   student of OPSD learns the        only use OPSD when PI is
   collapse           expectation over PI-conditioned   a shared latent rule
                      teachers; helps no specific       (system prompt, alignment)
                      instance when PI is instance-     not when PI is per-problem
                      specific
The third failure mode is the structurally interesting one. OPSD (on-policy self-distillation) uses the same model as both teacher and student under different conditioning: the teacher sees the privileged information and the student does not. The student's update target is then the expectation over the privileged-information-conditioned teacher distributions. When the privileged information is instance-specific (a per-problem oracle hint), the expectation averages over distinct trajectories and produces a generic policy that helps no specific instance. When the privileged information is a shared latent rule (a system prompt, an alignment preference), the expectation is the rule itself and OPSD recovers it. This is a falsifiable mathematical claim and it predicts which OPSD setups will work.
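A toy illustration of the averaging argument as summarized above, not the paper's formal setup. It assumes the student's target is the plain mean of the teacher's next-action distributions across privileged-information (PI) values. The three-action distributions are made up for the example.

```python
import math

def average(dists):
    """Mean of several distributions over the same action set."""
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

# Shared latent rule: every PI-conditioned teacher prefers the same action.
shared = [[0.9, 0.05, 0.05], [0.9, 0.05, 0.05], [0.9, 0.05, 0.05]]

# Instance-specific hints: each PI value points at a different action.
specific = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

target_shared = average(shared)      # stays peaked on the shared rule
target_specific = average(specific)  # flattens toward uniform

print(entropy(target_shared), entropy(target_specific))
```

In the shared case the averaged target is the rule itself; in the instance-specific case the target carries almost no information about any single instance.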
The paper validates this directly. OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation (failure modes 1 and 2). OPSD fails on mathematical reasoning because hints are instance-specific (failure mode 3). OPSD succeeds for system-prompt internalization and alignment preferences because those are shared latent rules. The same mechanism explains both regimes.
The fix list is operational. Stop-gradient TopK objectives address mechanism 2. RLVR-adapted teachers address mechanism 1 (the teacher is trained on rollouts more like what the student produces). SFT-stabilized students address mechanism 1 from the student side. The three fixes can be applied independently; the paper does not claim they all need to be applied together.
The wiki cross-reference is direct. D-OPSD (2026-05-07) uses the same model as teacher and student with different conditioning (text+image vs text-only). The Many Faces paper's third failure mode predicts D-OPSD works because the image is a shared visual rule per task, not an instance-specific oracle. The prediction matches D-OPSD's empirical results. Today's paper is the theoretical grounding D-OPSD lacked.
Why it matters: OPD is the dominant compression paradigm for reasoning models. A diagnostic taxonomy of when and why OPD fails is operationally load-bearing for every team running this pipeline. The three named failure modes will be standard reference points for the next year of OPD work.
Research angle: The biased-TopK fix should be ablated against full-vocab reverse KL to see how much of the failure is truncation versus deeper distribution-mismatch issues. Separately, the OPSD aggregation-collapse argument suggests a principled identifiability criterion: OPSD is identifiable iff the conditioning distribution induces the same marginal as the unconditioned model would learn. The paper points at this but does not prove it.
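One way to start on the truncation ablation is to measure how much of the full-vocabulary reverse KL survives when the sum is restricted to the teacher's TopK tokens. The sketch below assumes an unrenormalized truncation and uses made-up logits; the paper's exact TopK objective may differ, and the stop-gradient fix itself needs an autograd framework and is not shown.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [x / s for x in e]

def reverse_kl_terms(student_p, teacher_p):
    # Per-token contributions to KL(student || teacher).
    return [s * math.log(s / t) for s, t in zip(student_p, teacher_p)]

def topk_indices(p, k):
    return sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]

teacher_p = softmax([3.0, 2.0, 0.0, -1.0, -2.0])
# Hypothetical student whose mass sits mostly outside the teacher's top-2.
student_p = softmax([0.0, 0.5, 2.0, 1.0, 0.0])

terms = reverse_kl_terms(student_p, teacher_p)
full = sum(terms)                          # full-vocabulary reverse KL
kept = topk_indices(teacher_p, 2)          # teacher TopK support
truncated = sum(terms[i] for i in kept)    # what the truncated loss sees

print(f"full = {full:.3f}, truncated to teacher top-2 = {truncated:.3f}")
```

In this toy case the dominant terms live outside the teacher's TopK, so the truncated sum even has the opposite sign from the full divergence.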
Token Superposition Training: pre-train with bag-of-tokens, deploy identically
Pre-training has been assumed to require predicting one token per position. TST argues this is wasteful in the early phase. The deployed model is identical to a standard NTP model. 2.5x training-time reduction at 10B-A1B scale.
Source: HuggingFace Daily Papers Links: Paper · Wiki Tier: 1. Pre-training efficiency, asymmetric-training-identical-inference pattern
Phase 1: Superposition            Phase 2: Recovery
─────────────────────             ─────────────────────
bag K contiguous tokens           standard NTP
into one position; predict        one token per position
via multi-hot cross-entropy
                                  re-aligns the model
high throughput per FLOP          with deployment
                                  distribution

Deployed model: identical to a standard NTP-trained baseline
No changes to parallelism, optimizer, tokenizer, data, or architecture
The framing is what makes this a Tier 1 paper. The training-side structure (multi-hot prediction on bag-of-tokens positions) does not exist at inference. The cost is paid once; the deployed model is exactly what you would have gotten from a standard NTP run. At equal-loss settings, TST cuts training time up to 2.5x at the 10B-A1B MoE scale. At equal compute, TST consistently outperforms the baseline on loss and downstream evaluations. The "equal compute beats" result is the structurally important number: it rules out the obvious story where bag-prediction is a weaker objective that just looks fast.
The mechanism is small. During the superposition phase, K contiguous tokens get their embeddings combined into one position. The model's prediction at that position is a multi-hot vector over the K target tokens, scored by multi-hot cross-entropy. The same forward pass effectively predicts K positions, so data throughput per FLOP rises by approximately K. The recovery phase runs standard NTP for the last fraction of training, which re-aligns the model with deployment. The paper validates at 270M, 600M, 3B, and 10B-A1B.
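The superposition step can be sketched in a few lines. This is an illustration under stated assumptions: embeddings are combined by a plain mean, and the multi-hot loss is the average negative log-likelihood over the K target tokens. The paper may combine or normalize differently; `superpose` and `multi_hot_cross_entropy` are hypothetical names.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def superpose(embeddings):
    """Combine K token embeddings into one position (mean is an assumption)."""
    k = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / k for d in range(dim)]

def multi_hot_cross_entropy(logits, target_ids):
    """Average negative log-likelihood over every token in the target bag."""
    logp = log_softmax(logits)
    return -sum(logp[t] for t in target_ids) / len(target_ids)

# One superposed input position built from K=2 toy embeddings.
position = superpose([[1.0, 2.0], [3.0, 4.0]])

# Toy vocabulary of 5 tokens; the bag of K=2 targets is tokens 0 and 4.
vocab_logits = [2.0, 0.5, -1.0, 0.0, 1.0]
loss_bag = multi_hot_cross_entropy(vocab_logits, [0, 4])

# With K=1 the loss reduces to ordinary next-token cross-entropy.
loss_single = multi_hot_cross_entropy(vocab_logits, [0])
```

The K=1 case is the recovery-phase objective, which is why switching phases needs no change to the model or the loss code beyond the bag size.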
The asymmetric-training pattern is now a load-bearing primitive across the wiki. Lighthouse Attention (2026-05-12 retweet) trains with a removable subquadratic wrapper and deploys without it. MDN (2026-05-11) and UniPrefill (2026-05-11) train hybrid architectures and deploy with cheaper inference. D-OPSD (2026-05-07) uses asymmetric conditioning at training and symmetric inference. TST is the first to apply the same pattern to pre-training itself. Five papers, five different layers, one principle: pay the structural cost once during training and amortize it across all deployments.
Why it matters: 2.5x training-time reduction with no inference cost changes pre-training economics. The asymmetric-training pattern is also becoming a standard architectural primitive across the wiki, and this paper is the cleanest application yet.
Research angle: Three open problems. (1) Does the recovery phase need to scale with model size? At 10B-A1B the recovery is small; at frontier scale it may need to be larger or differently structured. (2) Composition with multi-token prediction at inference: if you pre-train with TST and then deploy with MTP, do the two efficiency techniques compose? (3) The bag-prediction objective is naturally compatible with MoE routing; whether TST changes the expert-utilization profile is unmeasured.
δ-mem: compact online associative state as low-rank attention correction
8x8 online state, delta-rule updated, produces low-rank corrections to a frozen full-attention backbone. No fine-tuning, no context extension. 1.31x on MemoryAgentBench.
Source: HuggingFace Daily Papers Links: Paper · Wiki Tier: 1. Long-context inference, compact online memory
frozen backbone (full attention)
│
│ standard attention compute
▼
┌─────────────────────────────────────┐
│ 8x8 online state S │
│ delta-rule update: │
│ S ← S + outer(k_t, v_t) − ... │
│ readout: low-rank correction │
│ to attention scores │
└─────────────────────────────────────┘
│
▼
output (gain: 1.10x avg, 1.31x on MemoryAgentBench)
The design is small and clean. A fixed-size state S (8 by 8 in the headline configuration, 64 scalars total) lives alongside the frozen backbone. As context streams in, S receives delta-rule associative updates from the key-value pairs at each step. At generation time, S is read out and produces a low-rank correction added to the standard attention computation. The backbone is unchanged; the correction is additive.
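A sketch of a delta-rule associative state at the headline 8x8 size. The update used here is the textbook delta rule, S ← S + β (v − S k) kᵀ, which is an assumption about the form; the source does not spell out the exact update, and how the readout is turned into an attention correction is also not shown. Keys and values are toy vectors.

```python
def matvec(S, x):
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]

def delta_update(S, k, v, beta=1.0):
    """Textbook delta rule (assumed form): S <- S + beta * (v - S k) k^T."""
    pred = matvec(S, k)                          # what S currently returns for k
    err = [v_i - p_i for v_i, p_i in zip(v, pred)]
    return [[S[i][j] + beta * err[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]

D = 8
S = [[0.0] * D for _ in range(D)]                # 64 scalars in total

k1 = [1.0] + [0.0] * (D - 1)                     # unit-norm toy keys
k2 = [0.0, 1.0] + [0.0] * (D - 2)
v1 = [float(i) for i in range(D)]
v2 = [1.0] * D
v3 = [-1.0] * D

S = delta_update(S, k1, v1)
S = delta_update(S, k2, v2)
read_k1 = matvec(S, k1)                          # recovers v1

S = delta_update(S, k1, v3)                      # rewrite the k1 association
read_k1_after = matvec(S, k1)                    # now returns v3
read_k2_after = matvec(S, k2)                    # k2 association is untouched
```

The error term is what distinguishes the delta rule from a purely additive outer-product memory: writing a new value for an existing key replaces the old association instead of stacking on top of it.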
The interpretation that makes this productive: the standard KV cache stores high-rank evidence over the full context (every token gets its own pair). The δ-mem state stores a low-rank running summary. The two cover different aspects of the same information. The cache preserves position-specific tokens that attention can retrieve given the right query. The running state captures associative patterns that the head's query-key dynamics cannot easily reach at long range. They are complementary, not redundant.
The composition with Make Each Token Count (2026-05-12) is the load-bearing read. Make Each Token Count selectively evicts cached tokens to reduce attention dilution. δ-mem retains the associative signal in a small online state. Two complementary mechanisms for the same underlying problem. The Make-Each-Token-Count paper's research-angle prediction asked about composition with low-bit cache quantization; δ-mem is a different composition, selective eviction plus compact online state, and the natural near-term experiment is whether the gains stack.
Why it matters: Compact-online-state-as-attention-correction is a new design point in the long-context efficiency stack. No re-pretraining, no fine-tuning, runs over frozen backbones, double-digit gains on memory-heavy benchmarks. If the technique transfers to production LLMs, the long-context cost floor drops on another axis.
Research angle: Three questions. (1) Is the 8x8 state large enough at frontier scale, or does the state need to scale with model size or context length? (2) Composition with MISA (2026-05-11, head-axis sparsified indexer) is untested. Both reduce attention dilution at different points in the pipeline. (3) Can the delta-rule update be replaced with a trained recurrent update? That would give a hybrid architecture as a retrofit step rather than a from-scratch design.
FocuSFT: attention sinks are a training-side problem
Attention sinks form during long-context SFT and starve content tokens of attention budget. Bilevel optimization with bidirectional context and causal response masking drops sink mass 529x.
Source: HuggingFace Daily Papers Links: Paper · Wiki Tier: 1. Long-context training, attention dilution, bilevel optimization
Standard long-context SFT           FocuSFT bilevel
─────────────────────────           ───────────────────────────
causal mask on context              bidirectional on context
causal mask on response             causal mask on response
positional bias + attention         inner loop: fast-weights form
sinks soak up budget                parametric memory that
                                    concentrates attention on
weak gradient on content tokens     semantically relevant content
model never learns to use           outer loop: SFT conditioned
long-context content                on sharpened representation
                                    +14 pts BABILong (4K-32K)
                                    529x sink-mass reduction
                                    3x context engagement
The diagnosis is that the long-context capability gap is not just an inference-side problem. Standard SFT causal-masks both context and response. The causal asymmetry on context creates positional sinks at the beginning of the sequence. Those sinks soak up attention probability mass, leaving content tokens with weak attention. Weak attention means weak gradient. Weak gradient on content tokens means the model never learns to use them.
FocuSFT runs two loops. The inner loop trains fast-weights on each training context to form a parametric memory that biases attention toward semantically relevant tokens. These fast-weights do not enter the deployed model. They exist only to sharpen the gradient signal the outer SFT loop sees. The outer loop runs standard SFT conditioned on the sharpened representation. Both loops use bidirectional attention on context tokens (preserving causal masking only on the response). The bidirectional move removes the structural cause of attention sinks during training.
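The masking change can be sketched as a boolean attention mask. This is an illustration, assuming a prefix-LM layout in which context tokens attend to all other context tokens but not to the response, and response tokens attend causally to the context and to earlier response tokens. `focus_mask` is a hypothetical helper, not the paper's code.

```python
def focus_mask(n_ctx, n_resp):
    """mask[i][j] is True when position i may attend to position j."""
    n = n_ctx + n_resp
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_ctx:
                # Context rows: bidirectional, but only within the context.
                mask[i][j] = j < n_ctx
            else:
                # Response rows: causal over context plus earlier response.
                mask[i][j] = j <= i
    return mask

mask = focus_mask(n_ctx=3, n_resp=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under a standard causal mask the first context token can only see itself; here every context token sees the whole context, which removes the positional asymmetry the paragraph above identifies as the structural cause of sinks.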
The result on BABILong is +14 points across 4K to 32K context lengths. On RULER CWE the model goes from 72.9 to 81.1 at 16K. On GPQA with agentic tool use, +24% relative on pass@1. The mechanistic readout is the 529x reduction in attention-sink mass and the tripling of context engagement during training.
The composition with Make Each Token Count (2026-05-12) is the productive read. Make Each Token Count argues that the full cache is not the ceiling because irrelevant tokens dilute attention. FocuSFT argues that long-context SFT bakes that dilution into the gradient signal in the first place. Two papers in two days, same diagnosis at training and inference. The two are not redundant: even with FocuSFT-trained models, run-time eviction still helps because some dilution is task-dependent. Three papers if you count today's Massive Activations ME Layer paper, which traces attention sinks one layer deeper to the RMSNorm + FFN parameters in a specific layer where the massive activation is born. The picture is now top-to-bottom: ME Layer births the sink (mechanism), FocuSFT prevents it during training, Make Each Token Count routes around it at inference.
Why it matters: Long-context capability has been the unresolved bottleneck for agentic and reasoning workloads. FocuSFT identifies the training-side cause and gives a concrete mechanism to fix it. The three-paper picture (ME Layer + FocuSFT + Make Each Token Count) is the cleanest top-to-bottom story on attention sinks the wiki has assembled.
Research angle: Three open problems. (1) The bilevel inner loop is expensive; the paper does not benchmark training cost against a strong long-context-SFT baseline at matched final quality. The right comparison is dollars per FocuSFT-quality model. (2) Composition with the ME Layer intervention is the natural simplification: if a one-shot ME Layer fix gets most of the gain, FocuSFT's bilevel structure may be overkill. (3) Composition with Make Each Token Count is the cleanest near-term experiment: training-side sharpening plus inference-side selection. If the gains stack, the long-context cost floor drops on both axes simultaneously.
SemiAnalysis Cerebras IPO: the fast-tokens economy
Past a capability threshold, developers prefer faster tokens to smarter tokens. SRAM-based machines win on interactivity-per-watt in a way HBM GPUs cannot match. Cerebras IPO is the market test.
Source: SemiAnalysis newsletter (Gmail-starred) Links: Newsletter · Wiki Tier: 1. Hardware-bounded inference, wafer-scale engines, deployment economics
The piece runs to the length of four normal articles and lays out the IPO-eve thesis in five sections: fast inference, WSE-3 (Cerebras's wafer-scale chip), CS-3 (the system), a BOM cost analysis, and the conditions under which the wafer wins. The frame for the wiki: this is the first long-form industry analysis that prices the "fast tokens" thread quantitatively. Until today the wiki has been tracking inference-capacity binding as a series of capital signals (Anthropic-Colossus 05-08, ByteDance $30B 05-08, Broadcom-OpenAI-Microsoft 05-10). This piece is the demand-side framing for why all those deals make sense.
The central claim, in the piece's own framing: "past a certain threshold of intelligence, developers prefer faster tokens to smarter tokens." Anthropic's Opus 4.6 Fast tier (6x the standard price for 2.5x interactivity, now degraded to 1.75x as load has grown) is the revealed-preference data point. The OpenAI 750MW Cerebras compute deal is the supply-side validation: OpenAI is willing to pay tens of billions for capacity on a hardware platform that does not match GPU throughput precisely because it dominates on interactivity-per-watt. The piece walks the WSE-3 BOM in detail and argues that the SRAM-per-FLOP ratio fundamentally shifts the energy ceiling for fast inference.
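The revealed-preference numbers can be read as a price paid per unit of speed. The ratio below is an illustrative framing, not a metric from the piece; the inputs are the 6x price and the 2.5x and 1.75x interactivity figures quoted above.

```python
def premium_per_unit_speed(price_multiple, speed_multiple):
    """Price multiple paid for each 1x of interactivity (illustrative ratio)."""
    return price_multiple / speed_multiple

at_launch = premium_per_unit_speed(6.0, 2.5)     # 6x price for 2.5x speed
degraded = premium_per_unit_speed(6.0, 1.75)     # same price, 1.75x speed

print(f"launch: {at_launch:.2f}x, degraded: {degraded:.2f}x")
```

At the quoted figures the premium is 2.4x per unit of interactivity at launch and about 3.43x after the degradation, so a fast tier that stays in demand as the speedup shrinks makes the revealed preference stronger, not weaker.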
The wiki has been tracking the capacity-binding-constraint thread for two months. Every deal in the sequence (ByteDance, Broadcom-OpenAI, Anthropic-Colossus, Cerebras IPO today) is implicitly the same thesis: token rate at quality is bounded by something other than peak FLOPS. The SemiAnalysis piece names that something explicitly as interactivity-per-watt. The natural near-term consequence is a benchmark: InferenceMax with watts, measuring energy-per-token at fixed quality across NVIDIA, Cerebras, Groq, AMD MI300, and TPUv5e. Whoever ships it sets the evaluation standard for the next two years of inference research.
The routing implication is direct. The wiki's tracked routing systems (TraceR 04-17, Netflix State of Routing 05-08, CARE 05-11, Sakana Conductor 05-11) all optimize on latency or accuracy or both. None of them optimize on energy-per-token at quality. If the fast-tokens thesis holds, the routing literature gets rewritten on an energy objective in the next 90 days.
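What routing on an energy objective could look like, as a minimal sketch. Everything here is hypothetical: the `Backend` fields, the backend names, and the numbers are placeholders for illustration, not measurements of any real hardware. Energy per token is taken as power draw divided by token rate.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    watts: float               # placeholder power draw while serving
    tokens_per_second: float   # placeholder per-user decode rate
    quality: float             # placeholder quality score in [0, 1]

    @property
    def joules_per_token(self):
        return self.watts / self.tokens_per_second

def route(backends, min_quality, min_tokens_per_second):
    """Pick the lowest-energy backend that meets quality and speed floors."""
    eligible = [b for b in backends
                if b.quality >= min_quality
                and b.tokens_per_second >= min_tokens_per_second]
    if not eligible:
        return None
    return min(eligible, key=lambda b: b.joules_per_token)

backends = [
    Backend("backend_a", watts=1000.0, tokens_per_second=100.0, quality=0.90),
    Backend("backend_b", watts=2000.0, tokens_per_second=500.0, quality=0.90),
    Backend("backend_c", watts=500.0, tokens_per_second=40.0, quality=0.95),
]

fast_pick = route(backends, min_quality=0.85, min_tokens_per_second=80.0)
quality_pick = route(backends, min_quality=0.93, min_tokens_per_second=30.0)
no_pick = route(backends, min_quality=0.99, min_tokens_per_second=30.0)
```

The design choice is that latency and quality become constraints and energy becomes the objective, which inverts the usual setup where latency or accuracy is minimized directly.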
Why it matters: This is the deepest industry analysis of inference-capacity economics the wiki has tracked. The piece sets up at least two Tier 1 research directions (energy benchmark, energy-aware routing) and prices the demand-side thesis that has been driving every capacity deal in the last two months.
Research angle: The piece is industry analysis, not research. The research agenda it sets up: (1) InferenceMax with watts as a measured benchmark; (2) routing-as-energy-allocation: re-derive existing routing systems under an interactivity-per-watt objective; (3) PUE-conditioned pricing: do API prices start to correlate with regional grid carbon intensity by 2027?
Industry Pulse
- Anthropic overtakes OpenAI in B2B for the first time (The Decoder). 34.4% vs 32.3% on the Ramp AI Index. Anthropic quadrupled in one year. Same-day Claude for Small Business launch with 15 agentic workflows (QuickBooks, PayPal, HubSpot) and a ten-city US workshop tour. The lead may erode under cost pressure or OpenAI's Microsoft channel; the structural shift is real. Pairs with the Claude Platform on AWS GA from last week. → Wiki summary
- Recursive emerges from stealth with $650M (The Decoder). Second high-profile RSI-and-experience-RL lab to surface in two weeks, after Ineffable Intelligence. RSI is now a venture-funded category. The technical foundation (G-Zero 05-12, RLRT 05-12, today's Sparse-to-Dense Reward Principle) is consolidating fast. The Many Faces paper today flags a foundational risk: OPSD-style self-improvement only works when the privileged information is a shared latent rule. Whether RSI extends to instance-specific problem-solving is the open question. → Wiki summary
- Google hires hundreds of customer engineers (The Decoder). Mirrors OpenAI Deployment Company (05-11) and Anthropic's Claude for Small Business today. Deployment-services category is now real across all three frontier labs in one week.
- Tencent ramps AI spending; China component shortages (Tencent, shortages). Tencent's Q1 was strong and the company plans a capex ramp; it is reportedly in talks for a DeepSeek stake. At the same time, Bloomberg reports that Chinese AI hardware suppliers cannot keep up with demand because of component scarcity. The capacity-binding-constraint thread continues; the Tencent move says domestic chip supply is improving enough to bet on.
- Meta AI Incognito Chat (The Decoder). Server-side enclave processing, ephemeral history, Zuckerberg claims first-lab status for this level of private AI usage. Pairs with the Anthropic Cyber and private-deployment thread that has been running.
- Luma Uni-1.1 image API (The Decoder). $0.04 per image at 2,048px, Arena leaderboard rank 3 behind Google and OpenAI. Web search, built-in reasoning, up to 9 reference images. Frontier image-API competition is tightening from 2 providers to 3.
- DeepMind Pointer Engineering (The Decoder). Reframes the mouse cursor as the key context-engineering variable for Gemini Intelligence on Googlebook. The interface-as-context move is now explicit on the Google side.
- TLDR AI: Opus 4.7 Fast, Qwen Image 2.0, serverless GPUs (TLDR AI). Standard newsletter roundup. Opus 4.7 Fast is the latest entry in the fast-tier story that the SemiAnalysis piece prices in detail.
Connecting the Dots
Research papers                        Industry / market
───────────────                        ──────────────────
Sparse-to-Dense (allocation)     ─►    Recursive $650M for RSI
Many Faces (failure taxonomy)     │    depends on OPSD-style
        │                         │    self-improvement working
        │                         │    Many Faces says this works
        ▼                         │    only for shared-rule PI
OPD theory layer complete ────────┘

FocuSFT (training-side sinks)
Make Each Token Count (inference)
Massive Activations ME Layer (mechanism)
        │
        ▼
Top-to-bottom story on attention sinks

SemiAnalysis Cerebras ─────────►       Anthropic B2B lead (34.4%)
(interactivity-per-watt thesis)   │    Opus 4.6 Fast = revealed
        │                         │    preference for fast tokens
        ▼                         │
Energy-bound inference            │
(formal version lands 05-14)      │
                                  ▼
                                  Three labs, deployment-services week
                                  Anthropic + OpenAI + Google
Cross-paper thread #1: the OPD theory layer is now complete. Eight papers in two months (six earlier, two today) have established the empirical and now theoretical structure of on-policy distillation. TIP (04-16) said only 10% of distillation tokens carry signal (token-level). LongAct (04-18) extended the same principle to RL gradients. CoPD (05-01) made the distillation step parallel-experts with bidirectional OPD. D-OPSD (05-07) used conditioning asymmetry as the neutral channel. RLRT and G-Zero (05-12) extracted value from the teacher-student delta two different ways. Today's Sparse-to-Dense paper adds the allocation rule (where to spend scarce labels first), and the Many Faces paper adds the failure taxonomy (three mechanisms when the dense step breaks). The Extrapolation Cliff (lands 05-14) will add the closed-form operating-point bound. Nine papers, one layered theory: allocate (Sparse-to-Dense), execute (TIP / LongAct / CoPD / D-OPSD), exploit deltas (RLRT, G-Zero), avoid known failure modes (Many Faces), bound aggressiveness (Cliff). The empirical era is becoming the textbook era.
Cross-paper thread #2: the attention-sink story is now top-to-bottom. Three papers in two days form a complete diagnosis. The Massive Activations ME Layer paper identifies a single layer in each model family where the massive activation token is born (RMSNorm + FFN parameters jointly produce it). FocuSFT shows that the same massive activation creates attention sinks during long-context SFT, where positional biases starve content tokens of attention budget and weaken their gradient signal. Make Each Token Count (05-12) showed that even at inference, the resulting attention dilution is severe enough that selective eviction beats the full cache. Mechanism (ME Layer), training-time damage (FocuSFT), inference-time mitigation (Make Each Token Count). Anyone working on long-context capability now has a complete causal chain to design against.
Cross-paper thread #3: rubric reward modeling has a hacking surface. The 12-May digest's Worth Watching predicted multimodal rubric overfitting in 60 days. The Reward Hacking paper today resolves the prediction in 24 hours. Three named failure modes (compound-criterion partial satisfaction, implicit-as-explicit, imprecise topical matching). Cross-source: Kurate cs.LG #9 this week is Helff et al.'s "LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking" (ai_rating 6.8/10), an independent paper making the same diagnosis from a different angle. Two papers in one week, two independent groups, same conclusion: the rubric-as-reward thread is real and useful but introduces a new class of reward-hacking that scalar RMs did not have. The next paper to ship is the meta-rubric designed to catch all three named failure modes.
Cross-paper thread #4: SemiAnalysis is the demand-side framing for the capacity binding constraint. The wiki has tracked four capacity deals in two weeks (ByteDance $30B 05-08, Anthropic-Colossus 05-08, Broadcom-OpenAI-Microsoft 05-10, NVIDIA $40B 05-11) plus today's Cerebras IPO. The deals are the supply side. SemiAnalysis is the demand side: developers and labs are willing to pay 6x for 2.5x interactivity past a capability threshold. The Opus 4.6 Fast premium and the OpenAI-Cerebras 750MW deal are revealed preferences. The arXiv-side formalization (Energy-to-Token position paper) lands 05-14, 18 hours later. Two independent sources arriving at the same conclusion within a day is itself signal that the energy-bounded inference frame is consolidating.
Cross-paper thread #5: interpretability moves toward locatability. The Massive Activations ME Layer paper, Probe&Prefill, and Compliance vs Sensibility (05-02) all locate actionable behavioral structure in shallow, intervenable representations. Single layer for massive activations. Linear hidden-state direction for tool necessity (AUROC 0.89-0.96). Linear direction in activation space for reasoning mode. The pattern is consistent across three papers in 11 days. Production-relevant interpretability is moving from "find features" to "install features" to "probe features cheaply at inference." The Kazemi refusal-neuron retweet (05-12, single MLP neuron bypasses safety alignment across 7 dense transformers) is the safety-side variant of the same pattern. Four results, one direction: behavior is locatable in shallow representations, and the locations are deployable.
Reddit practitioner-side signal. The r/LocalLLaMA top posts for the day include Needle (26M parameter attention-only model distilled from Gemini for function calling, runs at 6000 prefill / 1200 decode tok/s on consumer devices; "tool calling is retrieval-and-assembly, not reasoning") and Gemma 4 MTP vs DFlash benchmarks on 1x H100 (MTP 3.11x and DFlash 3.03x faster than baseline on Gemma 4 31B dense; DFlash flips ahead on Gemma 4 26B-A4B MoE). Needle is the practitioner-side confirmation of Probe&Prefill: if tool-necessity is linearly decodable from hidden state, you do not need FFN capacity for the decision. The Gemma 4 MTP benchmark is the practitioner-side complement to today's speculative-decoding research: roughly 3x speedups with MTP on a single H100 are the real-world numbers behind the wiki's NeMo-RL speculative-decoding entries.
Worth Watching
- Sparse-to-Dense rule beyond verifiable math, 90 days. The bridge (forward-KL warmup + OPD) is validated on MATH and AIME. Whether the same allocation rule survives in rubric-based RL is the open question. Today's Reward Hacking paper says rubric verifiers have specific failure modes; the bridge step uses an OPD-style verifier implicitly. Falsifiable: a paper that applies Sparse-to-Dense to a non-verifiable domain (medical reasoning, creative writing) and reports whether the bridge advantage holds.
- OPSD identifiability criterion, 90 days. The Many Faces paper's third failure mode argues OPSD works only when the privileged information is a shared latent rule. The formal criterion is that the conditioning distribution induces the same marginal as unconditioned learning would. A paper that proves this rigorously and gives a test for whether a given piece of privileged information satisfies it would close the OPD theory loop.
- Composition of FocuSFT and ME Layer intervention, 60 days. The ME Layer paper offers a one-shot training-free fix that reduces the same sink mass that FocuSFT addresses with bilevel optimization. If the ME Layer fix gets most of the gain, FocuSFT's expensive inner loop is overkill. Falsifiable: a paper that benchmarks both on BABILong and RULER and reports the cost-quality tradeoff.
- InferenceMax with watts, 90 days. The SemiAnalysis piece argues the binding constraint for inference is interactivity-per-watt. Whoever ships a benchmark that measures energy-per-token at fixed quality across NVIDIA, Cerebras, Groq, AMD MI300, and TPUv5e sets the evaluation standard for two years. Falsifiable: such a benchmark, with stable methodology, by Q3 2026.
- TST at frontier scale, 120 days. Token Superposition Training claims 2.5x at 10B-A1B. Whether a non-paper-author lab reproduces it at 70B or higher determines whether this becomes standard practice. Falsifiable: a public report from a frontier-scale training run using TST with comparable speedup.
- Meta-rubric for the three named failure modes, 60 days. Reward Hacking in Rubric-Based RL gives the failure taxonomy (compound-criterion partial satisfaction, implicit-as-explicit, imprecise topical matching). The next paper should propose a meta-rubric whose criteria specifically catch each of the three failure modes. Falsifiable: a paper that applies rubric RL with the meta-rubric and shows reduced verifier exploitation at matched accuracy.
- LLM-rated underrated from Kurate cs.LG #9 (Helff sycophancy/gaming). "LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking" (Helff et al., arXiv 2604.15149, ai_rating 6.8/10, cs.LG #9). This is the cross-source confirmation paper for today's Reward Hacking in Rubric-Based RL. Two independent papers, one week, same diagnosis. Worth reading as the LLM-rated complement.
- LLM-rated underrated from Kurate cs.AI #5 and #9. "AI scientists produce results without reasoning scientifically" (ai_rating 8.5/10, cs.AI #5) and "IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures" (ai_rating 7.8/10, cs.AI #9). The first is the responsible-ai measurement-crisis paper, complement to Soohak from 05-12. The second is the deployment-relevant pre-registered evidence that safety measures sometimes harm; a deployment lead should read it before shipping a safety filter.
- Rising authors from Kurate: no authors crossed threshold this week. No new handles to add to connectors/twitter/config.json:ai_handles.
Quick Hits
Pion optimizer (arXiv 2605.12492). Spectrum-preserving optimizer via orthogonal equivalence transformation. Unlike Adam/Muon, it updates each weight matrix via left/right orthogonal transformations that preserve singular values. The result is a stable, competitive alternative to standard optimizers for both LLM pre-training and fine-tuning. Tier 2 architecture-aware optimization, worth tracking if the stability claim holds at scale.
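The invariant behind the Pion entry can be checked on a toy case. The sketch below is not Pion's update rule (the digest does not give it); it only illustrates the linear-algebra fact the entry relies on, that multiplying a weight matrix by orthogonal matrices on the left and right changes the matrix but leaves its singular values untouched. The 2x2 matrix, the rotation angles, and all helper names are illustrative assumptions.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def rotation(theta):
    """A 2x2 rotation matrix, which is orthogonal by construction."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def singular_values_2x2(w):
    """Closed-form singular values: square roots of the eigenvalues of W W^T."""
    (a, b), (c, d) = w
    p, q, r = a * a + b * b, c * c + d * d, a * c + b * d
    root = math.sqrt((p - q) ** 2 + 4 * r * r)
    return (math.sqrt((p + q + root) / 2),
            math.sqrt(max((p + q - root) / 2, 0.0)))

# Hypothetical weight matrix and hypothetical orthogonal factors.
W = [[3.0, 1.0], [0.5, 2.0]]
U, V = rotation(0.3), rotation(-1.1)

# Orthogonal equivalence transformation: W_new = U W V^T.
W_new = matmul(matmul(U, W), transpose(V))

before = singular_values_2x2(W)
after = singular_values_2x2(W_new)
print("entries changed:", W_new != W)
print("singular values before:", before)
print("singular values after: ", after)
```

The entries of the matrix move while the spectrum does not, which is the sense in which such an update is spectrum-preserving.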
Massive Activations ME Layer (arXiv 2605.08504). Identifies a single layer in each model family (the Massive Emergence Layer) where attention-sink-producing massive activations are born by joint RMSNorm + FFN action. Once formed, the massive activation token representation remains largely invariant across deeper layers, reducing representational diversity. A simple intervention reduces this rigidity and improves performance in both training-free and fine-tuning settings, selectively weakening attention sinks at the hidden-state source. Tier 2 interpretability with Tier 1 long-context intersection. → summary
Probe&Prefill (LLM agents already know when to call tools) (arXiv 2605.09252). When2Tool benchmark across 18 environments. Linear probe on pre-generation hidden state predicts tool necessity at AUROC 0.89-0.96, substantially exceeding the model's own verbalized reasoning. Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss. The "verbalized reasoning is a degraded readout of internal state" claim is the structurally interesting part. → summary
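A minimal sketch of what "linear probe on hidden state, scored by AUROC" means in practice. Everything here is an assumption for illustration: the hidden states are synthetic Gaussians rather than real activations, the probe is a simple mean-difference direction rather than whatever the paper trains, and the zero threshold for gating tool calls is arbitrary. The prefill half of Probe&Prefill is not modeled.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sample(rng, n, dim=8):
    """Synthetic 'hidden states': label 1 shifts coordinate 0 by +1, label 0 by -1."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += 1.0 if y == 1 else -1.0
        data.append((h, y))
    return data

def fit_mean_difference_probe(data):
    """Linear probe direction w = mean(h | y=1) - mean(h | y=0)."""
    dim = len(data[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for h, y in data:
        counts[y] += 1
        for i, v in enumerate(h):
            sums[y][i] += v
    return [sums[1][i] / counts[1] - sums[0][i] / counts[0] for i in range(dim)]

def probe_score(w, h):
    """Projection of a hidden state onto the probe direction."""
    return sum(wi * hi for wi, hi in zip(w, h))

rng = random.Random(0)
train, test = sample(rng, 400), sample(rng, 400)
w = fit_mean_difference_probe(train)

scores = [probe_score(w, h) for h, _ in test]
labels = [y for _, y in test]
test_auroc = auroc(scores, labels)

# Gate: only allow a tool call when the probe says one is needed.
needs_tool = [s > 0.0 for s in scores]
print(f"held-out AUROC: {test_auroc:.3f}, tool calls allowed: {sum(needs_tool)}/{len(needs_tool)}")
```

The probe is a single dot product per decision, which is why this kind of readout is cheap enough to run before generation.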
Useful Memories Become Faulty (arXiv 2605.12978). LLM-rewritten agent memory degrades over consecutive updates. GPT-5.4 fails on 54% of ARC-AGI problems it had previously solved without memory after consolidation. Episodic-only retention beats forced consolidation. Strong case against the dominant "consolidate-everything" memory design pattern. → summary
LongMemEval-V2 (arXiv 2605.12493). 451 questions on environment-specific agent experience (static state recall, dynamic state tracking, workflow knowledge, environment gotchas, premise awareness). AgentRunbook-C (store trajectories as files, invoke coding agent in sandbox) reaches 72.5% vs 48.5% best RAG baseline. Coding-agent retrieval is the new Pareto front for accuracy at long-horizon agent memory. Composes cleanly with Useful Memories' raw-retention recommendation. → summary
Agent-BRACE (arXiv 2605.11436). Decouples agent into belief-state model and policy model trained jointly via RL. The belief is a structured set of atomic natural-language claims with verbalized ordinal certainty labels. Policy conditions on compact belief, not full history. +14.5 pp on Qwen2.5-3B and +5.3 pp on Qwen3-4B at near-constant context length. The first clean LLM-native belief-state representation in the wiki. → summary
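One way to picture the belief state described above is as a small typed structure that is rendered into the policy's prompt in place of the full history. This is a sketch under stated assumptions, not the paper's schema: the four certainty labels, the `Claim` and `render_belief` names, and the `max_claims` cap are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum

class Certainty(IntEnum):
    """Hypothetical ordinal certainty scale; the paper's actual labels may differ."""
    UNLIKELY = 0
    POSSIBLE = 1
    LIKELY = 2
    CERTAIN = 3

@dataclass(frozen=True)
class Claim:
    """One atomic natural-language claim with a verbalized certainty label."""
    text: str
    certainty: Certainty

def render_belief(claims, max_claims=8):
    """Render the most certain claims as compact text for the policy to condition on.

    The cap keeps the rendered belief at a bounded length regardless of
    how long the interaction history is (the cap itself is an assumption).
    """
    ranked = sorted(claims, key=lambda c: c.certainty, reverse=True)[:max_claims]
    return "\n".join(f"[{c.certainty.name.lower()}] {c.text}" for c in ranked)

belief = [
    Claim("the door to the lab is locked", Certainty.LIKELY),
    Claim("the key is in the desk drawer", Certainty.CERTAIN),
    Claim("the guard patrols every ten minutes", Certainty.POSSIBLE),
]
print(render_belief(belief, max_claims=2))
```

Because certainty is ordinal rather than numeric, the policy sees a ranking it can act on without the belief model having to emit calibrated probabilities.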
RubricEM (arXiv 2605.10899). Rubric-guided meta-RL for deep-research agents. Stagewise policy decomposition (plan / evidence / review / synthesize), each stage conditioned on self-generated rubrics. Stage-Structured GRPO for credit assignment. RubricEM-8B approaches proprietary deep-research systems on four long-form benchmarks. Today's Reward Hacking paper is the diagnostic note: even rubric RL has reward-hacking surfaces; RubricEM is the production-scale version of the technique under stress test.
LoopUS (arXiv 2605.11011). Post-training framework that converts a standard pretrained LLM into a looped latent-refinement architecture (encoder + looped block + decoder) with selective gating, random deep supervision, and confidence-based early exit. Reasoning gains without extending generated traces or recurrent pretraining. Tier 2 latent-refinement architecture, complement to the looped-test-time-compute thread.
Multi-Stream LLMs (arXiv 2605.12460). Replaces single-stream chat format with multiple parallel input/output streams. Model can read while writing, think while acting, separate roles into independent streams. Improves efficiency, security (separation of concerns), and monitorability. Tier 2 agentic-architecture proposal.
Beyond Reasoning: RL Unlocks Parametric Knowledge (arXiv 2605.07153). Zero-shot closed-book QA with no CoT, binary correctness rewards, fact-level train-test deduplication. RL yields ~27% average relative gain. Mechanistically, RL redistributes probability mass over existing knowledge (moving rare correct answers from the low-probability tail to greedy generation) rather than acquiring new facts. The hardest examples (answers never appearing in 128 pre-RL samples) drive 83% of the gain. Repositions RL as a tool for unlocking, not acquiring, latent parametric knowledge.
Missing Old Logits in Asynchronous Agentic RL (arXiv 2605.12070). Asynchronous RL pipelines lose the historical training-side logits needed for proper PPO off-policy correction. Three exact recovery strategies (snapshot tracking, dedicated old-logit model, partial rollout sync) and an approximate PPO-EWMA fix. Practical infrastructure paper for any team running long-horizon agentic RL.
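Why the old logits matter is easiest to see in the standard PPO clipped surrogate, where the importance ratio is computed from the log-probability the behavior policy assigned at sampling time. The sketch below is the textbook per-token term, not any of the paper's recovery strategies; the probabilities and advantage are made-up numbers.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Textbook PPO clipped surrogate for one token.

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# Hypothetical token: the sampling-time policy gave it p=0.2, the current policy gives p=0.4.
logp_old = math.log(0.2)
logp_new = math.log(0.4)
advantage = 1.0

# With the true old log-prob, the ratio is 2.0 and the clip caps the term at 1 + eps.
with_old = ppo_clip_term(logp_new, logp_old, advantage)

# If the old log-prob is lost and the current one is substituted, the ratio collapses
# to 1.0 and the off-policy correction silently disappears.
without_old = ppo_clip_term(logp_new, logp_new, advantage)

print("with true old logits:   ", with_old)
print("with substituted logits:", without_old)
```

The two calls differ only in which old log-probability is available, which is the gap an asynchronous pipeline has to close by storing, recomputing, or approximating it.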
Learning, Fast and Slow (arXiv 2605.12484). Fast-Slow Training (FST): model parameters as "slow weights," optimized context as "fast weights." Fast weights absorb task-specific information via textual feedback; slow weights stay close to base. Up to 3x more sample-efficient than RL-only across reasoning tasks; 70% less KL divergence to base model; preserves plasticity for subsequent tasks. Continual-learning angle.
Teaching Language Models to Think in Code (ThinC) (arXiv 2605.07237). Tool-integrated reasoning where code (not NL) is the reasoner. NL plans briefly; all reasoning unfolds through code blocks. ThinC-4B outperforms every TIR baseline on five competition-math benchmarks and even surpasses Qwen3-235B-A22B-Thinking. 99.2% of final answers grounded in interpreter output.
Geometric Factual Recall in Transformers (arXiv 2605.12426). Subject embeddings encode linear superpositions of attribute vectors, and the MLP acts as a relation-conditioned selector via ReLU gating rather than as an associative key-value mapping. Logarithmic embedding dimension suffices for memorization. Cross-references the MIT superposition scaling laws (05-03). Clean theoretical result on the geometry of factual memorization.
Recursive emerges from stealth with $650M (The Decoder). Second high-profile RSI lab in two weeks. RSI is a venture-funded category now. The Many Faces paper today flags a foundational technical risk for RSI: OPSD-style self-improvement only works when the privileged information is a shared latent rule. The category and the technical foundation are moving in parallel. → summary
r/LocalLLaMA practitioner reports. Needle 26M tool-calling distilled from Gemini (attention-only architecture, runs at 6000 prefill / 1200 decode tok/s on consumer devices; "tool calling is retrieval-and-assembly, not reasoning") confirms Probe&Prefill's mechanistic claim from the practitioner side. Gemma 4 MTP vs DFlash on 1x H100 (MTP 3.11x, DFlash 3.03x faster on 31B dense; DFlash flips ahead on 26B-A4B MoE) is practitioner-side confirmation of the speculative-decoding thread, on datacenter rather than consumer hardware. MagicQuant v2.0 is a hybrid GGUF quant mixer that learns from Unsloth and other model quant patterns, specific to Qwen3.6 27B's weird quant-sensitivity profile. The llama-eval PR by ggerganov adds a llama.cpp evaluation harness as an official example.
Pragmatic Engineer with Anders Hejlsberg (newsletter). Anders on TypeScript, C#, Turbo Pascal. The "training-data volume is what makes AI great at TypeScript and Python" observation is the developer-tooling-side complement to the deployment-services week.
Algorithmic Bridge: Pangram near-perfect FPR for AI slop (Algorithmic Bridge). 1-in-10,000 FPR on test documents, 1-in-100,000 on arXiv held-outs. Used to claim 21% of ICLR 2026 reviews are fully AI-generated. The author concedes Pangram's FNR is undermeasured but argues FPR-optimized stance is the structurally honest one. Tier 2 responsible-ai signal.
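The argument for an FPR-optimized detector can be made concrete with a little arithmetic. The sketch below assumes the 21% figure is the share of reviews the detector flagged and that the 1-in-10,000 FPR carries over to reviews; both are assumptions, and the FNR values are hypothetical since the digest says the real one is undermeasured. Under those assumptions, the flagged rate decomposes as prevalence * (1 - FNR) + (1 - prevalence) * FPR, which can be solved for prevalence.

```python
def implied_prevalence(flagged_rate, fpr, fnr):
    """Solve flagged = prev * (1 - fnr) + (1 - prev) * fpr for prev."""
    return (flagged_rate - fpr) / (1.0 - fnr - fpr)

FLAGGED = 0.21   # share of reviews flagged, taken from the digest
FPR = 1e-4       # 1-in-10,000 false positive rate, taken from the digest

# Hypothetical miss rates; the true FNR is not reported here.
for fnr in (0.0, 0.1, 0.3):
    prev = implied_prevalence(FLAGGED, FPR, fnr)
    print(f"FNR={fnr:.1f} -> implied true prevalence {prev:.4f}")

# False positives can account for at most FPR of all documents,
# so the best case for skeptics is the FNR=0 row.
floor = implied_prevalence(FLAGGED, FPR, 0.0)
```

With an FPR this small, false positives explain almost none of the flagged share, and any undermeasured FNR only pushes the implied prevalence upward. That is the sense in which the flagged rate acts as an approximate lower bound.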
Ken Huang's Agentic AI Harness Pattern (Substack). 10 new agentic patterns (cost & token accounting, cancellation, slash commands, working-directory resolution, trajectory compression, terminal UI, migrations, plugin discovery, specialized subagents, credential lifecycle). Production-agent plumbing catalog. Useful reference for anyone shipping agents.
Simon Willison: CSP Allow-list, Datasette blog launch (CSP, Datasette blog). CSP experiment is the operational primitive for sandboxed-iframe AI agents that need network egress. Datasette blog launch is a small note: Simon used Codex desktop and the Markdown session-transcript export feature.
Twitter signal: AI Co-Mathematician (DeepMind) hits 48% on FrontierMath Tier 4 (@dair_ai retweet, arXiv 2605.06651). New high among AI systems on the hardest math tier. Asynchronous stateful workbench for mathematicians; ideation, literature discovery, computational analysis, theorem verification, knowledge development. Already in wiki via the 2026-05-09 entry; today's tweet is the public announcement.
Twitter signal: SlimQwen MoE pruning + distillation (@bayesiansapien retweet, arXiv 2605.08738). Pruning a pretrained MoE consistently outperforms training the target architecture from scratch at the same budget. Combining KD with LM loss outperforms KD alone, especially on knowledge-intensive tasks. Progressive pruning schedules beat one-shot. Practical recipe for MoE compression in production.
Twitter signal: Mira Murati's interaction models (retweet). "A new class of model trained from scratch to handle real-time interaction natively, instead of gluing it onto a turn-based one." Light on specifics; worth tracking once a paper drops.
NPM/PyPI supply-chain attacks (TanStack, Mini Shai-Hulud). 42 @tanstack/* packages compromised; spread to OpenSearch, Mistral AI, Guardrails AI, UiPath on PyPI. Malware specifically targets AI developer tooling (Claude Code settings.json and VS Code tasks.json). Not a research signal but a deployment-side reminder that AI dev tooling is now a high-value attack surface.
Sources ingested today: HF (71 papers, 11 Tier 1/2 summarized), RSS (17 posts for 2026-05-13 including SemiAnalysis Cerebras long-form), Gmail (no new starred this date), Twitter morning slot (35 tweets / 18 articles, 20 retweets) + afternoon (1 tweet) + evening (3 tweets), Kurate cs.AI + cs.LG weekly leaderboards (no rising authors crossed threshold), Reddit (8 subs, ~15 posts after filters with r/LocalLLaMA + r/CUDA + r/MachineLearning the highest signal) | Wiki pages updated: 11 (5 Tier 1 summaries: Sparse-to-Dense, Many Faces, Token Superposition, δ-mem, FocuSFT; 1 Tier 1 industry: SemiAnalysis Cerebras IPO; 6 Tier 2 summaries: Reward Hacking in Rubric RL, Massive Activations ME Layer, Useful Memories, LongMemEval-V2, Agent-BRACE, Probe&Prefill; 2 Tier 2 industry: Recursive $650M, Anthropic B2B lead; 3 concept-page updates: kv-cache.md, knowledge-distillation.md, rl-for-llms.md)