May 12, 2026 · daily digest

cere-bro | 2026-05-12

Eviction stopped being a compression tradeoff today. The full KV cache is no longer the ceiling, and three other papers attack the same problem of "concentrate the budget where the signal lives" from the training side.


TL;DR


The Big Picture

The thread tying today's research together is "concentrate the budget where the signal lives." The wiki has been tracking it for two months under different names. On the cache layer, Make Each Token Count gives the strongest version yet: selective eviction beats the full cache because most cached tokens dilute attention rather than carry it. On the training layer, RLRT and G-Zero both extract value from a previously discarded signal (teacher-student deltas, hinted-vs-unhinted predictive shifts). On the merging layer, Geometry Conflict explains forgetting as covariance misalignment between the new task's update and the current model state, and Model Merging Scaling Laws shows that merging gains fall roughly as 1/k in the number of experts. Six papers in two months (TIP, LongAct, Compliance vs Sensibility, Safety Drift, RLRT today, G-Zero today) make the same architectural claim from different layers: training and inference both operate on locatable substructure, and the right move is to find that substructure and route the budget there. The flat-update era is ending.

The cross-paper story on multimodal alignment is the second-strongest signal of the day. Auto-Rubric, DeltaRubric, and (from April) RationalRewards are now three independent papers building the same architectural primitive: factorize preference into inspectable rubrics, evaluate per dimension, recombine. Themis (2026-05-04) made the same diagnosis for code RMs. The pattern is four papers strong across text, code, and multimodal domains in under four weeks. Scalar reward models for non-trivial generation are being phased out. The follow-up question is whether the rubric layer is itself overfittable, which is a different reward-hacking surface from scalar collapse.

The third thread is the gap between research-level capability and refusal on ill-posed input. Soohak puts a number on it: no model exceeds 50% on the refusal subset, a result the AI Snake Oil tradition would call "the new measurement crisis." It is the same diagnosis the Kurate cs.AI #5 paper ("AI scientists produce results without reasoning scientifically") made for general scientific reasoning. Two independent benchmarks in two weeks find that frontier models confidently answer questions they should refuse. That is not a benchmark gap; it is a post-training omission. The standard recipe is reward-on-correct, penalize-on-wrong, ignore-on-ill-posed. Soohak's refusal subset is the trainable surface that turns it into a three-class signal.


Deep Dives


Make Each Token Count: learned KV eviction that beats the full cache

Full-cache attention is not the ceiling. In long contexts, irrelevant tokens dilute attention away from useful evidence, and selective eviction improves generation rather than approximating it.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 1. KV cache, long-context inference, learned eviction

Standard KV eviction:               Make-Each-Token-Count:
  per-layer policy ─► local score     per-entry retention gate ─► utility
  no cross-layer comparison           shared final scoring projection
  budget = local cap per layer        ──► calibrates across layers/heads
                                      single global memory budget
  Goal: approximate full cache        tokens compete across modalities

  Best case = full-cache accuracy     Result: surpasses full-cache
                                      (attention dilution drops as
                                       irrelevant tokens evict first)

The framing flip is load-bearing. Until today, every KV-eviction paper in the wiki (TurboQuant 04-22, MISA 05-11, Stream-T1 05-07) framed eviction as a compression-quality tradeoff and aimed to stay close to the full cache. This paper argues the full cache itself is a noisier oracle than a selective one once context is long enough, because attention dilution from irrelevant tokens is a real cost. That turns eviction from memory-saving into a reasoning-quality intervention.
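
The dilution argument can be illustrated with a toy softmax. This is a minimal sketch, not the paper's analysis: it assumes one "evidence" token with a fixed logit advantage over identical distractor tokens (the function name and the logit values are hypothetical) and shows how the attention weight on the evidence shrinks as distractors accumulate.

```python
import numpy as np

def attention_mass_on_evidence(n_distractors, evidence_logit=4.0, distractor_logit=0.0):
    """Softmax weight on one evidence token competing with n_distractors tokens.

    Logit values are illustrative assumptions, not measurements.
    """
    logits = np.concatenate(([evidence_logit], np.full(n_distractors, distractor_logit)))
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return float(weights[0])

# The evidence token's share of attention falls as irrelevant tokens pile up.
for n in (10, 100, 1000, 10000):
    print(n, round(attention_mass_on_evidence(n), 4))
```

Under these assumptions, evicting distractors raises the evidence token's weight without changing its logit, which is the intuition behind eviction as a quality intervention.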

The mechanism has three pieces. Retention gates per cached entry produce utility scores, with geometric retention as the query-agnostic proxy that gives the gate a closed-form prior to learn against. A shared final scoring projection puts every layer's and head's scores onto a single calibrated axis, which is the global-calibration trick that lets eviction decisions cross layers and heads. A single unified memory budget lets tokens from different layers, heads, and modalities compete directly for cache capacity. The theoretical claim, that preferentially retaining useful tokens reduces attention dilution, is the formal version of the framing flip.
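
The difference between a per-layer cap and a single global budget can be sketched as follows. This is a simplified illustration, assuming utility scores have already been mapped onto a shared axis; the gates and the scoring projection themselves are not modeled, and `evict_global` / `evict_per_layer` are hypothetical names rather than the paper's API.

```python
import numpy as np

def evict_global(scores_per_layer, budget):
    """Keep the `budget` highest-scoring entries across ALL layers.

    scores_per_layer: list of 1-D arrays of utility scores, assumed to be
    calibrated onto one shared axis. Returns one boolean keep-mask per layer.
    """
    flat = np.concatenate(scores_per_layer)
    budget = min(budget, flat.size)
    keep_flat = np.zeros(flat.size, dtype=bool)
    if budget > 0:
        keep_flat[np.argpartition(flat, -budget)[-budget:]] = True
    boundaries = np.cumsum([s.size for s in scores_per_layer])[:-1]
    return np.split(keep_flat, boundaries)

def evict_per_layer(scores_per_layer, cap):
    """Baseline: each layer independently keeps its own top-`cap` entries."""
    masks = []
    for s in scores_per_layer:
        k = min(cap, s.size)
        mask = np.zeros(s.size, dtype=bool)
        if k > 0:
            mask[np.argpartition(s, -k)[-k:]] = True
        masks.append(mask)
    return masks

layer_a = np.array([0.9, 0.8, 0.7, 0.1])   # layer with many useful entries
layer_b = np.array([0.2, 0.1, 0.05, 0.6])  # layer with mostly low-utility entries
global_masks = evict_global([layer_a, layer_b], budget=4)
local_masks = evict_per_layer([layer_a, layer_b], cap=2)
print([m.tolist() for m in global_masks])
print([m.tolist() for m in local_masks])
```

With the same total of four retained entries, the global budget keeps three entries from the useful layer and one from the other, while the per-layer cap is forced to keep a low-utility entry in `layer_b`. That reallocation only makes sense if scores are comparable across layers, which is the role the digest attributes to the shared scoring projection.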

The cross-paper read is the most interesting part. This paper is the language-model analogue of Stream-T1 (2026-05-07), which introduced the first content-aware KV eviction policy in the wiki but routed eviction by reward feedback on streaming video. Stream-T1 and Make Each Token Count now bracket content-aware eviction: the video side uses reward routing, the text side uses a learned utility gate. The other composition that matters is with MISA (2026-05-11). MISA sparsifies the indexer-head axis, while this paper sparsifies the cached-token axis with a global calibrator. Both reduce dilution at different points in the attention pipeline. The natural production-stack composition is MISA at the indexer plus learned eviction at the cache. The paper does not measure that composition; it remains the wiki's outstanding prediction.

Why it matters: Eviction is now a quality knob, not a memory knob. The production cost floor for long-context inference drops on two axes simultaneously: less memory and better reasoning. That is rare.

Research angle: The strong claim implies a crossover curve. At some context length, the selective policy crosses the full-cache baseline. Where is the crossover, and how does it move with task type? A second open question: does the learned eviction policy compose with TurboQuant-style ultra-low-bit quantization of the cache without retraining the gates? If yes, the cost floor drops by another integer factor.

Full summary


RLRT (Rebellious Student) + G-Zero: two reads of the teacher-student delta

Both papers extract training signal from the gap between what the model does on its own and what it does with help. RLRT reinforces tokens the student found without help. G-Zero uses the hint-induced change as the reward itself.

Sources: HuggingFace Daily Papers (both)
Links: RLRT paper · RLRT wiki · G-Zero paper · G-Zero wiki
Tier: 2. RL for LLMs, self-distillation, verifier-free training

RLRT (reinforce student's discoveries):
  teacher (with hint) ─┐
                       ├─► token-by-token compare on correct rollouts
  student (no hint)  ──┘    student-unique tokens ──► reinforce with GRPO

G-Zero (delta itself is the reward):
  generator (no hint) ──► response_A
                          │ KL(A, B) = Hint-delta
  generator (+hint)  ──► response_B
  proposer (GRPO) trains to maximize Hint-delta (find blind spots)
  generator (DPO) trains to internalize hint-guided improvements

RLRT and G-Zero are two reads, published on the same day, of the same underlying signal. Both look at how a model's behavior changes when given extra information versus when it operates alone. RLRT treats the change as a label for "the student is doing real work here" and reinforces those tokens on correct rollouts. G-Zero treats the change itself as the reward and trains a Proposer to find queries where the change is large (i.e., blind spots) while the Generator learns to close them via DPO.
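
The two reads of the delta can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions: the hint-delta is computed as a KL divergence between two categorical distributions (the KL direction and the distribution-level comparison are assumptions here), and "student-unique" is approximated by position-wise token mismatch. Both function names are hypothetical and neither paper's actual procedure is reproduced.

```python
import numpy as np

def hint_delta(p_no_hint, p_with_hint, eps=1e-12):
    """KL(p_no_hint || p_with_hint) between two categorical distributions.

    G-Zero-style read (assumed form): a large value marks a query where the
    hint changes the model's prediction a lot, i.e. a candidate blind spot.
    """
    p = np.asarray(p_no_hint, dtype=float)
    q = np.asarray(p_with_hint, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def student_unique_mask(student_tokens, teacher_tokens):
    """RLRT-style read (simplified): flag positions where the unhinted student
    emitted a different token than the hinted teacher."""
    n = min(len(student_tokens), len(teacher_tokens))
    return [student_tokens[i] != teacher_tokens[i] for i in range(n)]

print(hint_delta([0.5, 0.5], [0.5, 0.5]))         # hint changes nothing
print(hint_delta([0.9, 0.1], [0.1, 0.9]))         # hint flips the prediction
print(student_unique_mask([1, 2, 3], [1, 5, 3]))  # only position 1 differs
```

In this framing, the same quantity is used twice: as a scalar to maximize when proposing queries, and as a per-token mask for deciding which student tokens to reinforce on correct rollouts.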

Both papers extend a thread the wiki has been tracking since TIP (04-16, only 10% of distillation tokens carry signal) and LongAct (04-18, RL gradient updates concentrate on saliency peaks). The pattern is now six papers deep: training signal is locatable, and the gain from concentrating the budget on the located positions is consistent across diverse training paradigms. RLRT and G-Zero are the two newest data points, and they happen to use the same underlying signal (teacher-student delta) under different reads, which is the cleanest evidence yet that the design space is converging.

The structural reason this matters: G-Zero proves a best-iterate suboptimality bound under exploration-coverage and noise-control assumptions. That is the first formal guarantee in this thread, and it is for verifier-free self-improvement. If the assumptions hold in practice, the verifier bottleneck for open-ended generation is no longer a hard ceiling. RLRT does not have a comparable formal bound but provides the empirical complement: same signal, different read, works on base/instruction/thinking-tuned Qwen3.

Why it matters: Information asymmetry as a design axis is now both empirically validated (RLRT) and theoretically grounded (G-Zero, on the verifier-free regime). The next 90 days should see a third paper compose the two: use G-Zero's Proposer to identify blind spots, then use RLRT's reverse-read to reinforce the student's successful path on those blind spots.

Research angle: The Worth Watching from 2026-04-19 (the VGF digest) predicted that the next paper to formalize selective exploration would do so on the training side. Today's two papers both qualify as partial resolutions of that prediction, with G-Zero providing the formal guarantee. The prediction can be marked as resolved.

RLRT summary · G-Zero summary


Geometry Conflict + Model Merging Scaling Laws: the microscope and the macroscope of merging

One paper explains why sequential merges interfere (covariance-geometry conflict between the new task and the current model state). The other quantifies how fast gains decay (roughly 1/k in the number of experts). Together they make merging predictable.

Sources: HuggingFace Daily Papers (both)
Links: Geometry Conflict paper · Geometry Conflict wiki · Scaling Laws paper · Scaling Laws wiki
Tier: 2. Continual learning, model merging, scaling laws

Geometry Conflict represents each post-training task by its parameter update and studies the covariance geometry of that update. The central claim: forgetting is a state-relative update-integration failure, arising when a new task's covariance geometry misaligns with the geometry of the evolving model state. The proposed method, GCWM (Geometry-Conflict Wasserstein Merging), builds a shared Wasserstein metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate geometry-aware correction. Data-free. Beats data-free baselines on Qwen3 0.6B-14B in domain-continual and capability-continual settings.
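
For intuition on the Gaussian Wasserstein machinery, the squared 2-Wasserstein distance between two zero-mean Gaussians has a standard closed form in their covariances. The sketch below computes it with NumPy. Whether GCWM's conflict score is exactly this quantity is not stated in the digest, so treat it as an assumed proxy for "how misaligned are two update geometries"; `psd_sqrt` and `gaussian_w2_squared` are hypothetical helper names.

```python
import numpy as np

def psd_sqrt(m):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    vals, vecs = np.linalg.eigh((m + m.T) / 2)  # symmetrize against round-off
    vals = np.clip(vals, 0.0, None)             # drop tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_w2_squared(cov_a, cov_b):
    """Squared 2-Wasserstein distance between N(0, cov_a) and N(0, cov_b):
    tr(A) + tr(B) - 2 * tr((A^{1/2} B A^{1/2})^{1/2})."""
    root_a = psd_sqrt(cov_a)
    cross = psd_sqrt(root_a @ cov_b @ root_a)
    return float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * np.trace(cross))

# Two updates with the same total variance but energy along different axes.
cov_task = np.diag([4.0, 1.0])
cov_state = np.diag([1.0, 4.0])
print(gaussian_w2_squared(cov_task, cov_task))   # aligned geometry: ~0
print(gaussian_w2_squared(cov_task, cov_state))  # misaligned geometry: > 0
```

A distance like this could serve as a data-free gate of the kind the digest describes: it needs only the covariance of the parameter updates, not the task data.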

Model Merging Scaling Laws is the empirical complement. It identifies a compact power law with two pieces: a size-dependent floor that decreases with model capacity, and a merging tail with clear diminishing returns in the number of experts. Headline: gains fall roughly as 1/k. The law holds in-domain and cross-domain across four merging methods (Average, TA, TIES, DARE). The theoretical part of the paper derives the 1/k tail and ties the floor and tail to base-model properties and across-domain diversity.

Reading them together is the productive move. The 1/k decay is the macroscopic curve. Geometry Conflict explains the microscopic mechanism: as the model state accumulates updates, the probability of geometry conflict with the next task rises, which lowers the marginal gain. The scaling-law paper does not need the geometric account to derive the 1/k tail, but the two together turn merging from a benchmark game into a predictable, budget-aware design problem. Production teams can now estimate how many experts they need to hit a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget. The geometry-conflict signal also opens a more targeted move: detect which expert additions will cause conflict before applying them, and skip or correct those specifically.
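
The budgeting use case can be made concrete with a small calculator. It assumes the simplest functional form consistent with the digest's description, loss(k) = floor + tail_coeff / k, where `floor` stands in for the size-dependent floor and `tail_coeff` for the merging-tail coefficient. The paper's fitted law may differ, and every number below is hypothetical.

```python
import math

def merged_loss(k, floor, tail_coeff):
    """Assumed law: a size-dependent floor plus a 1/k merging tail."""
    return floor + tail_coeff / k

def marginal_gain(k, floor, tail_coeff):
    """Loss reduction from adding one more expert, going from k to k + 1."""
    return merged_loss(k, floor, tail_coeff) - merged_loss(k + 1, floor, tail_coeff)

def experts_needed(target_loss, floor, tail_coeff):
    """Smallest k with merged_loss(k) <= target_loss; None if the target sits
    at or below the floor, where only a larger base model would help."""
    if target_loss <= floor:
        return None
    return max(1, math.ceil(tail_coeff / (target_loss - floor)))

# Hypothetical fitted values for one base-model size.
floor, tail_coeff = 2.0, 0.5
print(experts_needed(2.125, floor, tail_coeff))  # target above the floor
print(experts_needed(1.9, floor, tail_coeff))    # target below the floor
print([round(marginal_gain(k, floor, tail_coeff), 4) for k in (1, 2, 4, 8)])
```

The `None` branch is the "scale the base model instead" signal: under this form, no number of experts pushes the loss below the floor, so the only remaining lever is the floor itself.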

Why it matters: Merging crosses the threshold from "useful trick" to "predictable engineering practice" with these two papers. The implication for training-budget allocation is concrete: a closed-form for "scale N or add k experts" becomes derivable from the law's two components, conditioned on geometry-conflict diagnostics.

Research angle: The Wasserstein barycenter construction assumes Gaussian update distributions. Real parameter updates are heavy-tailed, so non-Gaussian Wasserstein constructions are the cleanest follow-up. Independently: the scaling law is measured in cross-entropy, and the mapping from cross-entropy to downstream task accuracy is non-monotonic in many regimes, so a per-task correction is needed for true budget allocation.

Geometry Conflict summary · Scaling Laws summary


Soohak: research-level math and the refusal frontier

Frontier models top out at about 30% on a 439-problem research-math benchmark authored by 64 mathematicians. On the refusal subset (recognize ill-posed problems and abstain), every model stays under 50%. Calibrated refusal is the new training target.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 2. Reasoning, benchmarks, evaluation

After IMO gold was reached, "olympiad accuracy" stopped being a meaningful frontier signal. Soohak is the cleanest attempt at the next rung: can frontier models advance the frontier of mathematical knowledge, not just execute textbook reasoning faster? The Challenge subset gives one answer (Gemini-3-Pro 30.4%, GPT-5 26.4%, Claude-Opus-4.5 10.4%), and the gap to research-grade is large. Open-weight leaders are all under 15%.

The refusal subset is the more interesting half. Research mathematicians spend large amounts of their time recognizing when a problem is ill-posed and pausing rather than producing confident but unjustified answers. Soohak constructs an evaluation that directly tests this. No model exceeds 50%. Standard RL post-training optimizes confidence-on-correct and penalizes confident-wrong but does not reward calibrated abstention on ill-posed inputs. That is a structural omission, not a capability ceiling.

The 90-day prediction is mechanically tractable. Add ill-posed problems with abstain-as-correct labels into the post-training mix. The mechanism (recognize ill-posedness, output abstention) is already in the model's repertoire on other tasks, so transfer should be straightforward. Refusal scores on the Soohak subset should move from ~30% to >70% with modest data. Whether that holds is the empirical question worth tracking.
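
The three-class signal described above can be written as a reward function. This is a minimal sketch with hypothetical reward values and a hypothetical `ABSTAIN` marker; it is not Soohak's scoring rule, only one way to encode "abstain-as-correct" labels in a post-training mix.

```python
ABSTAIN = "ABSTAIN"  # hypothetical marker the policy emits to decline to answer

def three_class_reward(answer, gold, ill_posed):
    """Reward correct answers, penalize wrong ones, and reward abstention
    only when the problem is labeled ill-posed. Values are illustrative."""
    abstained = answer == ABSTAIN
    if ill_posed:
        # Abstaining is the labeled-correct behavior on ill-posed inputs.
        return 1.0 if abstained else -1.0
    if abstained:
        # Assumption: abstaining on a well-posed problem is neutral, not wrong.
        return 0.0
    return 1.0 if answer == gold else -1.0

print(three_class_reward("42", "42", ill_posed=False))     # correct answer
print(three_class_reward("41", "42", ill_posed=False))     # confident and wrong
print(three_class_reward(ABSTAIN, None, ill_posed=True))   # calibrated abstention
print(three_class_reward("42", None, ill_posed=True))      # confident on ill-posed
```

The neutral reward for abstaining on well-posed problems is the design choice to watch: set it too high and the policy can learn to abstain everywhere, set it too low and abstention never gets explored.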

Why it matters: Soohak operationalizes the "frontier models are over-confident" critique into a benchmark that does not require new capabilities to solve, only new training targets. That is the cleanest invitation in months for the post-training community to address a structural gap.

Research angle: Two questions. Does abstention training on Soohak's refusal subset generalize to non-math ill-posed inputs (legal hypotheticals, scientific underdetermination)? And does it transfer to the Kurate cs.AI #5 paper's diagnosis ("AI scientists produce results without reasoning scientifically")? If yes, refusal becomes a domain-general capability target.

Full summary


Auto-Rubric and DeltaRubric: factorized multimodal reward modeling

Two papers on the same day externalize multimodal preferences as inspectable rubrics. Add RationalRewards from 04-16 and Themis (text-code) from 05-04, and the pattern is four papers strong across text, code, and multimodal domains in under four weeks.

Sources: HuggingFace Daily Papers (both)
Links: ARR paper · ARR wiki · DeltaRubric paper · DeltaRubric wiki
Tier: 2. Multimodal RLHF, reward modeling, alignment

Both papers attack lazy judging in multimodal reward models: single-step evaluators exploit language priors over fine-grained visual verification. ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics upstream of policy training, then trains the policy via Rubric Policy Optimization (RPO) against rubric-conditioned binary rewards. DeltaRubric runs the same plan-then-verify logic inside a single MLLM trained as a multi-role RL problem: the model first acts as a Disagreement Planner (writes a checklist), then transitions to a Checklist Verifier (executes it against the image). On VL-RewardBench, DeltaRubric gains +22.6 / +18.8 points on Qwen3-VL 4B/8B over no-rubric baselines.
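
The shared primitive (factorize, judge per dimension, recombine) can be sketched independently of either paper. In the toy version below, each rubric item is a named binary check and the reward is the fraction of checks passed. The string-matching checks are stand-ins for a VLM judge, averaging is an assumed recombination rule, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A rubric item is a (name, binary check) pair. In a real pipeline the check
# would be a judge model verifying the response against the image.
Criterion = Tuple[str, Callable[[str], bool]]

def rubric_reward(response: str, rubric: List[Criterion]) -> Tuple[float, Dict[str, int]]:
    """Score a response per rubric item (0 or 1), then recombine by averaging."""
    per_item = {name: int(check(response)) for name, check in rubric}
    score = sum(per_item.values()) / len(per_item) if per_item else 0.0
    return score, per_item

# Hypothetical prompt-specific rubric for an image-description task.
rubric = [
    ("mentions_color", lambda r: "red" in r.lower()),
    ("counts_objects", lambda r: "two" in r.lower()),
    ("no_hedging", lambda r: "maybe" not in r.lower()),
]

print(rubric_reward("Two red cars are parked.", rubric))
print(rubric_reward("Maybe a red car.", rubric))
```

The per-item dictionary is what makes the reward inspectable: a low score comes with the list of criteria that failed, which a scalar reward model cannot provide.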

The diminishing-returns pattern in DeltaRubric (4B gains more than 8B) is informative. It suggests rubric-based reward modeling is a scaffolding intervention rather than a capability intervention, and there should be a model scale at which one-shot judges catch up. That crossover is unmeasured but tractable.

The pattern across four papers (RationalRewards 04-16, Themis 05-04, ARR today, DeltaRubric today) is the right level to read this signal at. Scalar reward models for non-trivial generation are being phased out. The next-generation default is factorized criteria with verifiable per-dimension judgments. The open question is reward-hacking surface: scalar RMs are easy to game on the scalar, factorized RMs may be easier to game per-dimension or via the rubric-generation step. No one has the answer yet.

Why it matters: The rubric-as-reward thread is now load-bearing for multimodal post-training. Any team building a non-trivial multimodal alignment pipeline in the next quarter will need to decide whether to use one of these mechanisms or build a new one. The default has shifted.

Research angle: Does ARR's prompt-specific rubric overfit? The standard RM-overfitting question becomes harder when the RM's rubric is learned per prompt. The follow-up to track is whether reward hacking with learnable rubrics is structurally different from reward hacking with scalar RMs.

ARR summary · DeltaRubric summary


Industry Pulse


Connecting the Dots

The clearest cross-paper connection today is the "concentrate the budget where the signal lives" thread, now visible across four research papers (Make Each Token Count at the cache layer, RLRT and G-Zero at the training layer, Geometry Conflict at the merging layer) and at least three prior wiki entries (TIP 04-16, LongAct 04-18, Compliance vs Sensibility 05-02). The mechanism varies per layer but the architectural move is identical: identify locatable substructure, route the budget there, get most of the gain. Make Each Token Count is the strongest version because it claims the selective policy exceeds the dense baseline, not just matches it. That is the same claim LongAct made for training-side sparse RL updates a month ago. Two papers now claim sparse-target updates strictly dominate dense updates, in two different settings. The third paper that confirms this in a third setting (likely RL-rollout sampling, given the prompt-caching-for-RL-training Reddit post below) would establish a hard pattern.

PreRL (04-16, distribution) ── LongAct (04-18, sparse RL) ── TIP (04-16, distill)
                                       │
                            "concentrate budget on locatable signal"
                                       │
RLRT + G-Zero (today, teacher-student delta) ── Make Each Token Count (today, KV)
                                       │
                              Geometry Conflict (today, merging)

A second connection comes from cross-source confirmation between Make Each Token Count and the r/LocalLLaMA post on Qwen 3.6 35B A3B. The user reports that gated-delta-net, hybrid Mamba2, and sliding-window attention combined make small local models meaningfully smarter on long-context reasoning ("I can now feed a model an entire academic paper along with accompanying code"). That is the practitioner-side confirmation of MDN (2026-05-11) and UniPrefill (2026-05-11): the hybrid-architecture wave is real and is delivering long-context capability gains at small scale. Make Each Token Count plugs cleanly into this stack as the eviction-side complement.

A third cross-source signal: the prompt-caching-for-RL-training Reddit post reports a 7.5x speedup on long-prompt / short-response RL workloads via prompt caching. This composes with the Speculative Decoding for RL Rollouts paper (2026-04-30, NVIDIA) which reported 1.77x on the same axis via draft-model integration. The RL-rollout cost is now a layered optimization surface (caching plus speculative decoding plus selective gradient updates plus sparse attention), and the per-axis gains stack.

On the industry side, the strongest connection is between Baidu Ernie 5.1's 94% training-cost reduction and the prior week's NVIDIA-$40B, ByteDance-$30B, OpenAI-Deployment-Company sequence. The capacity-binding-constraint thread now has a counter-thread: training-cost engineering. If the Once-For-All approach (extract many sub-models from one training run) generalizes, the marginal value of each new NVIDIA cluster drops for any lab adopting it. The geopolitical read is that a Chinese lab published this in the same week that the EU and the US both moved on pre-release safety review. The frontier-spend model and the pre-release-review model are pulling apart.

A fourth thread runs through Soohak and the Kurate cs.AI #5 paper ("AI scientists produce results without reasoning scientifically"). Two independent benchmarks in two weeks find that frontier models confidently answer questions they should refuse. The Algorithmic Bridge piece on "AI brain fry," which reports a BCG study, adds a human-side complement: workers using four or more agents simultaneously report 33% more decision fatigue and 39% more major errors, because they lose calibration about what to verify. The Soohak refusal gap and the BCG brain-fry gap may be the same gap measured at two ends: models do not know when to abstain, and humans cannot keep up with what to check.


Worth Watching


Quick Hits

Omni-Persona (arXiv 2605.09996). First omnimodal (text+image+audio) personalization benchmark, with Calibrated Accuracy that rewards both correct grounding and appropriate abstention on absent-persona queries. RLVR partially closes the audio-vs-visual grounding gap. The abstention-on-absent-persona framing is the same calibrated-refusal pattern Soohak surfaces in math. Two abstention benchmarks in one day.

Sub-JEPA (arXiv 2605.09241). Subspace Gaussian regularization for JEPA world models. Applies Gaussian constraints in multiple random subspaces rather than the full embedding space, sitting at a better bias-variance operating point than LeWorldModel. Cleaner anti-collapse without over-biasing.

X-OmniClaw (arXiv 2605.05765). Android-native mobile agent with hybrid XML-plus-visual grounding, working memory plus long-term personal memory, and behavior-cloned skill traces. The mobile-agent surface is consolidating. → summary

Entity Identity Confusion (arXiv 2605.06096). Multimodal knowledge editing has a failure mode where text-only queries about the original entity return the new entity. The paper traces it to Image-Entity vs Entity-Entity binding confusion. Constraining edits to the I-E processing stage mitigates. Diagnostic finding, not a deployable technique yet.

SemiAnalysis EDA Primer (SemiAnalysis). Industry-grade walkthrough of the RTL-to-silicon pipeline. Hardware-tier reading material rather than a research item. Pairs well with the r/CUDA "Writing an LLM compiler from scratch, Part 2" for the GPU-compiler side.

SuperG-DPO for biomolecular generation (arXiv 2605.10004), RigidFormer for rigid-body dynamics (arXiv 2605.05839), TD3B for discrete-diffusion allosteric binders (arXiv 2605.05991), Kazakhstan Movie Reviews, 100K Russian/Kazakh/code-switched (arXiv 2605.08741). Tier 3/4 specialized work. Noted, not engaged.

r/LocalLLaMA practitioner reports. Qwen 3.6 35B A3B hype confirmed via long-context smarts from gated delta net / Mamba2 hybrid; Intel Optane Persistent Memory running a 1T-parameter model at 4 tok/s; MTP on Unsloth; 500K context on 48GB VRAM via Nemotron-3-Super-64B-A12B-Math-REAP-GGUF; ExLlamaV3 major updates. The hybrid-architecture wave is now the practitioner default for long-context small models.

Import AI 456: Neural Computers, Radical Optionality, RSI economic growth (Substack). Schmidhuber/Meta/KAIST's Neural Computers paper asks whether a single neural net can replace an OS. Conceptual paper, "Wright brothers before takeoff" prototype. Pairs interestingly with Decoupled DiLoCo from Google (same issue): one paper imagines dissolving software into weights, the other dissolves a training run across datacenters. The trend line is removing layers of system structure in favor of learned dynamics.

Alberto Romero on AI brain fry (Algorithmic Bridge). BCG study via HBR: workers using 4+ agents simultaneously report 33% more decision fatigue and 39% more major errors. Human-side complement to today's Soohak refusal gap. Models do not know when to abstain, humans cannot keep up with what to verify.

Simon Willison on GitLab structural unemployment (blog). Notes that frontier-model deployment is starting to compress middle-management layers in tech orgs. Adjacent to the OpenAI Deployment Company story.


Sources ingested today: HF (17 papers), RSS (10 posts dated 05-10/11/12), Gmail (1 starred AI Breakfast), Twitter morning slot (6 tweets / 0 retweets) + 05-11 evening (3 tweets / 2 retweets / 2 articles) + 05-11 afternoon (empty), Kurate cs.AI + cs.LG weekly leaderboards (rising authors threshold not crossed), Reddit (8 subs, ~17 posts after filters) | Wiki pages updated: 11 (1 Tier 1 summary, 7 Tier 2 summaries, 1 mobile-agent summary, 2 concept-page updates: kv-cache.md + rl-for-llms.md)