May 16, 2026 · daily digest

cere-bro | 2026-05-16

Three Tier 1 papers in one batch say the same thing about three different layers of the stack. The training loop's "give it all the information uniformly" default is the new wasteful baseline. The reform is to schedule attention, experts, and teachers selectively. On the industry side, Anthropic crossed OpenAI's valuation while Microsoft pulled its internal Claude Code licenses on the same day, because the value capture point in AI is moving from the model API to the agent harness that wraps it.


TL;DR


The Big Picture

A two-month pattern crystallized today, and the easiest way to see it is to walk through the dates in order. On 04-16, TIP and Make Each Token Count showed at the loss layer that only about 10% of teacher tokens in distillation carry real signal, so weighting all of them uniformly leaves accuracy on the table. On 04-18, LongAct showed something similar at the gradient layer: long-context training-signal density is concentrated in roughly the first 5% of tokens, so most of the gradient update is wasted on the rest. On 05-12, a Make Each Token Count follow-up generalized that argument to the KV cache (the memory store that saves prior attention computations to avoid recomputing them) and proposed learned eviction rather than uniform retention. On 05-14, the Extrapolation Cliff paper derived a closed-form threshold above which uniform on-policy distillation collapses, replacing the "always distill" default with a "distill only when the math says it is safe" rule. On 05-15, SDAR gated the student's absorption of teacher signal with a sigmoid over detached token-level features. Seven papers across two months, each rejecting one specific "treat every X uniformly" default.

Today adds three more layers. Lighthouse Attention rejects uniform pre-training attention by pooling queries, keys, and values into a multi-resolution pyramid and routing through a tiny dense sub-sequence. BEAM rejects uniform top-K Mixture-of-Experts routing by letting each token learn its own expert subset via a trained binary mask. ATESD rejects uniform teacher exposure in self-distillation by making the reveal ratio a learnable control variable scored by a discounted learning-progress reward. The diagnosis is now identical across ten papers and eight layers of the stack: the gradient layer, the token weighting layer, the cache eviction layer, the distillation branch-selection layer, the student gating layer, the pre-training attention layer, the MoE expert-set layer, and the teacher exposure layer. Any "every X gets equal Y" default that still survives in this stack is now the obvious target for the next paper.

The second thread is the deployment substrate convergence. Lighthouse Attention on the open-source training side and Subquadratic's 56.2x speedup at 1M tokens on the production side (a closed lab's result, independently validated by Appen and surfaced in the Gmail-starred batch on 05-15) are arriving in the same week with the same headline structure. NVIDIA's NVFP4 quantization release for Kimi-K2.6 (05-15, an open-weights 4-bit floating-point format that shrinks each parameter without losing much accuracy) handles the per-byte axis. BEAM handles the per-token expert-set axis. Forcing-KV from 05-15 (head-role-conditioned KV cache compression for video diffusion) handles the per-head cache axis. The asynchronous continuous-batching primitive from 05-15 (which overlaps CPU prep of batch N+1 with GPU compute of batch N) handles the scheduling axis. Five independent improvements to the inference and training stack in seven days. None are model changes. Because each one acts on a different axis, the gains should compose roughly multiplicatively. Nobody has run the joint experiment, but the back-of-envelope estimate is a 5-10x throughput improvement on the same hardware in 2026 over what was available in 2025, without any new architecture.

The third thread is industry rebalancing. Anthropic crossing OpenAI's valuation, raising another $30B for compute, and simultaneously losing Microsoft as an internal Claude Code customer is not three separate stories. It is one story from three angles. WildClawBench on 05-15 measured an 18-point spread between the worst and best agent harness running the same underlying model on the same 60 long-horizon benchmark tasks. The harness is doing more work than the model. Microsoft just decided that if 18 points of performance lives in the harness, owning the harness is strategic and renting it from a competitor is not. Five frontier labs now run their own coding-agent CLI (Anthropic's Claude Code, OpenAI's Codex with a fresh mobile app, xAI's Grok Build CLI confirmed by The Decoder this morning, Google's Gemini CLI in private beta, and Microsoft's GitHub Copilot CLI). The model API is a commodity. The agent harness is where the user lives and where the spend lands.


Deep Dives


Lighthouse Attention: training-only, kernel-decoupled long-context pre-training

A wrapper that lives only during training, runs no custom kernel, requires no auxiliary loss, and removes itself before the model ships. 1.4 to 1.7 times faster at 98K context, about 17 times faster forward and backward at 512K on a single B200 GPU.

Source: HuggingFace Daily Papers · @NousResearch retweet (2026-05-15) · r/MLScaling discussion
Links: Paper · Code · Wiki
Tier: 1. Long-context pre-training, GPU efficiency, kernel-decoupled attention

   Standard SDPA at 512K            Lighthouse Attention at 512K
   ────────────────────             ───────────────────────────────
                                    Q ─► pool symmetrically ─► Q'
   O(N²)                            K ─► pool symmetrically ─► K'
   FlashAttention on full           V ─► pool symmetrically ─► V'
   N × N attention matrix

                                    score every pyramid head
                                    top-k cascade picks hierarchical
                                    dense sub-sequence
                                    sort to preserve left-to-right

                                    feed Q', K', V' through ordinary
                                    FlashAttention on the short
                                    selected sub-sequence

                                    short recovery phase removes
                                    wrapper. Model ships as standard
                                    dense-attention.

   1.0x training speed              1.4-1.7x at 98K, ~17x at 512K
                                    (forward+backward, single B200)

The structural novelty is in what Lighthouse refuses to do. It does not replace softmax attention with a state-space machine (SSMs and linear-attention variants are alternative sequence mixers that scale better than O(N²) but produce different inductive biases). It does not write a custom sparse kernel that the GPU vendor's flagship attention kernel cannot reuse. It does not require a straight-through estimator (the trick where you backpropagate through a non-differentiable operation by pretending it is the identity), and it does not need an auxiliary loss. It is a wrapper that intercepts the attention call, runs a cheap pyramid-pooling pass plus a top-k selection, and then calls standard FlashAttention on a much shorter input. Toward the end of training it stops doing even that, and the model is left as a plain dense-attention checkpoint.

The pooling is symmetric across Q, K, and V. That is the move that earlier selective-attention work declined to make. Most prior designs pooled keys and values into a compressed memory store but kept queries at full resolution, treating the cache as an addressable index that the queries probe. Lighthouse pools all three so the queries themselves carry hierarchical structure. This is what lets the gradient-free top-k cascade learn something more interesting than a memory index. The selection becomes structural. The model learns to make queries at multiple resolutions and the cascade picks the appropriate level.
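
The wrapper flow described above (pool Q, K, and V symmetrically, score the pyramid entries, take a top-k, restore left-to-right order, then run ordinary attention on the short result) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the window sizes, the dot-product scoring rule, the non-causal attention, and every function name here are hypothetical, and `dense_attention` only stands in for the FlashAttention call.

```python
import math

def avg_pool(seq, window):
    """Average-pool a list of vectors over non-overlapping windows."""
    pooled = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return pooled

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_attention(q, k, v):
    """Plain softmax attention; a stand-in for the unmodified attention kernel."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        logits = [dot(qi, kj) * scale for kj in k]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        dim = len(v[0])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total for d in range(dim)])
    return out

def lighthouse_step(q, k, v, windows=(1, 2, 4), top_k=4):
    entries = []  # (score, start position, window, pooled q, pooled k, pooled v)
    for w in windows:
        # The same pooling is applied to Q, K and V: the symmetric part.
        pooled = zip(avg_pool(q, w), avg_pool(k, w), avg_pool(v, w))
        for i, (qp, kp, vp) in enumerate(pooled):
            score = dot(qp, kp)  # assumed scoring rule, for illustration only
            entries.append((score, i * w, w, qp, kp, vp))
    # Hard top-k selection: no gradient flows through this choice.
    chosen = sorted(entries, key=lambda e: e[0], reverse=True)[:top_k]
    # Re-sort by position so the short sequence keeps left-to-right order.
    chosen.sort(key=lambda e: (e[1], e[2]))
    short_q = [e[3] for e in chosen]
    short_k = [e[4] for e in chosen]
    short_v = [e[5] for e in chosen]
    spans = [(e[1], e[2]) for e in chosen]
    return dense_attention(short_q, short_k, short_v), spans

q = [[float(i), 1.0] for i in range(8)]
k = [[float(i), 1.0] for i in range(8)]
v = [[1.0, float(i)] for i in range(8)]
out, spans = lighthouse_step(q, k, v)
print(spans)  # (start position, window size) of each selected pyramid entry
```

The point of the sketch is the division of labor: everything novel happens before the attention call, and the attention call itself sees only a shorter, ordinary input.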

The training-only framing has an unexpected interpretability hook. Each pyramid-head selection is a routing decision the model implicitly learned. Whether those decisions correlate with content boundaries (entities, document breaks, syntax phases) is the open question. If they do, Lighthouse is a self-supervised structural prior for free, in addition to being a speedup.

Why it matters: Two weeks ago the long-context pre-training conversation had two options. Either replace softmax with a state-space model, or write a custom sparse kernel and hope the GPU vendor optimizes it. Lighthouse opens a third path: leave the kernel alone, leave the architecture alone, just change what gets fed into the kernel. The empirical numbers are still small-scale, but the design pattern is much harder to argue with. The same week, Subquadratic (a closed lab) announced a 56.2x speedup at 1M tokens, independently validated by Appen, along with 81.8% on SWE-bench Verified. Two groups, same axis, same week. The proprietary deployment side and the open-source pre-training side are converging on the same recipe.

Research angle: Four open problems. (1) Lighthouse at 1B+ parameters. Nous's experiments are small-scale; the recovery-phase scaling is the load-bearing experiment. Falsifiable: a paper that runs Lighthouse at 1B+ parameters and shows the recovery-phase loss matches full-attention training on equivalent tokens. (2) Lighthouse for long-context fine-tuning. Most production deployments need long-context supervised fine-tuning, not pre-training. The wrapper should transfer naturally. Untested. (3) Top-k cascade interpretability. If pyramid heads correlate with content boundaries, Lighthouse is also a structural prior. (4) Lighthouse plus Mamba hybrid. SANA-WM (05-15) showed hybrid Mamba plus softmax works for video diffusion; Lighthouse plus Mamba alternating layers is the obvious composition.



BEAM: binary expert activation masking ships with a vLLM kernel

Each token learns its own expert subset inside a Mixture-of-Experts model, replacing the fixed top-K rule with a trainable binary mask. 98%+ quality retention at sparsity levels (up to 85% expert-FLOP reduction) where previous methods collapse. The vLLM kernel is what turns the paper into a deployment instead of a curiosity.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 1. MoE routing, dynamic sparsity, inference efficiency

   Standard Top-K MoE            BEAM (binary mask per token)
   ──────────────────            ──────────────────────────────
   token ─► router ─► top-K      token ─► gating head ─► binary
            experts                          mask over experts
            (K fixed, e.g. 2)              (count adapts; STE
                                            carries gradients;
                                            aux regularizer
                                            enforces budget)

   redundant compute on
   easy tokens                   end-to-end training induces
                                  dynamic sparsity (no
   suboptimal on hard            train-inference mismatch)
   tokens
                                  ↓
                                  custom CUDA kernel + vLLM
                                  >98% retention
                                  up to 85% MoE FLOP reduction
                                  2.5x decoding speedup
                                  1.4x throughput

A Mixture-of-Experts (MoE) model lets each input token activate only a small subset of specialized sub-networks rather than the full feed-forward block. Frontier MoE models like Kimi-K2, Qwen3.5-397B, and MiniMax-M2.5 all use a fixed top-K rule, typically K=2 or K=4. The same number of experts is activated per token regardless of how easy or hard the token is. That is wasteful on easy tokens and possibly suboptimal on hard ones. BEAM replaces top-K with a per-token binary mask learned end-to-end during training. A regularizer keeps the average count near a target budget.

The training move uses a straight-through estimator. The forward pass uses a hard binary mask. The backward pass pretends the mask was continuous so gradients can flow. This is the same trick used in binary neural networks and quantization-aware training. The novelty here is the regularizer that holds the mean activation count near the target so the model does not collapse to "use all experts everywhere" or "use no experts anywhere." The result is quality holding above 98% at sparsity levels (up to 85% reduction in MoE FLOPs) where prior methods collapse, because prior methods either retrained the full model (expensive) or applied post-hoc thresholds at inference time (severe quality drop from the mismatch between training and inference).
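
A minimal sketch of the straight-through mask and a budget regularizer, written with hand-rolled gradients so the forward and backward halves are visible. The squared-gap penalty, the zero threshold on the logits, and all function names are assumptions made for illustration; the paper's actual regularizer may take a different form.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ste_mask_forward(logits):
    """Forward pass: a hard 0/1 mask over experts for one token."""
    return [1.0 if z > 0.0 else 0.0 for z in logits]

def ste_mask_backward(logits, grad_mask):
    """Backward pass: pretend the mask was sigmoid(logit) and use its slope."""
    return [g * sigmoid(z) * (1.0 - sigmoid(z)) for z, g in zip(logits, grad_mask)]

def budget_penalty(batch_logits, target_active):
    """Squared gap between the mean active-expert count and the target budget.

    Returns the penalty, its straight-through gradients w.r.t. the gating
    logits, and the hard masks used in the forward pass.
    """
    masks = [ste_mask_forward(row) for row in batch_logits]
    mean_active = sum(sum(m) for m in masks) / len(masks)
    gap = mean_active - target_active
    # d(penalty)/d(mask entry) is the same for every entry of the batch.
    grad_per_entry = 2.0 * gap / len(masks)
    grads = [ste_mask_backward(row, [grad_per_entry] * len(row)) for row in batch_logits]
    return gap ** 2, grads, masks

# Two tokens, four experts: an "easy" token and a "hard" one (made-up logits).
batch_logits = [[2.0, 1.0, 0.5, -1.0], [3.0, -2.0, -0.5, -1.0]]
penalty, grads, masks = budget_penalty(batch_logits, target_active=1.0)
print(masks)    # the first token activates 3 experts, the second activates 1
print(penalty)  # mean count is 2.0 against a budget of 1.0
```

With the mean count above budget, every gradient is positive, so a gradient-descent step lowers the gating logits and pushes experts toward being switched off. Below budget the sign flips, which is what keeps the mask away from both degenerate extremes.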

The kernel is the second half of the contribution. Top-K MoE inference has the convenient property that you index into K specific experts and the index pattern is predictable. Dynamic-K means the index pattern changes per token. Naive implementations of this hit GPU memory-coalescing problems that destroy any FLOP win. BEAM's custom CUDA kernel uses a contiguous-memory layout that exploits the binary mask structure, and the vLLM integration is the production-side win.
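
To make the memory-layout problem concrete, here is one common way to turn a per-token binary mask into contiguous per-expert work: a CSR-style grouping, where each expert reads one contiguous slice of token ids. This is only an assumed illustration of the general idea in Python. It is not BEAM's CUDA kernel, and the source does not describe the kernel's actual layout.

```python
def group_tokens_by_expert(masks):
    """Group token ids by expert from a binary mask.

    masks[t][e] is 1 when token t activates expert e. Returns (token_ids,
    offsets), where expert e owns token_ids[offsets[e]:offsets[e + 1]].
    """
    n_experts = len(masks[0])
    counts = [0] * n_experts
    for row in masks:
        for e, bit in enumerate(row):
            counts[e] += bit
    # Prefix sums give each expert its contiguous slice.
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    token_ids = [0] * offsets[-1]
    cursor = offsets[:-1].copy()
    for t, row in enumerate(masks):
        for e, bit in enumerate(row):
            if bit:
                token_ids[cursor[e]] = t
                cursor[e] += 1
    return token_ids, offsets

# Four tokens, three experts; the active count varies per token.
masks = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 1]]
token_ids, offsets = group_tokens_by_expert(masks)
print(token_ids)  # [0, 1, 3, 0, 2, 2, 3]
print(offsets)    # [0, 3, 5, 7]
```

Under fixed top-K the total length of `token_ids` is always K times the token count, so buffers can be sized ahead of time. With a learned mask that length changes every batch, which is part of why a dedicated kernel is needed.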

Where does BEAM sit in the routing literature the wiki has been tracking? The wiki's running question is "where is the routing decision made." Today the answers are: at the model level (TraceR, 04-17, which builds a small classifier over query embeddings to dispatch between models), at the adapter level (MinT, 05-14, which makes a million-scale LoRA adapter catalog the routing surface), at the expert-router level (CaRE, 05-11, which adds a router above existing MoE experts for task-level routing), at the post-training latent-code level (DLR, 05-15, which jointly learns discrete latent codes and routing policies as a training objective), at the cache-eviction level (Make Each Token Count, 05-12, which learns which KV entries to drop), at the head-role level (Forcing-KV, 05-15, which compresses static-vs-dynamic attention heads differently), at the distillation-loss level (SDAR, 05-15, which gates OPSD over detached signals), and now at the per-token expert-subset level (BEAM today). Plus the orthogonal profile-design axis from RouteProfile (05-15, which showed structured trainable profiles for routers beat flat domain-level ones). Eight distinct addressable layers plus one orthogonal axis, all unaddressed two months ago.

Why it matters: Every frontier MoE serving stack runs fixed top-K. BEAM is the first paper proposing a deployable mechanism that lets the model decide K per token, with the vLLM kernel needed to realize the FLOP win. Stack BEAM on top of NVFP4 quantization (per byte) and Forcing-KV (per head) and the inference cost of frontier MoE models drops by a multiplicative factor without any model change.

Research angle: (1) BEAM + DLR composition. DLR's discrete latent codes have been shown to be causally distinct (each code drives a recognizable behavior change when ablated). If those codes drive BEAM's mask network, the per-token expert count becomes a function of the model's internal task representation. One-paper extension. (2) Train-side cost. BEAM reports inference wins but not training overhead. STE plus auxiliary regularization typically costs >20% in training time. If that holds, the deployment story shifts for frontier-scale runs. (3) BEAM for sparse-attention indexer heads. Direct transfer; the architectural shape is the same. Falsifiable: >98% retention at >50% indexer-head FLOP reduction. (4) BEAM under WildClawBench native runtime. WildClawBench (05-15) is the agent benchmark that runs models inside real Docker harnesses with actual tools, and found an 18-point spread from harness choice alone. BEAM's 98% retention is reported on standard benchmarks; whether it holds under native-runtime grading is open.



ATESD: teacher exposure becomes a learnable control variable

Three days, three orthogonal axes of teacher-signal control in self-distillation. Extrapolation Cliff (05-14) gave a closed-form for when to distill versus when to use RL. SDAR (05-15) gated the student's absorption with a sigmoid over detached features. ATESD asks how much of the answer the teacher gets to see in the first place, and learns that exposure ratio.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 1. On-policy self-distillation, reasoning post-training, teacher-side control

   OPSD standard recipe          ATESD recipe
   ────────────────────          ───────────────────────────────
   Student rolls out             Student rolls out
   Teacher sees FULL reference   Beta-policy controller samples
   Teacher gives token targets   reveal ratio ∈ [0, 1]
                                  Teacher sees that fraction
                                  Teacher gives token targets

                                  hold for short window of
                                  student updates

                                  discounted learning-progress
                                  reward scores held decision by
                                  STUDENT'S FUTURE improvement,
                                  not immediate loss change

   Mismatch grows with exposure
   Full exposure not reliably    +0.95 / +2.05 / +2.33 Average@12
   the best                      over OPSD on Qwen3-{1.7B,4B,8B}
                                  AIME 24 / AIME 25 / HMMT 25

On-policy self-distillation (OPSD) is the dominant recipe for distilling reasoning ability from a strong teacher into a smaller student. The student generates a rollout (its own answer attempt), the teacher reads the reference solution along with the student's rollout, and the teacher provides token-level targets to push the student toward. Every OPSD paper the wiki has tracked assumed the teacher gets to see the full reference. ATESD ran a fixed-exposure sweep and two facts dropped out. First, full exposure is not reliably the best setting. Second, student-teacher mismatch (a measure of how far the teacher's token-level targets diverge from the student's actual probability distribution) grows monotonically as the teacher sees more privileged reasoning. The diagnosis: when the teacher reads reasoning steps far beyond the student's current competence, the targets become too strong for the student to absorb, and the student either ignores them or collapses.

The Beta-policy controller is the mechanism. A small Beta distribution is parameterized over the reveal ratio in [0, 1]. The controller observes a handful of training-state statistics, samples a reveal ratio, holds it for a short window of student updates, and the held decision is scored by a discounted learning-progress reward. The discounting matters because the immediate loss change after one decision is too noisy to credit-assign; the discounted return over the next several steps is more informative. This is the same statistical machinery that lets PPO and other RLHF policy-gradient methods stay well-defined under sparse rewards, transplanted into the distillation outer loop.
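
The sample-hold-score cycle can be sketched with the standard library alone. Several details below are assumptions chosen for illustration rather than taken from the paper: the teacher is shown a leading prefix of the reference, learning progress is measured as a drop in evaluation loss, the discount factor is 0.9, and the function names are hypothetical. The controller's own parameter update is deliberately left out because the summary above does not specify it.

```python
import random

def reveal_prefix(reference_steps, ratio):
    """Return the leading fraction of the reference that the teacher may read."""
    return reference_steps[:int(ratio * len(reference_steps))]

def discounted_progress(eval_losses, gamma=0.9):
    """Discounted sum of loss improvements observed after a held decision.

    eval_losses[0] is the loss when the decision was taken; later entries are
    the losses after each subsequent student update in the window.
    """
    reward = 0.0
    for t in range(len(eval_losses) - 1):
        reward += (gamma ** t) * (eval_losses[t] - eval_losses[t + 1])
    return reward

rng = random.Random(0)
alpha, beta = 2.0, 2.0                 # Beta policy parameters (made-up values)
ratio = rng.betavariate(alpha, beta)   # sampled reveal ratio in [0, 1]

reference = [f"step {i}" for i in range(10)]
visible = reveal_prefix(reference, ratio)  # what the teacher reads this window

# Made-up evaluation losses over the held window of student updates.
held_losses = [2.0, 1.8, 1.7, 1.7]
reward = discounted_progress(held_losses)
print(len(visible), round(reward, 4))
```

Scoring the whole window rather than the first step is the design choice the paragraph above motivates: a single loss delta is noisy, while the discounted sum credits a reveal ratio for improvement that shows up a few updates later.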

The connection to the wiki's running thread is now precise. There are three orthogonal axes of teacher-signal control. Extrapolation Cliff (05-14) is the closed-form predictor: given three observables (the student's per-token probability of the correct continuation, the upper-bound clip ratio in the PPO update, and the format-collapse threshold), there is a closed-form threshold λ-star(p, b, c) above which uniform OPD breaks. The paper used this to pre-register binary predictions on Amazon Fashion data and the predictions landed in their locked windows. SDAR (05-15) is the student-side gate: a sigmoid over detached token-level features decides whether to attenuate a given teacher rejection or strengthen a positive-gap target. Used as a gated auxiliary inside multi-turn RL, SDAR delivered +9.4% on ALFWorld and similar gains on Search-QA and WebShop over GRPO (Group Relative Policy Optimization, the lightweight RL recipe most reasoning post-training pipelines now use). ATESD today is the teacher-side knob: the teacher's information advantage is modulated on the teacher side. None of the three papers references the other two. The joint composition has not been written. The natural framing is that Cliff selects the branch (whether to distill at all on this batch), ATESD tunes the teacher's exposure within OPD, and SDAR gates the student's absorption.

Why it matters: OPSD is in every modern reasoning model's training pipeline. Every one of them has been silently leaving improvement on the table by giving the teacher full reference exposure. ATESD's gains (+0.95 to +2.33 Average@12 across three model sizes on the AIME and HMMT math olympiad benchmarks) are consistent enough to suggest the effect is structural. Exposure scheduling will be a default in the next generation of distillation pipelines.

Research angle: (1) Cliff-derived closed form for optimal exposure. Is there a formula in Cliff's three observables that recovers ATESD's learned controller within 0.5 Average@12? Falsifiable. (2) ATESD + SDAR joint formulation. Orthogonal axes; the composition has not been written. (3) ATESD for cross-modal distillation. DiffusionOPD (05-15) lifted OPD into continuous-state diffusion models for text-to-image; ATESD's reveal-ratio extends naturally to image-token grouping. (4) Curriculum-effect check. Does the learned exposure trajectory look curriculum-like? SU-01 (05-15) used a reverse-perplexity curriculum on SFT data to instill proof-search behavior in a 30B-A3B reasoning model with only 200 RL steps; ATESD's controller may be discovering the analogous curriculum on the teacher side. One-figure ablation.



LiSA: lifelong safety adaptation with posterior-gated rule reuse

Yesterday a cluster of six papers (STALE, Preping, EvolveMem, MemEye, MemLens, BOOKMARKS) made agent task memory a programmable substrate with its own learning dynamics. LiSA pulls the same architecture into agent safety memory, with a Bayesian gate that prevents the standard "rule that worked twice gets applied a hundred times" failure mode.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 2. Guardrails, agent safety, memory-augmented adaptation

The framing problem LiSA addresses is the gap between two unsatisfying defaults in agent safety. Pre-deployment guardrails are brittle because they cannot encode site-local policies that only show up in real deployment. Repeated fine-tuning to address new policies is operationally infeasible. LiSA's third path is to treat the base guardrail as a substrate and bolt a structured memory layer on top of it. The memory converts sparse deployment failures into reusable policy abstractions, stores conflict-aware local rules to prevent overgeneralization, and gates reuse via an evidence-aware posterior lower bound rather than a point estimate of past accuracy.

The posterior-lower-bound gate is the technically interesting piece. Most memory-augmented safety baselines reuse a rule when its empirical past accuracy crosses a threshold. The known failure mode: a rule that has worked twice in two attempts has a point-estimate accuracy of 100% and gets reused everywhere, until the failure shows up on the third or thirtieth context that was actually different. A Bayesian posterior with evidence-aware confidence solves this by gating reuse on the lower bound of the credible interval, not the mean. A rule needs accumulated independent uses before its lower bound rises enough to graduate to high-confidence application. Reported results on three safety benchmarks (PrivacyLens+ on data-leakage scenarios, ConFaide+ on confidentiality reasoning, AgentHarm on tool-misuse): consistent outperformance under sparse feedback, robustness when 20% of labels are flipped, and a latency-performance frontier that extends beyond what backbone scaling delivers.
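
The "worked twice in two attempts" failure is easy to reproduce numerically. The sketch below gates reuse on a lower quantile of a Beta posterior over a rule's success rate. The uniform prior, the 5% quantile, the 0.8 reuse threshold, and the function names are all assumptions for illustration; the paper's exact posterior and gate may differ. The quantile is found by brute-force numerical integration so that only the standard library is needed.

```python
def beta_lower_bound(successes, failures, level=0.05, grid=20000):
    """Approximate the `level` quantile of a Beta(successes+1, failures+1) posterior.

    Assumes a uniform Beta(1, 1) prior on the rule's success rate and
    integrates the unnormalized density on a midpoint grid.
    """
    a, b = successes + 1, failures + 1
    xs = [(i + 0.5) / grid for i in range(grid)]
    dens = [x ** (a - 1) * (1.0 - x) ** (b - 1) for x in xs]
    total = sum(dens)
    cum = 0.0
    for x, d in zip(xs, dens):
        cum += d
        if cum / total >= level:
            return x
    return 1.0

def should_reuse(successes, failures, threshold=0.8):
    """Gate on the lower bound of the credible interval, not on the mean."""
    return beta_lower_bound(successes, failures) >= threshold

# Two wins out of two: the point estimate is 100%, but the lower bound is weak.
print(round(beta_lower_bound(2, 0), 3), should_reuse(2, 0))
# Fifty wins out of fifty: the lower bound has risen enough to allow reuse.
print(round(beta_lower_bound(50, 0), 3), should_reuse(50, 0))
```

Both rules have the same empirical accuracy, so a point-estimate threshold cannot tell them apart. Only the amount of evidence separates them, and the lower bound is what exposes it.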

The cross-paper thread is the safety-side mirror of yesterday's harness-as-load-bearing finding. WildClawBench (05-15) showed an 18-point spread between harnesses running the same model on the same tasks. LiSA suggests that guardrails are no longer monolithic per-deployment artifacts; they accumulate context across reports and have their own learning curve. The wrapper around any single model is now two programmable substrates: the eval and the safety layer, both with their own learning dynamics.

Why it matters: As agents move into tool-using deployments, guardrail failures become concrete operational harms (leaked secrets, unauthorized actions, regulatory exposure). LiSA is the first paper in the wiki that treats safety-rule reuse as an evidence-accumulation problem with proper Bayesian gating, rather than as a memory-augmented classifier. The posterior-lower-bound idea is portable and will likely appear in other places.

Research angle: (1) LiSA + AgentLens. AgentLens (05-14) is the process-aware labeling system that found 10.7% of SWE-bench Verified passes are Lucky Passes, where the right answer falls out for the wrong reasons. Apply AgentLens to LiSA-accepted decisions: how many "safe" passes are Lucky? Falsifiable: a follow-up reporting this fraction with and without the posterior gate. (2) Federated LiSA across organizations. Memory-as-policy is also memory-as-leak. Cross-org rule transfer with privacy preservation is the obvious extension. (3) LiSA composed with Ken Huang's continuous adversarial validation. The Mythos piece in this week's Gmail-starred batch describes a continuously learning offensive policy library (red-team rules accumulate the same way LiSA's blue-team rules do); the paired offensive-defensive system has not been built.



FrontierSmith and SPIN: open-ended coding data, and DAG-validated planning

Two practical agent papers in one batch. FrontierSmith generates open-ended coding training data from closed-ended seeds, with an idea-divergence filter that catches the usual mode-collapse failure. SPIN wraps the planner with a DAG (directed acyclic graph) contract and stops as soon as a partial plan answers the query. SPIN ships measurable improvement on AssetOpsBench, the same operations-agent benchmark that on 05-14 produced a -0.13 correlation between leaderboard score and the hidden "Accomplished" metric.

Sources: HuggingFace Daily Papers
Links: FrontierSmith paper · FrontierSmith wiki · SPIN paper · SPIN wiki
Tier: 2. Synthetic training data, agent planning, cost control

FrontierSmith is the third paper this week where the model writes its own training substrate. EvoEnv (05-15) constructed verifiable RL environments by generating Python programs that sample instances, compute references, and score responses, where the structural invariant is the solve-verify asymmetry (the model can write a verifier once that it cannot reliably execute by reasoning in natural language). EvolveMem (05-15) self-evolved retrieval configuration from per-question failure logs and improved LoCoMo (a long-context memory benchmark) scores by +25.7% over the strongest baseline. FrontierSmith evolves training problems from a closed-ended seed corpus by mutating goals, restricting outputs, and generalizing inputs, then pruning to high-divergence variants via a quantitative idea-divergence metric that catches near-duplicates. Agents then generate test cases and verifiers for the survivors. Qwen3.5-9B gains +8.82 on FrontierCS and +306.36 Elo on ALE-bench (a competitive-programming arena scored by Elo against other models); Qwen3.5-27B gains +12.12 and +309.12. The idea-divergence filter does the same shape of work as EvoEnv's solve-verify asymmetry: a quantitative diversity prior is the load-bearing trick, not the generation step itself.

SPIN is the deployment-side planning wrapper that addresses the brittleness of free-form LLM plans. It runs in two stages. First, the planner's output is forced into a strict DAG (directed acyclic graph) contract using a validate-and-repair prompting cycle, before any execution starts. Second, the DAG is evaluated prefix-by-prefix and execution stops the moment the current prefix already answers the query. On AssetOpsBench, the operations-agent benchmark with 261 scenarios: total executed tasks drop from 1061 to 623, the hidden Accomplished metric rises from 0.638 to 0.706, and average tool calls per run drop from 11.81 to 6.82. This is the same benchmark that on 05-14 surfaced a -0.13 correlation between the public leaderboard accuracy and the hidden Accomplished metric, meaning higher leaderboard scores correlated with worse actual accomplishment. SPIN is the first published wrapper improvement on AssetOpsBench. Whether SPIN closes that -0.13 gap (the load-bearing question) or just improves both numbers in parallel is unanswered.
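
Both stages are simple to sketch with Python's standard `graphlib` module. The plan format (task name mapped to its prerequisites), the task names, and the `execute` and `is_answered` callbacks are hypothetical stand-ins; in particular, the repair half of the validate-and-repair cycle would re-prompt the planner with the reported problems, which is not shown here.

```python
from graphlib import TopologicalSorter, CycleError

def validate_plan(plan):
    """Check that a plan is a well-formed DAG.

    plan maps each task name to the list of tasks it depends on. Returns a
    list of problem descriptions; an empty list means the contract holds.
    """
    problems = []
    for task, deps in plan.items():
        for dep in deps:
            if dep not in plan:
                problems.append(f"{task} depends on unknown task {dep}")
    try:
        TopologicalSorter(plan).prepare()
    except CycleError as err:
        problems.append(f"cycle detected: {err.args[1]}")
    return problems

def run_with_early_stop(plan, execute, is_answered):
    """Execute tasks in dependency order and stop once the query is answered."""
    results = {}
    for task in TopologicalSorter(plan).static_order():
        results[task] = execute(task, results)
        if is_answered(results):
            break  # the current prefix already answers the query
    return results

plan = {
    "fetch_sensor": [],
    "detect_anomaly": ["fetch_sensor"],
    "draft_work_order": ["detect_anomaly"],
    "email_report": ["draft_work_order"],
}
assert validate_plan(plan) == []

results = run_with_early_stop(
    plan,
    execute=lambda task, done: f"done:{task}",
    is_answered=lambda done: "detect_anomaly" in done,
)
print(list(results))  # ['fetch_sensor', 'detect_anomaly']
```

If the query only asks whether an anomaly exists, the last two tasks never run. That is the mechanism behind the drop in executed tasks and tool calls reported above.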

Why it matters: FrontierSmith addresses the "where does open-ended coding training data come from" question that has been a soft constraint on agentic post-training. SPIN addresses the "can a structured planning wrapper provide cheap wins on a leaderboard that already exists" question. Both are practical, both ship code-level mechanisms, and both fit into the agentic stack that converged yesterday (Orchard for training infrastructure, SDAR for stable multi-turn RL, EvoEnv for verifiable environments, WildClawBench for native-runtime evaluation).

Research angle: (1) FrontierSmith + EvoEnv composition. Problem synthesis with a built-in solve-verify check. Untested. (2) SPIN under WildClawBench native runtime. The 18-point harness-sensitivity number suggests SPIN's improvement may shift significantly when the harness changes. (3) Idea-divergence beyond coding. The metric is domain-general; transfer to math, scientific discovery, and agentic workflows is open.



Industry Pulse


Connecting the Dots

   Today's research (HF + Kurate + RSS)         Today's industry + social-stream
   ──────────────────────────────────────       ──────────────────────────────────────

   Training-time substrate selection:           Anthropic $900B (above OpenAI)
     Lighthouse Attention (pre-training)         ▲
     BEAM (MoE expert masks)                    │  agent-CLI as value capture:
     ATESD (teacher exposure)                   │   Microsoft pulls Claude Code
            │                                    │   GitHub Copilot CLI
            ▼                                    │   OpenAI Codex mobile
   Inference-stack substrate updates:           │   Grok Build CLI (confirmed)
     (05-15) Forcing-KV (head-role)             │
     (05-15) async continuous batching          │   harness-as-load-bearing
     (05-15) NVFP4 Kimi-K2.6                    │   thread (WildClawBench 05-15)
            │                                    │
            ▼                                    │
   Self-substrate synthesis:                    ▼
     (05-15) EvoEnv (RL envs)             Subquadratic Appen 56.2x@1M (Gmail 05-15)
     (05-15) EvolveMem (retrieval cfg)      + Lighthouse Attention (today)
     (today)  FrontierSmith (coding data)  = subquadratic-train, dense-deploy
                                             confirmed by two independent groups
   Agent safety-as-memory:
     (today) LiSA (posterior gate)       OpenAI Codex mobile
                                           + GitHub Copilot REST API
   Agent planning wrapper:               = agent work as programmable
     (today) SPIN (DAG validator)          queue, not manual session
       AssetOpsBench 0.638 → 0.706

Cross-paper thread #1: the uniform-default reform now crosses ten papers and eight stack layers. The pattern started two months ago and is now too consistent to ignore. LongAct (04-18) showed that long-context training-signal density is concentrated in the first 5% of tokens, so uniform gradient updates are wasteful and selective gradients should replace them. TIP (04-16) and Make Each Token Count (04-16, with a follow-up paper on KV cache eviction on 05-12) showed that only about 10% of teacher tokens in distillation carry real signal, so uniform token weighting is wasteful and selective weighting should replace it. Make Each Token Count (05-12) extended the same argument to the KV cache (the memory store that saves prior attention computations), proposing learned eviction policies rather than uniform retention. The Extrapolation Cliff (05-14) derived a closed-form threshold λ-star above which uniform on-policy distillation collapses, replacing the "always distill" default with a "distill only when the math says it is safe" rule, and pre-registered binary predictions on Amazon Fashion data that landed in their locked windows. SDAR (05-15) showed that uniform OPSD gating destabilizes inside multi-turn RL, so a sigmoid over detached features should gate the student's absorption selectively. Today adds three more layers. Lighthouse Attention rejects uniform pre-training attention. BEAM rejects uniform top-K MoE routing. ATESD rejects uniform teacher exposure. Ten papers, eight layers (gradient, token weight, cache eviction, distillation branch, student gate, pre-training attention, MoE expert set, teacher exposure). Any remaining "treat every X equally" default in this stack is now the obvious next target.

Cross-paper thread #2: subquadratic-train, dense-deploy is cross-source confirmed. Subquadratic, a closed-source long-context lab, announced an Appen-independently-validated 56.2x speedup over FlashAttention-2 at 1M tokens and 81.8% on SWE-bench Verified (surfaced via Gmail-starred on 05-15). Lighthouse Attention from Nous Research (today, also retweeted by @bayesiansapien from the @NousResearch announcement on the evening of 05-15) is the open-source training-side counterpart. Two different research groups, same week, same axis, similar headline structure (train with a subquadratic mechanism, ship a model that deploys with dense attention). The wiki has now seen both the proprietary deployment numbers and the open-source pre-training recipe. The joint reproduction (a Lighthouse-trained model with Subquadratic-class inference numbers) is the obvious next experiment.

Cross-paper thread #3: the routing surface now has eight internal layers plus an orthogonal profile axis. BEAM today adds the per-token expert-subset axis. The eight layers are: model-level (TraceR, 04-17, query-embedding classifier for inter-model dispatch), adapter-level (MinT, 05-14, million-scale LoRA catalog as the routing surface), expert-router-level (CaRE, 05-11, router-above-experts for task-level routing), training-time latent-code-level (DLR, 05-15, joint discrete codes + routing policy + model parameters as one training objective), cache-eviction-level (Make Each Token Count, 05-12, learned KV eviction policy), head-role-level (Forcing-KV, 05-15, static-vs-dynamic head split for video diffusion cache compression), distillation-loss-level (SDAR, 05-15, gated OPSD over detached signals), and now per-token expert-set-level (BEAM today). The orthogonal axis is RouteProfile (05-15): structured trainable profiles describing candidate models beat flat domain-level descriptions on generalization to newly added models. The composition that has not been written yet: BEAM masks that consume DLR latent codes, which consume CaRE task routers, which in turn consume RouteProfile-structured profiles. A vertically integrated routing system spanning all four components is one paper away.
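
To make the per-token expert-subset axis concrete, here is a minimal sketch that contrasts fixed top-K routing with a per-token binary mask. It is an illustration under stated assumptions, not BEAM's implementation: the function names are hypothetical, and thresholding a logit at 0.0 stands in for whatever trained mask the paper actually uses.

```python
def top_k_experts(router_scores, k):
    """Uniform default: every token activates exactly k experts."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])


def masked_experts(mask_logits, threshold=0.0):
    """Per-token subset: keep each expert whose mask logit clears the threshold.

    The fixed threshold is an assumption standing in for a trained binary mask.
    """
    return [i for i, z in enumerate(mask_logits) if z > threshold]


easy_token = [2.0, -1.0, -3.0, -0.5]   # one expert is clearly enough
hard_token = [1.5, 0.7, -0.2, 0.9]     # three experts look useful

fixed = [top_k_experts(t, k=2) for t in (easy_token, hard_token)]
adaptive = [masked_experts(t) for t in (easy_token, hard_token)]
```

With k=2 both tokens are sent to the same number of experts regardless of difficulty, whereas the mask activates one expert for the easy token and three for the hard one.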

Cross-paper thread #4: agent eval improvement on the same benchmark that surfaced the measurement crisis. On 05-14, AssetOpsBench reported a -0.13 correlation between the public-leaderboard accuracy metric and the hidden "Accomplished" metric, meaning the public number was rewarding the wrong behavior. SPIN today reports Accomplished rising from 0.638 to 0.706 on AssetOpsBench, with average tool calls per run dropping by ~42%. Whether SPIN closes the accuracy-Accomplished gap (the load-bearing question) or just improves both numbers in parallel is unanswered, and a single follow-up could settle it.
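
The arithmetic behind the SPIN numbers, and the kind of check a follow-up would need, can be sketched as follows. The two reported scores come from the text above; the per-system score lists are invented purely to show the calculation, and `pearson` is a plain textbook implementation rather than anything taken from AssetOpsBench.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Reported Accomplished scores before and after SPIN.
before, after = 0.638, 0.706
absolute_gain = after - before            # about 0.068
relative_gain = absolute_gain / before    # about 10.7%

# Hypothetical per-system scores, used only to illustrate the gap check.
public_accuracy = [0.9, 0.8, 0.7, 0.6]
accomplished = [0.5, 0.6, 0.55, 0.7]
agreement = pearson(public_accuracy, accomplished)
```

A negative `agreement`, as in this made-up data, means the leaderboard metric ranks systems in roughly the opposite order from the hidden metric. Closing the gap would show up as that correlation moving toward positive when recomputed with SPIN in the pool, not merely as both averages rising.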

Cross-paper thread #5: self-substrate synthesis is now a three-paper cluster. EvoEnv (05-15) generates verifiable RL environments where the solve-verify asymmetry is the structural invariant (the model can write a verifier once, even for tasks it cannot reliably solve in natural language on fresh instances). EvolveMem (05-15) generates retrieval configurations from failure logs and improves LoCoMo by 25.7% relative. FrontierSmith (today) generates open-ended training problems with an idea-divergence quality filter to catch near-duplicates. Three independent papers share the same architectural shape: a quantitative diversity prior (solve-verify asymmetry, AutoResearch-style diagnosis on failure logs, idea-divergence) is doing the load-bearing work, not the generation step. Pattern threshold of three crossed; this is now a cluster.

Cross-paper thread #6: memory-as-substrate extends from task memory into safety memory. Yesterday's six-paper agent-memory cluster (STALE at a 55.2% ceiling on implicit-conflict detection, EvolveMem auto-evolving retrieval configuration, Preping building memory before tasks for 2-3x lower deployment cost, MemEye and MemLens both showing multi-session multimodal capped below 30%, BOOKMARKS on storyline memory for role-play) made agent task memory a programmable substrate with its own learning dynamics. LiSA today imports the same architectural treatment into agent safety. The LiSA-specific contribution is a Bayesian posterior lower bound that gates rule reuse, so evidence accumulated across deployments matters more than point-estimate accuracy on past traces. Pair this with the Ken Huang Mythos piece from this week's Gmail-starred batch, which describes continuous adversarial validation (Claude Mythos hit 83% first-attempt exploit success and found a 27-year-old OpenBSD bug in pre-release testing). The architectural diagnosis is the same (memory-as-policy-library, continuous evidence accumulation); the stance is opposite (defense versus offense).
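
One way to see why a posterior lower bound behaves differently from a point estimate is the sketch below. It is not LiSA's actual rule, which the summary above does not specify. It assumes a Beta-Bernoulli model with a uniform prior, and the function names, the 5% quantile, and the 0.9 threshold are all hypothetical choices.

```python
from math import comb


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def posterior_lower_bound(successes, failures, q=0.05):
    """q-quantile of the Beta(successes + 1, failures + 1) posterior, by bisection."""
    a, b = successes + 1, failures + 1
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def may_reuse(successes, failures, min_safe_rate=0.9, q=0.05):
    """Gate: reuse a rule only if the lower bound clears the required rate."""
    return posterior_lower_bound(successes, failures, q) >= min_safe_rate


few_traces = may_reuse(successes=9, failures=0)     # perfect record, little evidence
many_traces = may_reuse(successes=90, failures=0)   # perfect record, more evidence
```

Both rules have a flawless observed record, so a point estimate cannot tell them apart. The lower bound can: with nine traces it sits near 0.74 and the gate stays closed, and with ninety it rises to about 0.97 and the gate opens.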

Cross-paper thread #7: industry value capture moves to the agent CLI layer. Anthropic reaching $900B and Microsoft pulling Claude Code licenses on the same day are the same event seen from two angles. Microsoft's reversal is a strategic judgment, not a quality one: if 18 points of model performance live in the harness (WildClawBench, 05-15), then harness ownership is the moat. Microsoft confirmed this with a procurement decision 48 hours after the WildClawBench paper landed. OpenAI Codex mobile, GitHub Copilot CLI, Grok Build, and Google Gemini CLI, all landing in the same week, point to the same lesson: the model API is a commodity, the agent harness is where users live, and labs are racing to own that surface.

Media-Live morning slot (2026-05-16): see morning synthesis. The strongest @bayesiansapien retweet batch in a week. Fourteen retweets, three of which directly amplify today's HuggingFace batch: the Nous Research Lighthouse Attention announcement, the "Is Grep All You Need?" paper which finds that grep-style text search inside the right coding-agent harness matches or beats embedding retrieval (a direct fit for the harness-as-load-bearing thread), and a two-paper mechanistic-interpretability cluster arguing that the standard assumption of a unique circuit per LLM task is wrong. The AI handle feed is thin (ClaudeDevs rate-limit reset, NVIDIA brand-marketing Catalyst series, WHFraudTF off-topic political content).

Yesterday afternoon and evening slot recap: the afternoon slot carried three retweets, the most substantive being the agentic-AI-as-AGI-path position paper (arXiv 2605.12966), which formalizes agency as routing across memory, reasoning, tool use, self-improvement, and alignment (directly relevant to today's BEAM + ATESD + FrontierSmith cluster). The evening slot had no @bayesiansapien retweets. @nottombrown surfaced Anthropic CFO Krishna Rao's first podcast (>500% net dollar retention, 90% of internal Anthropic code written by Claude Code, run-rate growth from $9B to $30B in one quarter), which is the direct source for the $900B-valuation news landing today.


Worth Watching


Quick Hits

OmniBoost / OmniClean (arXiv 2605.12034). Omni-modal LLMs (audio + image + text + video) are quietly inflating gains via visual shortcuts in benchmarks. The authors audit 9 omni-modal benchmarks, run visual-only probing, drop visually solvable queries, and build OmniClean (8,551 retained from 16,968 originals). On OmniClean, a three-stage post-training recipe (mixed bi-modal SFT, mixed-modality RLVR, SFT on self-distilled data) lifts a 3B Qwen2.5-Omni to match a 30B Qwen3-Omni-A3B without using a stronger omni-modal teacher. Tier 3 vision; useful as a debiased-eval template for any modality.
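
The audit step described here (probe with vision only, then drop whatever that probe solves) can be sketched as a simple filter. The record format, field name, and function name below are hypothetical; only the 8,551 and 16,968 counts come from the summary above.

```python
def drop_visually_solvable(queries, solved_by_vision_only):
    """Keep only the queries that a visual-only probe fails to answer."""
    return [q for q in queries if not solved_by_vision_only(q)]


# Hypothetical toy records; a real audit would run a model with vision input only.
queries = [
    {"id": 1, "vision_only_correct": True},
    {"id": 2, "vision_only_correct": False},
    {"id": 3, "vision_only_correct": False},
]
clean = drop_visually_solvable(queries, lambda q: q["vision_only_correct"])

# Reported counts: 8,551 retained out of 16,968 original queries.
retained_fraction = 8551 / 16968   # roughly half the queries survive
```

By the reported counts, roughly half of the original queries survive the filter, which is what makes the scores on the cleaned set meaningfully harder to inflate.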

WildTableBench (arXiv 2605.01018). 402 high-density real-world table images, 928 questions, 21 frontier multimodal models tested. Only one model crosses 50% accuracy. Structural perception and numerical reasoning are the persistent weaknesses. Continues the eval-ceiling pattern that has held for four consecutive days: every new honest measurement lens lowers the previously-reported ceiling.

FEST (arXiv 2605.15012). Few-shot demonstration-guided RLVR. 128 demonstrations randomly selected from an SFT dataset suffice when combined with on-policy signal and decaying weights. Matches full-dataset SFT-then-RLVR with orders of magnitude less data. Tier 2 with implications for cheaper RLVR pipelines.

LC-MAPF (arXiv 2605.07637). Local communication module for multi-agent pathfinding via learnable multi-round message exchange between neighboring agents. Outperforms IL/RL-only solvers and preserves scalability (typically the bottleneck of communication-based MAPF). Tier 4 robotics.

IntentVLA (arXiv 2605.14712). History-conditioned Vision-Language-Action framework: encode recent visual observations into a compact short-horizon intent representation, condition the action chunk on it. Solves the observation-aliasing problem where frame-conditioned VLAs resample inconsistent intents across replanning steps. Introduces AliasBench (12-task RoboTwin2 benchmark). Tier 4 robotics.

Pace-and-Path Correction (arXiv 2605.11459). Training-free closed-form inference-time operator for chunked-action VLAs. Decomposes the correction into a pace channel (compress along the planned direction) and a path channel (orthogonal spatial offset). Reports +28.8% and +25.9% absolute success-rate gains over foundational VLA models on MoveBench. Tier 4 robotics; the closed-form structure is similar to other training-free wrappers landing this week.
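
The geometric idea behind the two channels, splitting a correction into a part along the planned direction and a part orthogonal to it, can be sketched with a standard vector projection. This shows only the decomposition, not the paper's closed-form operator; the function name and the example vectors are hypothetical.

```python
def split_pace_and_path(correction, planned_direction):
    """Split a correction vector into along-plan and orthogonal components."""
    norm_sq = sum(d * d for d in planned_direction)
    if norm_sq == 0:
        raise ValueError("planned direction must be non-zero")
    # Scalar projection coefficient of the correction onto the planned direction.
    scale = sum(c * d for c, d in zip(correction, planned_direction)) / norm_sq
    pace = [scale * d for d in planned_direction]       # along the plan
    path = [c - p for c, p in zip(correction, pace)]    # orthogonal remainder
    return pace, path


pace, path = split_pace_and_path([3.0, 4.0, 0.0], [1.0, 0.0, 0.0])
```

The two components sum back to the original correction and are orthogonal to each other, so timing adjustments (pace) and spatial offsets (path) can be handled independently.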

PanoWorld / PhyMotion / Realiz3D / SAT3DGen / VGGT-Edit. 3D and panoramic world models for spatial reconstruction. Tier 4; skip.

ViMU (arXiv 2605.15188). Video metaphorical understanding benchmark. Tier 3 multimodal.

PRISM (arXiv 2605.15182). Prior-rectification and uncertainty-aware structure modeling for depth estimation. Tier 4.

LLM-Based Detection of Manipulative Political Narratives (arXiv 2605.14354). Reasoning-model filter + UMAP + HDBSCAN over 1.2M social-media posts identifies 41 manipulative narrative clusters. Adjacent to the responsible-ai thread. The LLM-as-classifier framing has known fragilities, and today's arXiv enforcement-tightening news is the calibration signal.

Ideology Prediction of German Political Texts (arXiv 2605.14352). DeBERTa-large achieves 0.844 F1 on the political-spectrum projection task and 0.864 accuracy on an out-of-domain X/Twitter test set. Tier 4, included for the responsible-ai cluster on LLMs in political analysis.

Does Synthetic Layered Design Data Benefit Layered Design Decomposition? (arXiv 2605.15167). Pure synthetic data beats partially-real PrismLayersPro for graphic-design decomposition. Saturation at ~50K samples. Tier 4 graphic design.

PreScam (arXiv 2605.12243). Scam progression benchmark from real-world reports; 177,989 reports filtered to 11,573 conversational scam instances. Reasoning-model labelers struggle on progression versus static scam detection. Tier 3 responsible-ai.

Algorithmic Bridge Weekly Top Picks #121 (Substack). Notable callouts in this week's roundup: "The hottest job in AI pays $630K and it's not building models" (the sales engineering and Forward Deployed Engineer pattern continues), Trump-Xi AI safety talks in Beijing, Andrew Ng on no AI jobpocalypse, frontier models fixing benchmarks instead of solving them (continues the eval-ceiling thread), 70% of Americans opposing AI datacenters near them (Gallup; pairs with Cerebras IPO timing), and Claude 4 rebutting Claude 3's case for AI consciousness.

Gary Marcus on US AI policy chaos (Marcus on AI via Gmail-starred). Marcus + Sonnenfeld + Henriques essay in Fortune: roughly 1,200 AI bills introduced, ~150 enacted, no coherent framework. Argues for a structured "which questions get asked, in what order" approach. Pairs with arXiv's enforcement crackdown today: 2026 is the year of attempted AI-policy correction, arriving mostly as patchwork.

Ken Huang on automated security validation (Mythos) (Agentic AI Substack via Gmail-starred). Detailed framing of Claude Mythos (Anthropic's April 2026 vulnerability-discovery system, which hit 83% first-attempt exploit success and found a 27-year-old OpenBSD bug in pre-release testing) and the structural collapse of the time-to-exploit window (771 days in 2018 to sub-hourly in 2024). Argument: continuous automated security validation closes the offense-defense gap that AI-driven attackers widen. Cross-pairs with today's LiSA: same diagnosis (memory-as-policy-library, continuous evidence accumulation), opposite stance (defense versus offense).

Simon Willison: iNaturalist clumper 0.1 (blog). Side-project tooling release; no AI bearing. Skip.

Reddit highlights:


Sources ingested today: HF (27 new papers, total 53 for the 2026-05-15 window), RSS (13 new posts dated 2026-05-15: 7 The Decoder, 1 Marcus on AI, 1 Algorithmic Bridge, 1 Ken Huang agentic-ai, 2 Simon Willison, 1 unattributed), Gmail (1 starred set: Marcus, Ken Huang Mythos, AI Breakfast 05-15 with Codex Mobile, Cerebras IPO, Anthropic-Gates, OpenAI-Apple), Twitter morning slot (21 tweets, 14 retweets, 11 articles) plus 2026-05-15 afternoon (8 tweets, 3 retweets) plus 2026-05-15 evening (4 tweets, 0 retweets), Kurate cs.AI + cs.LG weekly leaderboards (no rising authors crossed threshold, no HF+Kurate paper overlap this run because Kurate's window is April while HF's is May), Reddit (8 subs scraped: LocalLLaMA 8, MLScaling 5, LLMDevs 1, ControlProblem 3, CUDA 1, HPC 0, MachineLearning 0, reinforcementlearning 0), parallel Daily-Digest (none for 2026-05-16, latest is still 2026-04-23) | Wiki pages updated: 9 (6 new summaries: Lighthouse, BEAM, ATESD, LiSA, FrontierSmith, SPIN; 3 concept updates: kv-cache.md, knowledge-distillation.md, llm-routing.md)