cere-bro | 2026-05-16
Three Tier 1 papers in one batch say the same thing about three different layers of the stack. The training loop's "give it all the information uniformly" default is the new wasteful baseline. The reform is to schedule attention, experts, and teachers selectively. On the industry side, Anthropic crossed OpenAI's valuation while Microsoft pulled its internal Claude Code licenses on the same day, because the value capture point in AI is moving from the model API to the agent harness that wraps it.
TL;DR
- Lighthouse Attention (Nous Research, arXiv 2605.06554) is a wrapper that lives only during training. It pools queries, keys, and values into a small pyramid, picks a short sub-sequence with a cheap top-k pass, then calls a standard FlashAttention kernel on that short input. The model ships as plain dense attention. Result: 1.4 to 1.7 times faster training at 98K context, about 17 times faster forward and backward at 512K on a single B200 GPU. This is the first open-source training recipe on the same axis as Subquadratic (a closed lab), which on 05-15 announced a 56.2x speedup at 1M tokens with 81.8% on SWE-bench Verified.
- BEAM (arXiv 2605.14438) lets each input token decide its own expert subset inside a Mixture-of-Experts model (MoE means each token only activates a small slice of a much bigger network). Instead of a fixed top-K rule, BEAM trains a binary mask end-to-end. It ships with a custom CUDA kernel and a vLLM integration so the FLOP savings translate into real wall-clock wins. 98%+ quality retention, up to 85% reduction in MoE compute, 2.5x decoding speed, 1.4x throughput.
- ATESD (arXiv 2605.11458) is the third paper in three days reforming on-policy distillation. In on-policy distillation, a student model generates a rollout and a teacher model scores or labels it. ATESD asks how much of the reference answer the teacher gets to see before scoring, and learns this exposure ratio with a small Beta-distribution controller. Across three Qwen3 sizes (1.7B, 4B, 8B), the trained controller beats the default "show the teacher everything" setting by +0.95, +2.05, and +2.33 points averaged across the AIME 24, AIME 25, and HMMT 25 math olympiad benchmarks.
- LiSA (arXiv 2605.14454) turns guardrails into a memory layer. It accumulates safety rules from real deployments and gates rule reuse with a Bayesian posterior lower bound, which prevents the standard failure where a rule that worked twice gets applied a hundred times before someone catches that the third case was different. Outperforms strong memory baselines under sparse and 20%-label-flipped feedback on PrivacyLens+, ConFaide+, and AgentHarm.
- FrontierSmith (arXiv 2605.14445) mutates closed-ended competitive-programming problems into open-ended training variants, then filters them with an idea-divergence metric so the model is not just generating slight rephrasings. Qwen3.5-9B gains +8.82 on the FrontierCS benchmark and +306.36 Elo on ALE-bench, a competitive-programming arena. This is the third paper in a week where the model writes its own training substrate.
- SPIN (arXiv 2605.14051) wraps an agent's planner with a DAG (directed acyclic graph) contract validator and stops the agent as soon as a partial DAG already answers the user's query. On AssetOpsBench, the operations-agent benchmark that on 05-14 reported a -0.13 correlation between public-leaderboard accuracy and the hidden "Accomplished" metric, SPIN raises Accomplished from 0.638 to 0.706 and cuts tool calls from 11.81 to 6.82 per run. First published wrapper improvement on that benchmark.
- Industry pulse. Anthropic is reportedly raising $30B at a $900B valuation that for the first time exceeds OpenAI's (The Decoder). Microsoft pulled internal Claude Code licenses and is pushing developers back to its own GitHub Copilot CLI (The Decoder). OpenAI shipped Codex on iOS and Android, turning a coding agent into a mobile work queue you steer from your phone (The Decoder). Anthropic announced a $200M four-year partnership with the Gates Foundation (Anthropic). Cerebras opened public trading 89% above its IPO price (Reuters).
The Big Picture
A two-month pattern crystallized today, and the easiest way to see it is to walk through the dates. On 04-16, TIP and Make Each Token Count showed at the loss layer that only about 10% of teacher tokens in distillation carry real signal, so weighting all of them uniformly leaves accuracy on the table. On 04-18, LongAct showed something similar at the gradient layer: long-context training-signal density is concentrated in roughly the first 5% of tokens, so most of the gradient update is wasted on the rest. On 05-12, Make Each Token Count generalized that argument to the KV cache (the memory store that saves prior attention computations to avoid recomputing them) and proposed learned eviction rather than uniform retention. On 05-14, the Extrapolation Cliff paper derived a closed-form threshold above which uniform on-policy distillation collapses, replacing the "always distill" default with a "distill only when the math says it is safe" rule. On 05-15, SDAR gated the student's absorption of teacher signal with a sigmoid over detached token-level features. Seven papers across two months, each rejecting one specific "treat every X uniformly" default.
Today adds three more layers. Lighthouse Attention rejects uniform pre-training attention by pooling queries, keys, and values into a multi-resolution pyramid and routing through a tiny dense sub-sequence. BEAM rejects uniform top-K Mixture-of-Experts routing by letting each token learn its own expert subset via a trained binary mask. ATESD rejects uniform teacher exposure in self-distillation by making the reveal ratio a learnable control variable scored by a discounted learning-progress reward. The diagnosis is now identical across ten papers and eight layers of the stack: the gradient layer, the token weighting layer, the cache eviction layer, the distillation branch-selection layer, the student gating layer, the pre-training attention layer, the MoE expert-set layer, and the teacher exposure layer. Any "every X gets equal Y" default that still survives in this stack is now the obvious target for the next paper.
The second thread is the deployment substrate convergence. Lighthouse Attention on the open-source training side and Subquadratic's Appen-validated 56.2x speedup at 1M tokens on the production side (a closed lab's result with independent benchmark validation, reported via Gmail-starred on 05-15) are arriving in the same week with the same headline structure. NVIDIA's NVFP4 quantization release for Kimi-K2.6 (05-15, an open-weights 4-bit format that cuts bits per parameter without losing much accuracy) handles the per-byte axis. BEAM handles the per-token expert-set axis. Forcing-KV from 05-15 (head-role-conditioned KV cache compression for video diffusion) handles the per-head cache axis. The asynchronous continuous-batching primitive from 05-15 (which overlaps CPU prep of batch N+1 with GPU compute of batch N) handles the scheduling axis. Five independent improvements to the inference and training stack in seven days. None are model changes. If the gains are independent, they multiply. Nobody has run the joint experiment, so the back-of-envelope number is only an estimate: a 5-10x throughput improvement on the same hardware in 2026 over what was available in 2025, without any new architecture.
The third thread is industry rebalancing. Anthropic crossing OpenAI's valuation, raising another $30B for compute, and simultaneously losing Microsoft as an internal Claude Code customer is not three separate stories. It is one story from three angles. WildClawBench on 05-15 measured an 18-point spread between the worst and best agent harness running the same underlying model on the same 60 long-horizon benchmark tasks. The harness is doing more work than the model. Microsoft just decided that if 18 points of performance lives in the harness, owning the harness is strategic and renting it from a competitor is not. Five frontier labs now run their own coding-agent CLI (Anthropic's Claude Code, OpenAI's Codex with a fresh mobile app, xAI's Grok Build CLI confirmed by The Decoder this morning, Google's Gemini CLI in private beta, and Microsoft's GitHub Copilot CLI). The model API is a commodity. The agent harness is where the user lives and where the spend lands.
Deep Dives
Lighthouse Attention: training-only, kernel-decoupled long-context pre-training
A wrapper that lives only during training, runs no custom kernel, requires no auxiliary loss, and removes itself before the model ships. 1.4 to 1.7 times faster at 98K context, about 17 times faster forward and backward at 512K on a single B200 GPU.
Source: HuggingFace Daily Papers · @NousResearch retweet (2026-05-15) · r/MLScaling discussion
Links: Paper · Code · Wiki
Tier: 1. Long-context pre-training, GPU efficiency, kernel-decoupled attention
Standard SDPA at 512K:
- O(N²): FlashAttention on the full N × N attention matrix
- 1.0x training speed

Lighthouse Attention at 512K:
- Q ─► pool symmetrically ─► Q'
- K ─► pool symmetrically ─► K'
- V ─► pool symmetrically ─► V'
- Score every pyramid head
- Top-k cascade picks a hierarchical dense sub-sequence
- Sort to preserve left-to-right order
- Feed Q', K', V' through ordinary FlashAttention on the short selected sub-sequence
- Short recovery phase removes the wrapper; the model ships as standard dense attention
- 1.4-1.7x training speed at 98K, ~17x at 512K (forward+backward, single B200)
The structural novelty is in what Lighthouse refuses to do. It does not replace softmax attention with a state-space model (SSMs and linear-attention variants are alternative sequence mixers that scale better than O(N²) but produce different inductive biases). It does not write a custom sparse kernel that the GPU vendor's flagship attention kernel cannot reuse. It does not require a straight-through estimator (the trick where you backpropagate through a non-differentiable operation by pretending it is the identity), and it does not need an auxiliary loss. It is a wrapper that intercepts the attention call, runs a cheap pyramid-pooling pass plus a top-k selection, and then calls standard FlashAttention on a much shorter input. Toward the end of training it stops doing even that, and the model is left as a plain dense-attention checkpoint.
The pooling is symmetric across Q, K, and V. That is the move that earlier selective-attention work declined to make. Most prior designs pooled keys and values into a compressed memory store but kept queries at full resolution, treating the cache as an addressable index that the queries probe. Lighthouse pools all three so the queries themselves carry hierarchical structure. This is what lets the gradient-free top-k cascade learn something more interesting than a memory index. The selection becomes structural. The model learns to make queries at multiple resolutions and the cascade picks the appropriate level.
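The pool-then-select shape described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Nous Research's implementation: mean pooling, the window size, the number of levels, and the function names `mean_pool`, `build_pyramid`, and `select_top_k` are all hypothetical, and the scores are taken as given because the source does not say how pyramid heads are scored.

```python
def mean_pool(vectors, window):
    """Average consecutive groups of `window` vectors (the last group may be shorter)."""
    pooled = []
    for start in range(0, len(vectors), window):
        group = vectors[start:start + window]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def build_pyramid(vectors, window, levels):
    """Level 0 is the raw sequence; each further level pools the previous one."""
    pyramid = [vectors]
    for _ in range(levels):
        pyramid.append(mean_pool(pyramid[-1], window))
    return pyramid


def select_top_k(scores, k):
    """Keep the k highest-scoring positions, then sort them so that
    left-to-right order is preserved for the dense attention call."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


# Symmetric pooling: the same pyramid construction would be applied to Q, K, and V.
queries = [[float(i)] for i in range(8)]
pyramid = build_pyramid(queries, window=2, levels=2)
print([len(level) for level in pyramid])        # [8, 4, 2]
print(select_top_k([0.1, 0.9, 0.3, 0.7], k=2))  # [1, 3]
```

The selected positions would then index into the pooled Q', K', V' before a standard dense attention call; that step is omitted here because it is just an ordinary attention invocation on a shorter input.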
The training-only framing has an unexpected interpretability hook. Each pyramid-head selection is a routing decision the model implicitly learned. Whether those decisions correlate with content boundaries (entities, document breaks, syntax phases) is the open question. If they do, Lighthouse is a self-supervised structural prior for free, in addition to being a speedup.
Why it matters: Two weeks ago the long-context pre-training conversation had two options. Either replace softmax with a state-space model, or write a custom sparse kernel and hope the GPU vendor optimizes it. Lighthouse opens a third path: leave the kernel alone, leave the architecture alone, just change what gets fed into the kernel. The empirical numbers are still small-scale, but the design pattern is much harder to argue with. The same week, Subquadratic (a closed lab) announced a 56.2x speedup at 1M tokens, independently validated by Appen, with 81.8% on SWE-bench Verified. Two groups, same axis, same week. The proprietary deployment side and the open-source pre-training side are converging on the same recipe.
Research angle: Four open problems. (1) Lighthouse at 1B+ parameters. Nous's experiments are small-scale; the recovery-phase scaling is the load-bearing experiment. Falsifiable: a paper that runs Lighthouse at 1B+ parameters and shows the recovery-phase loss matches full-attention training on equivalent tokens. (2) Lighthouse for long-context fine-tuning. Most production deployments need long-context supervised fine-tuning, not pre-training. The wrapper should transfer naturally. Untested. (3) Top-k cascade interpretability. If pyramid heads correlate with content boundaries, Lighthouse is also a structural prior. (4) Lighthouse plus Mamba hybrid. SANA-WM (05-15) showed hybrid Mamba plus softmax works for video diffusion; Lighthouse plus Mamba alternating layers is the obvious composition.
BEAM: binary expert activation masking ships with a vLLM kernel
Each token learns its own expert subset inside a Mixture-of-Experts model, replacing the fixed top-K rule with a trainable binary mask. 98%+ quality retention at sparsity levels (up to 85% expert-FLOP reduction) where previous methods collapse. The vLLM kernel is what turns the paper into a deployment instead of a curiosity.
Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 1. MoE routing, dynamic sparsity, inference efficiency
Standard Top-K MoE:
- token ─► router ─► top-K experts (K fixed, e.g. 2)
- Redundant compute on easy tokens
- Suboptimal on hard tokens

BEAM (binary mask per token):
- token ─► gating head ─► binary mask over experts (count adapts; STE carries gradients; aux regularizer enforces budget)
- End-to-end training induces dynamic sparsity (no train-inference mismatch)
- Custom CUDA kernel + vLLM
- >98% retention
- Up to 85% MoE FLOP reduction
- 2.5x decoding speedup
- 1.4x throughput
A Mixture-of-Experts (MoE) model lets each input token activate only a small subset of specialized sub-networks rather than the full feed-forward block. Frontier MoE models like Kimi-K2, Qwen3.5-397B, and MiniMax-M2.5 all use a fixed top-K rule, with K a small constant chosen at design time. The same number of experts is activated per token regardless of how easy or hard the token is. That is wasteful on easy tokens and possibly suboptimal on hard ones. BEAM replaces top-K with a per-token binary mask learned end-to-end during training. A regularizer keeps the average count near a target budget.
The training move uses a straight-through estimator. The forward pass uses a hard binary mask. The backward pass pretends the mask was continuous so gradients can flow. This is the same trick used in binary neural networks and quantization-aware training. The novelty here is the regularizer that holds the mean activation count near the target so the model does not collapse to "use all experts everywhere" or "use no experts anywhere." The result is quality holding above 98% at sparsity levels (up to 85% reduction in MoE FLOPs) where prior methods collapse, because prior methods either retrained the full model (expensive) or applied post-hoc thresholds at inference time (severe quality drop from the mismatch between training and inference).
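A minimal sketch of the straight-through trick plus a budget penalty, written with manual forward and backward functions so it runs without an autograd library. The sigmoid gate, the 0.5 threshold, the squared-error penalty, and every function name are assumptions for illustration; the source does not specify BEAM's gating function or the form of its regularizer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ste_forward(logits, threshold=0.5):
    """Forward pass: a hard 0/1 mask over experts for one token."""
    probs = [sigmoid(z) for z in logits]
    mask = [1.0 if p > threshold else 0.0 for p in probs]
    return probs, mask


def ste_backward(probs, grad_mask):
    """Backward pass: pretend mask == prob (identity through the threshold),
    then apply the ordinary sigmoid derivative p * (1 - p)."""
    return [g * p * (1.0 - p) for g, p in zip(grad_mask, probs)]


def budget_penalty(mask, target_fraction):
    """Squared gap between the active-expert fraction and the target budget."""
    active_fraction = sum(mask) / len(mask)
    return (active_fraction - target_fraction) ** 2


logits = [2.0, -1.0, 0.5, -3.0]            # one gate logit per expert
probs, mask = ste_forward(logits)
print(mask)                                # [1.0, 0.0, 1.0, 0.0]
print(budget_penalty(mask, 0.25))          # 0.0625
grads = ste_backward(probs, [1.0] * 4)     # nonzero even where the mask is 0
```

The point of the backward function is that experts the hard mask switched off still receive a gradient signal, which is what lets the gate learn to turn them back on.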
The kernel is the second half of the contribution. Top-K MoE inference has the convenient property that you index into K specific experts and the index pattern is predictable. Dynamic-K means the index pattern changes per token. Naive implementations of this hit GPU memory-coalescing problems that destroy any FLOP win. BEAM's custom CUDA kernel uses a contiguous-memory layout that exploits the binary mask structure, and the vLLM integration is the production-side win.
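To make the indexing problem concrete, here is a generic sketch of turning per-token binary masks into a contiguous per-expert layout, so each expert reads its tokens from one contiguous slice. This is a common MoE dispatch pattern shown in plain Python, not BEAM's CUDA kernel; the function name `build_dispatch` and the offsets-array layout are assumptions.

```python
def build_dispatch(masks):
    """masks[t][e] is 1 if token t activates expert e.

    Returns (token_ids, offsets) such that the tokens routed to expert e are
    token_ids[offsets[e]:offsets[e + 1]], stored contiguously.
    """
    num_experts = len(masks[0])
    token_ids = []
    offsets = [0]
    for expert in range(num_experts):
        for token, row in enumerate(masks):
            if row[expert]:
                token_ids.append(token)
        offsets.append(len(token_ids))
    return token_ids, offsets


# Three tokens, three experts; the number of active experts differs per token.
masks = [
    [1, 0, 1],  # token 0 uses experts 0 and 2
    [0, 0, 1],  # token 1 uses expert 2 only
    [1, 1, 1],  # token 2 uses all three
]
token_ids, offsets = build_dispatch(masks)
print(token_ids)  # [0, 2, 2, 0, 1, 2]
print(offsets)    # [0, 2, 3, 6]
```

With fixed top-K every token contributes exactly K entries, so the slice sizes are predictable; with a per-token mask the slice lengths vary from batch to batch, which is the irregularity a dedicated kernel has to absorb.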
Where does BEAM sit in the routing literature the wiki has been tracking? The wiki's running question is "where is the routing decision made." Today the answers are: at the model level (TraceR, 04-17, which builds a small classifier over query embeddings to dispatch between models), at the adapter level (MinT, 05-14, which makes a million-scale LoRA adapter catalog the routing surface), at the expert-router level (CaRE, 05-11, which adds a router above existing MoE experts for task-level routing), at the post-training latent-code level (DLR, 05-15, which jointly learns discrete latent codes and routing policies as a training objective), at the cache-eviction level (Make Each Token Count, 05-12, which learns which KV entries to drop), at the head-role level (Forcing-KV, 05-15, which compresses static-vs-dynamic attention heads differently), and now at the per-token expert-subset level (BEAM today). Plus the orthogonal profile-design axis from RouteProfile (05-15, which showed structured trainable profiles for routers beat flat domain-level ones). Eight distinct addressable layers, all unaddressed two months ago.
Why it matters: Every frontier MoE serving stack runs fixed top-K. BEAM is the first paper proposing a deployable mechanism that lets the model decide K per token, with the vLLM kernel needed to realize the FLOP win. Stack BEAM on top of NVFP4 quantization (per byte) and Forcing-KV (per head) and the inference cost of frontier MoE models drops by a multiplicative factor without any model change.
Research angle: (1) BEAM + DLR composition. DLR's discrete latent codes have been shown to be causally distinct (each code drives a recognizable behavior change when ablated). If those codes drive BEAM's mask network, the per-token expert count becomes a function of the model's internal task representation. One-paper extension. (2) Train-side cost. BEAM reports inference wins but not training overhead. STE plus auxiliary regularization plausibly costs >20% in training time. If that holds, the deployment story shifts for frontier-scale runs. (3) BEAM for sparse-attention indexer heads. Direct transfer; the architectural shape is the same. Falsifiable: >98% retention at >50% indexer-head FLOP reduction. (4) BEAM under WildClawBench native runtime. WildClawBench (05-15) is the agent benchmark that runs models inside real Docker harnesses with actual tools, and found an 18-point spread from harness choice alone. BEAM's 98% retention is reported on standard benchmarks; whether it holds under native-runtime grading is open.
ATESD: teacher exposure becomes a learnable control variable
Three days, three orthogonal axes of teacher-signal control in self-distillation. Extrapolation Cliff (05-14) gave a closed-form for when to distill versus when to use RL. SDAR (05-15) gated the student's absorption with a sigmoid over detached features. ATESD asks how much of the answer the teacher gets to see in the first place, and learns that exposure ratio.
Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 1. On-policy self-distillation, reasoning post-training, teacher-side control
OPSD standard recipe:
- Student rolls out
- Teacher sees FULL reference
- Teacher gives token targets
- Mismatch grows with exposure
- Full exposure not reliably the best

ATESD recipe:
- Student rolls out
- Beta-policy controller samples reveal ratio ∈ [0, 1]
- Teacher sees that fraction
- Teacher gives token targets
- Hold for short window of student updates
- Discounted learning-progress reward scores held decision by STUDENT'S FUTURE improvement, not immediate loss change
- +0.95 / +2.05 / +2.33 Average@12 over OPSD on Qwen3-{1.7B,4B,8B} (AIME 24 / AIME 25 / HMMT 25)
On-policy self-distillation (OPSD) is the dominant recipe for distilling reasoning ability from a strong teacher into a smaller student. The student generates a rollout (its own answer attempt), the teacher reads the reference solution along with the student's rollout, and the teacher provides token-level targets to push the student toward. Every OPSD paper the wiki has tracked assumed the teacher gets to see the full reference. ATESD ran a fixed-exposure sweep and two facts dropped out. First, full exposure is not reliably the best setting. Second, student-teacher mismatch (a measure of how far the teacher's token-level targets diverge from the student's actual probability distribution) grows monotonically as the teacher sees more privileged reasoning. The diagnosis: when the teacher reads reasoning steps far beyond the student's current competence, the targets become too strong for the student to absorb, and the student either ignores them or collapses.
The Beta-policy controller is the mechanism. A small Beta distribution is parameterized over the reveal ratio in [0, 1]. The controller observes a handful of training-state statistics, samples a reveal ratio, holds it for a short window of student updates, and the held decision is scored by a discounted learning-progress reward. The discounting matters because the immediate loss change after one decision is too noisy to credit-assign; the discounted return over the next several steps is more informative. This is the same statistical machinery that lets PPO and other RLHF policy-gradient methods stay well-defined under sparse rewards, transplanted into the distillation outer loop.
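The sample, hold, and score loop can be sketched with the standard library alone. This is an illustration under assumptions, not ATESD's code: revealing a leading prefix of the reference, measuring learning progress as per-step loss decrease, the discount value, and all function names are hypothetical, and the controller's parameter update is left out.

```python
import random


def sample_reveal_ratio(alpha, beta, rng):
    """Draw a reveal ratio in [0, 1] from a Beta(alpha, beta) policy."""
    return rng.betavariate(alpha, beta)


def reveal_prefix(reference_steps, ratio):
    """Assumption: the teacher sees the leading `ratio` fraction of the reference."""
    visible = int(ratio * len(reference_steps))
    return reference_steps[:visible]


def discounted_progress(losses, gamma=0.9):
    """Score a held decision by the discounted sum of loss decreases over the
    hold window, rather than by the first (noisy) step alone."""
    total = 0.0
    for t in range(1, len(losses)):
        total += gamma ** (t - 1) * (losses[t - 1] - losses[t])
    return total


rng = random.Random(0)
ratio = sample_reveal_ratio(2.0, 2.0, rng)           # held for the whole window
visible = reveal_prefix(["step1", "step2", "step3", "step4"], ratio)
reward = discounted_progress([1.0, 0.8, 0.7], gamma=0.5)  # 0.2 + 0.5 * 0.1
```

In a full loop the reward would feed a policy-gradient update on `alpha` and `beta`, so that ratios followed by sustained student improvement become more likely.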
The connection to the wiki's running thread is now precise. There are three orthogonal axes of teacher-signal control. Extrapolation Cliff (05-14) is the closed-form predictor: given three observables (the student's per-token probability of the correct continuation, the upper-bound clip ratio in the PPO update, and the format-collapse threshold), there is a formula λ-star(p, b, c) above which uniform OPD breaks. The paper used this to pre-register binary predictions on Amazon Fashion data and the predictions landed in their locked windows. SDAR (05-15) is the student-side gate: a sigmoid over detached token-level features decides whether to attenuate a given teacher rejection or strengthen a positive-gap target. Used as a gated auxiliary inside multi-turn RL, SDAR delivered +9.4% on ALFWorld and similar gains on Search-QA and WebShop over GRPO (Group Relative Policy Optimization, the lightweight RL recipe most reasoning post-training pipelines now use). ATESD today is the teacher-side knob: the teacher's information advantage is modulated on the teacher side. None of the three papers references the other two. The joint composition has not been written. The natural framing is that Cliff selects the branch (whether to distill at all on this batch), ATESD tunes the teacher's exposure within OPD, and SDAR gates the student's absorption.
Why it matters: OPSD is in every modern reasoning model's training pipeline. Every one of them has been silently leaving improvement on the table by giving the teacher full reference exposure. ATESD's gains (+0.95 to +2.33 Average@12 across three model sizes on the AIME and HMMT math olympiad benchmarks) are consistent enough to suggest the effect is structural. Exposure scheduling will be a default in the next generation of distillation pipelines.
Research angle: (1) Cliff-derived closed form for optimal exposure. Is there a formula in Cliff's three observables that recovers ATESD's learned controller within 0.5 Average@12? Falsifiable. (2) ATESD + SDAR joint formulation. Orthogonal axes; the composition has not been written. (3) ATESD for cross-modal distillation. DiffusionOPD (05-15) lifted OPD into continuous-state diffusion models for text-to-image; ATESD's reveal-ratio extends naturally to image-token grouping. (4) Curriculum-effect check. Does the learned exposure trajectory look curriculum-like? SU-01 (05-15) used a reverse-perplexity curriculum on SFT data to instill proof-search behavior in a 30B-A3B reasoning model with only 200 RL steps; ATESD's controller may be discovering the analogous curriculum on the teacher side. One-figure ablation.
LiSA: lifelong safety adaptation with posterior-gated rule reuse
Yesterday a cluster of six papers (STALE, Preping, EvolveMem, MemEye, MemLens, BOOKMARKS) made agent task memory a programmable substrate with its own learning dynamics. LiSA pulls the same architecture into agent safety memory, with a Bayesian gate that prevents the standard "rule that worked twice gets applied a hundred times" failure mode.
Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 2. Guardrails, agent safety, memory-augmented adaptation
The framing problem LiSA addresses is the gap between two unsatisfying defaults in agent safety. Pre-deployment guardrails are brittle because they cannot encode site-local policies that only show up in real deployment. Repeated fine-tuning to address new policies is operationally infeasible. LiSA's third path is to treat the base guardrail as a substrate and bolt a structured memory layer on top of it. The memory converts sparse deployment failures into reusable policy abstractions, stores conflict-aware local rules to prevent overgeneralization, and gates reuse via an evidence-aware posterior lower bound rather than a point estimate of past accuracy.
The posterior-lower-bound gate is the technically interesting piece. Most memory-augmented safety baselines reuse a rule when its empirical past accuracy crosses a threshold. The known failure mode: a rule that has worked twice in two attempts has a point-estimate accuracy of 100% and gets reused everywhere, until the failure shows up on the third or thirtieth context that was actually different. A Bayesian posterior with evidence-aware confidence solves this by gating reuse on the lower bound of the credible interval, not the mean. A rule needs accumulated independent uses before its lower bound rises enough to graduate to high-confidence application. Reported results on three safety benchmarks (PrivacyLens+ on data-leakage scenarios, ConFaide+ on confidentiality reasoning, AgentHarm on tool-misuse): consistent outperformance under sparse feedback, 20%-label-flip robustness, and latency-performance frontier extension beyond what backbone scaling delivers.
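The difference between gating on a point estimate and gating on a credible-interval lower bound is easy to check numerically. The sketch below assumes a Beta-Bernoulli model with a uniform Beta(1, 1) prior, a 5% lower quantile, and a 0.9 reuse threshold; none of those choices come from the LiSA paper, and `beta_lower_bound` is a hypothetical helper that finds the quantile by simple numerical integration.

```python
import math


def beta_lower_bound(successes, failures, q=0.05, prior_a=1.0, prior_b=1.0, grid=20000):
    """Approximate the q-quantile of the Beta posterior over a rule's accuracy
    by accumulating the density on a midpoint grid until the CDF reaches q."""
    a = prior_a + successes
    b = prior_b + failures
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    step = 1.0 / grid
    cdf = 0.0
    for i in range(grid):
        x = (i + 0.5) * step
        log_pdf = log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x)
        cdf += math.exp(log_pdf) * step
        if cdf >= q:
            return x
    return 1.0


def should_reuse(successes, failures, threshold=0.9):
    """Gate on the lower bound, not on successes / (successes + failures)."""
    return beta_lower_bound(successes, failures) >= threshold


# Both rules have a point-estimate accuracy of 100%.
print(beta_lower_bound(2, 0))    # about 0.37: two wins are weak evidence
print(beta_lower_bound(30, 0))   # about 0.91: thirty wins clear the gate
print(should_reuse(2, 0), should_reuse(30, 0))  # False True
```

Under these assumptions a rule with a perfect two-for-two record stays gated, while the same perfect record over thirty uses graduates.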
The cross-paper thread is the safety-side mirror of yesterday's harness-as-load-bearing finding. WildClawBench (05-15) showed an 18-point spread between harnesses running the same model on the same tasks. LiSA suggests that guardrails are no longer monolithic per-deployment artifacts; they accumulate context across reports and have their own learning curve. The wrapper around any single model is now two programmable substrates: the eval and the safety layer, both with their own learning dynamics.
Why it matters: As agents move into tool-using deployments, guardrail failures become concrete operational harms (leaked secrets, unauthorized actions, regulatory exposure). LiSA is the first paper in the wiki that treats safety-rule reuse as an evidence-accumulation problem with proper Bayesian gating, rather than as a memory-augmented classifier. The posterior-lower-bound idea is portable and will likely appear in other places.
Research angle: (1) LiSA + AgentLens. AgentLens (05-14) is the process-aware labeling system that found 10.7% of SWE-bench Verified passes are Lucky Passes, where the right answer falls out for the wrong reasons. Apply AgentLens to LiSA-accepted decisions: how many "safe" passes are Lucky? Falsifiable: a follow-up reporting this fraction with and without the posterior gate. (2) Federated LiSA across organizations. Memory-as-policy is also memory-as-leak. Cross-org rule transfer with privacy preservation is the obvious extension. (3) LiSA composed with Ken Huang's continuous adversarial validation. The Mythos piece in this week's Gmail-starred batch describes a continuously learning offensive policy library (red-team rules accumulate the same way LiSA's blue-team rules do); the paired offensive-defensive system has not been built.
FrontierSmith and SPIN: open-ended coding data, and DAG-validated planning
Two practical agent papers in one batch. FrontierSmith generates open-ended coding training data from closed-ended seeds, with an idea-divergence filter that catches the usual mode-collapse failure. SPIN wraps the planner with a DAG (directed acyclic graph) contract and stops as soon as a partial plan answers the query. SPIN ships measurable improvement on AssetOpsBench, the same operations-agent benchmark that on 05-14 produced a -0.13 correlation between leaderboard score and the hidden "Accomplished" metric.
Sources: HuggingFace Daily Papers
Links: FrontierSmith paper · FrontierSmith wiki · SPIN paper · SPIN wiki
Tier: 2. Synthetic training data, agent planning, cost control
FrontierSmith is the third paper this week where the model writes its own training substrate. EvoEnv (05-15) constructed verifiable RL environments by generating Python programs that sample instances, compute references, and score responses, where the structural invariant is the solve-verify asymmetry (the model can write a verifier once that it cannot reliably execute by reasoning in natural language). EvolveMem (05-15) self-evolved retrieval configuration from per-question failure logs and improved LoCoMo (a long-context memory benchmark) scores by +25.7% over the strongest baseline. FrontierSmith evolves training problems from a closed-ended seed corpus by mutating goals, restricting outputs, and generalizing inputs, then pruning to high-divergence variants via a quantitative idea-divergence metric that catches near-duplicates. Agents then generate test cases and verifiers for the survivors. Qwen3.5-9B gains +8.82 on FrontierCS and +306.36 Elo on ALE-bench (a competitive-programming arena scored by Elo against other models); Qwen3.5-27B gains +12.12 and +309.12. The idea-divergence filter does the same shape of work as EvoEnv's solve-verify asymmetry: a quantitative diversity prior is the load-bearing trick, not the generation step itself.
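The pruning step has the shape of a greedy diversity filter. The sketch below uses word-set Jaccard distance purely as a stand-in: the source does not define the idea-divergence metric, so the distance function, the 0.5 threshold, and the function names are all assumptions.

```python
def jaccard_distance(a, b):
    """Stand-in divergence: 1 - |shared words| / |all words|."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def filter_divergent(seed, variants, min_distance=0.5):
    """Keep a variant only if it is far enough from the seed problem and from
    every variant already kept, so near-duplicates are dropped."""
    kept = []
    for variant in variants:
        references = [seed] + kept
        if all(jaccard_distance(variant, ref) >= min_distance for ref in references):
            kept.append(variant)
    return kept


seed = "find the shortest path in a graph"
variants = [
    "find the shortest path in a weighted graph",      # slight rephrasing
    "design a scoring heuristic for packing boxes",    # different idea
]
print(filter_divergent(seed, variants))
# ['design a scoring heuristic for packing boxes']
```

A real idea-divergence metric would presumably compare solution strategies rather than surface wording; the filter structure around it would look much the same.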
SPIN is the deployment-side planning wrapper that addresses the brittleness of free-form LLM plans. It runs in two stages. First, the planner's output is forced into a strict DAG (directed acyclic graph) contract using a validate-and-repair prompting cycle, before any execution starts. Second, the DAG is evaluated prefix-by-prefix and execution stops the moment the current prefix already answers the query. On AssetOpsBench, the operations-agent benchmark with 261 scenarios: total executed tasks drop from 1061 to 623, the hidden Accomplished metric rises from 0.638 to 0.706, and average tool calls per run drop from 11.81 to 6.82. This is the same benchmark that on 05-14 surfaced a -0.13 correlation between the public leaderboard accuracy and the hidden Accomplished metric, meaning higher leaderboard scores correlated with worse actual accomplishment. SPIN is the first published wrapper improvement on AssetOpsBench. Whether SPIN closes that -0.13 gap (the load-bearing question) or just improves both numbers in parallel is unanswered.
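The two stages can be sketched as a DAG check followed by early-stopping execution. This is a minimal illustration, not SPIN's implementation: the plan format (a dict from task to its dependencies), the `execute` and `is_answered` callbacks, and the function names are assumptions, and the validate-and-repair prompting cycle is reduced to raising an error that a repair step would act on.

```python
from collections import deque


def topological_order(plan):
    """Validate the DAG contract with Kahn's algorithm.

    `plan` maps each task to the list of tasks it depends on. Raises
    ValueError on an unknown dependency or a cycle; otherwise returns an
    execution order in which dependencies come first.
    """
    for task, deps in plan.items():
        for dep in deps:
            if dep not in plan:
                raise ValueError(f"{task} depends on unknown task {dep}")
    indegree = {task: len(deps) for task, deps in plan.items()}
    dependents = {task: [] for task in plan}
    for task, deps in plan.items():
        for dep in deps:
            dependents[dep].append(task)
    ready = deque(task for task, count in indegree.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in dependents[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(plan):
        raise ValueError("plan contains a cycle")
    return order


def run_until_answered(plan, execute, is_answered):
    """Execute tasks in order and stop once the results so far answer the query."""
    results = {}
    for task in topological_order(plan):
        results[task] = execute(task, results)
        if is_answered(results):
            break
    return results


plan = {"fetch": [], "summarize": ["fetch"], "plot": ["summarize"]}
results = run_until_answered(
    plan,
    execute=lambda task, done: task.upper(),   # stand-in for a tool call
    is_answered=lambda done: "summarize" in done,
)
print(list(results))  # ['fetch', 'summarize'] ('plot' is never executed)
```

Skipping tasks that come after the answering prefix is the kind of saving the reported drop in tool calls reflects, although how SPIN decides that a prefix answers the query is not described in the source.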
Why it matters: FrontierSmith addresses the "where does open-ended coding training data come from" question that has been a soft constraint on agentic post-training. SPIN addresses the "can a structured planning wrapper provide cheap wins on a leaderboard that already exists" question. Both are practical, both ship code-level mechanisms, and both fit into the agentic stack that converged yesterday (Orchard for training infrastructure, SDAR for stable multi-turn RL, EvoEnv for verifiable environments, WildClawBench for native-runtime evaluation).
Research angle: (1) FrontierSmith + EvoEnv composition. Problem synthesis with a built-in solve-verify check. Untested. (2) SPIN under WildClawBench native runtime. The 18-point harness-sensitivity number suggests SPIN's improvement may shift significantly when the harness changes. (3) Idea-divergence beyond coding. The metric is domain-general; transfer to math, scientific discovery, and agentic workflows is open.
Industry Pulse
- Anthropic raising $30B at a reported $900B valuation, surpassing OpenAI (The Decoder). Three months after the previous $30B round, the valuation jumps to $900B. Annualized revenue is approaching $45B, fivefold growth since end of 2024. The funding sits in a tight cluster with two other Anthropic moves this week: a $200M four-year partnership with the Gates Foundation across health, education, agriculture, and economic mobility (Anthropic, surfaced via AI Breakfast in Gmail-starred), and a policy paper titled "2028: Two scenarios for global AI leadership" arguing that the US and its allies may be able to lock in a 12-24 month frontier-AI lead by 2028 if China's access to advanced compute and copied model outputs is closed (Anthropic, The Decoder). The framing of Anthropic has shifted from "another AI lab" to "AI infrastructure layer," and the funding is the signal.
- Microsoft pulls Claude Code licenses, redirects developers to GitHub Copilot CLI (The Decoder). Thousands of Microsoft developers had been using Claude Code internally. Microsoft's reversal pairs directly with the WildClawBench (05-15) finding that harness choice shifts the same model by 18 points: if 18 points of performance lives in the harness, owning the harness is strategic, and renting it from a frontier competitor is not. Five frontier labs now own a coding-agent CLI (Anthropic Claude Code, OpenAI Codex desktop plus today's mobile, xAI Grok Build CLI, Google Gemini CLI in private beta, GitHub Copilot CLI). The model API is a commodity; the agent harness is where the spend lands.
- OpenAI ships Codex on iOS and Android (The Decoder; the original OpenAI announcement surfaced via AI Breakfast, Gmail-starred). The mobile app lets a user steer a Codex agent running on their machine from their phone: review diffs, approve commands, inspect screenshots, switch models. AI Breakfast's framing is correct: "AI coding is turning into a work queue." Pair this with GitHub Copilot's REST-API trigger endpoint for cloud agent tasks (GitHub blog) and Google's Genkit Middleware, and agent work is moving from manual sessions to programmatic dispatch.
- xAI's Grok Build CLI is confirmed (The Decoder). This is the RSS-side confirmation of the @xai Twitter announcement from 05-15: same launch, same framing (a catch-up move into the coding-agent space). It supports the five-lab CLI race framing.
- ChatGPT adds Plaid integration for bank-account access (The Decoder). GPT-5.5 Thinking analyzes real transactions for Pro users in the US. OpenAI's disclaimer is that it is not a licensed financial advisor. The deeper context: this is the data side of the agent-CLI thread. Once an agent has bank-account data, the harness owns the customer relationship. Pair against the Gallup poll showing 70% of Americans oppose AI datacenters near them (surfaced by Algorithmic Bridge this week); consumer-financial integrations may be a higher-risk path than infrastructure.
- Anthropic + Gates Foundation $200M partnership (Anthropic news via AI Breakfast Gmail-starred). Four-year program covering global health, education, agriculture, and economic mobility, with grants, Claude credits, technical support, and benchmarks. The civic-infrastructure framing is strategic in two directions: PR (Claude as public-good infrastructure) and revenue (Foundation grants flow back as Claude API spend).
- arXiv tightens penalties for unchecked AI-generated content in papers (The Decoder). The preprint server formalizes its enforcement on AI-generated content quality. Adjacent to today's HuggingFace paper on LLM-based detection of manipulative political narratives, which uses reasoning models as labelers; ironically, the same technique that arXiv is policing is being formalized inside arXiv's own preprint corpus.
- Cerebras IPO opens 89% above offering price, raises $5.5B (AI Breakfast Gmail-starred). 30M shares at $185. First clean public-market AI-compute bet outside NVIDIA. The opening shot in what AI Breakfast frames as a year of AI mega-IPOs. Pairs with SemiAnalysis's Cerebras IPO piece from 05-13 arguing tokens-per-dollar is the new throughput metric. The friction point is the data-center backlash Gallup just measured (70% opposed locally).
- OpenAI exploring legal options against Apple (AI Breakfast via Reuters). The Apple-OpenAI partnership is reportedly fraying. Apple Intelligence underdelivered the ChatGPT integration; OpenAI's preferred consumer surface is on a phone, but Apple no longer wants to be that surface.
- Google: AI search needs no separate SEO playbook (The Decoder). New Google documentation dismantles "generative engine optimization" and "answer engine optimization." LLMs.txt files and content chunking get explicitly debunked. Google's framing: AI search runs on the same ranking systems as traditional search. Worth tracking; pairs against the AI-search-as-distinct-surface narrative.
Connecting the Dots
Today's research (HF + Kurate + RSS)      Today's industry + social-stream
──────────────────────────────────────    ──────────────────────────────────────
Training-time substrate selection:        Anthropic $900B (above OpenAI)
  Lighthouse Attention (pre-training)              ▲
  BEAM (MoE expert masks)                          │  agent-CLI as value capture:
  ATESD (teacher exposure)                         │    Microsoft pulls Claude Code
            │                                      │    GitHub Copilot CLI
            ▼                                      │    OpenAI Codex mobile
Inference-stack substrate updates:                 │    Grok Build CLI (confirmed)
  (05-15) Forcing-KV (head-role)                   │
  (05-15) async continuous batching                │  harness-as-load-bearing
  (05-15) NVFP4 Kimi-K2.6                          │  thread (WildClawBench 05-15)
            │                                      │
            ▼                                      │
Self-substrate synthesis:                          ▼
  (05-15) EvoEnv (RL envs)                Subquadratic Appen 56.2x@1M (Gmail 05-15)
  (05-15) EvolveMem (retrieval cfg)       + Lighthouse Attention (today)
  (today) FrontierSmith (coding data)     = subquadratic-train, dense-deploy
                                            confirmed by two independent groups
Agent safety-as-memory:
  (today) LiSA (posterior gate)           OpenAI Codex mobile
                                          + GitHub Copilot REST API
Agent planning wrapper:                   = agent work as programmable
  (today) SPIN (DAG validator)              queue, not manual session
  AssetOpsBench 0.638 → 0.706
Cross-paper thread #1: the uniform-default reform now crosses ten papers and eight stack layers. The pattern started two months ago and is now too consistent to ignore. LongAct (04-18) showed that long-context training-signal density is concentrated in the first 5% of tokens, so uniform gradient updates are wasteful and selective gradients should replace them. TIP (04-16) and Make Each Token Count (04-16) showed that only about 10% of teacher tokens in distillation carry real signal, so uniform token weighting is wasteful and selective weighting should replace it. The Make Each Token Count follow-up (05-12) extended the same argument to the KV cache (the memory store that saves prior attention computations), proposing learned eviction policies rather than uniform retention. The Extrapolation Cliff (05-14) derived a closed-form threshold λ-star above which uniform on-policy distillation collapses, replacing the "always distill" default with a "distill only when the math says it is safe" rule, and pre-registered binary predictions on Amazon Fashion data that landed in their locked windows. SDAR (05-15) showed that uniform OPSD gating destabilizes inside multi-turn RL, so a sigmoid over detached features should gate the student's absorption selectively. Today adds three more layers. Lighthouse Attention rejects uniform pre-training attention. BEAM rejects uniform top-K MoE routing. ATESD rejects uniform teacher exposure. Ten papers, eight layers (gradient, token weight, cache eviction, distillation branch, student gate, pre-training attention, MoE expert set, teacher exposure). Any remaining "treat every X equally" default in this stack is now the obvious next target.
Cross-paper thread #2: subquadratic-train, dense-deploy is cross-source confirmed. Subquadratic, a closed-source long-context lab, announced a 56.2x speedup over FlashAttention-2 at 1M tokens (independently validated by Appen) and 81.8% on SWE-bench Verified (surfaced via Gmail-starred on 05-15). Lighthouse Attention from Nous Research (today, also retweeted by @bayesiansapien from the @NousResearch announcement on the evening of 05-15) is the open-source training-side counterpart. Two different research groups, same week, same axis, similar headline structure (train with a subquadratic mechanism, ship a model that deploys with dense attention). The wiki has now seen both the proprietary deployment numbers and the open-source pre-training recipe. The joint reproduction (a Lighthouse-trained model with Subquadratic-class inference numbers) is the obvious next experiment.
Cross-paper thread #3: the routing surface now has eight internal layers plus an orthogonal profile axis. BEAM today adds the per-token expert-subset axis. The eight layers are: model-level (TraceR, 04-17, query-embedding classifier for inter-model dispatch), adapter-level (MinT, 05-14, million-scale LoRA catalog as the routing surface), expert-router-level (CaRE, 05-11, router-above-experts for task-level routing), training-time latent-code-level (DLR, 05-15, joint discrete codes + routing policy + model parameters as one training objective), cache-eviction-level (Make Each Token Count, 05-12, learned KV eviction policy), head-role-level (Forcing-KV, 05-15, static-vs-dynamic head split for video diffusion cache compression), distillation-loss-level (SDAR, 05-15, gated OPSD over detached signals), and now per-token expert-set-level (BEAM today). The orthogonal axis is RouteProfile (05-15): structured trainable profiles describing candidate models beat flat domain-level descriptions on generalization to newly added models. The composition that has not been written: BEAM masks that consume DLR latent codes, which in turn consume CaRE task routers, which in turn consume RouteProfile-structured profiles. A vertically integrated routing system spanning all four components is one paper away.
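The difference between a fixed top-K rule and a per-token expert subset can be shown with a toy comparison. This sketch is not BEAM's method: BEAM trains its binary mask end-to-end, whereas here the mask is just a hard threshold on hypothetical per-token logits, and the function names, threshold, and logit values are all assumptions made for illustration.

```python
def top_k_experts(logits, k):
    """Fixed top-K routing: every token activates exactly k experts."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


def masked_experts(mask_logits, threshold=0.0):
    """Binary-mask routing: keep each expert whose mask logit clears the threshold."""
    active = [i for i, z in enumerate(mask_logits) if z > threshold]
    # Fall back to the single best expert so no token is left without compute.
    return active or [max(range(len(mask_logits)), key=lambda i: mask_logits[i])]


# Hypothetical per-token logits over four experts.
tokens = {"easy": [2.1, -1.0, -0.7, -3.0], "hard": [0.9, 0.4, -0.2, 1.3]}
for name, logits in tokens.items():
    print(name, top_k_experts(logits, k=2), masked_experts(logits))
# easy [0, 2] [0]
# hard [0, 3] [0, 1, 3]
```

Under top-K both tokens pay for two experts. Under the mask the "easy" token pays for one and the "hard" token for three, which is where per-token compute savings would come from.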
Cross-paper thread #4: agent eval improvement on the same benchmark that surfaced the measurement crisis. AssetOpsBench reported a -0.13 correlation between the public-leaderboard accuracy metric and the hidden "Accomplished" metric on 05-14. The public number was rewarding the wrong behavior. SPIN today reports Accomplished rising from 0.638 to 0.706 on AssetOpsBench, with average tool calls dropping by ~42% per run. Whether SPIN closes the accuracy-Accomplished gap (the load-bearing question) or just improves both numbers in parallel is unanswered. Falsifiable in one follow-up.
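The relative changes behind these figures can be checked directly from the reported before-and-after numbers. The helper name `relative_change` is only illustrative; the inputs are the values quoted in this digest.

```python
def relative_change(before, after):
    """Signed relative change from before to after."""
    return (after - before) / before


# (before, after) pairs as reported for SPIN on AssetOpsBench.
reported = {
    "executed tasks": (1061, 623),
    "avg tool calls per run": (11.81, 6.82),
    "Accomplished": (0.638, 0.706),
}
changes = {name: relative_change(*pair) for name, pair in reported.items()}
for name, delta in changes.items():
    print(f"{name}: {delta:+.1%}")
# executed tasks: -41.3%
# avg tool calls per run: -42.3%
# Accomplished: +10.7%
```

The tool-call figure matches the ~42% quoted above. Note that the Accomplished gain is 0.068 in absolute terms, or about 10.7% relative.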
Cross-paper thread #5: self-substrate synthesis is now a three-paper cluster. EvoEnv (05-15) generates verifiable RL environments where the solve-verify asymmetry is the structural invariant (the model can write a verifier once that it cannot reliably execute in natural language on fresh instances). EvolveMem (05-15) generates retrieval configurations from failure logs and improves LoCoMo by +25.7% relative. FrontierSmith (today) generates open-ended training problems with an idea-divergence quality filter to catch near-duplicates. Three independent papers, same architectural shape: a quantitative diversity prior (solve-verify asymmetry, AutoResearch-style diagnosis on failure logs, idea-divergence) is doing the load-bearing work, not the generation step. Pattern threshold of three crossed; this is now a cluster.
Cross-paper thread #6: memory-as-substrate extends from task memory into safety memory. Yesterday's six-paper agent-memory cluster (STALE at a 55.2% ceiling on implicit-conflict detection, EvolveMem auto-evolving retrieval configuration, Preping building memory before tasks for 2-3x lower deployment cost, MemEye and MemLens both showing multi-session multimodal capped below 30%, BOOKMARKS on storyline memory for role-play) made agent task memory a programmable substrate with its own learning dynamics. LiSA today imports the same architectural treatment into agent safety. The LiSA-specific contribution is a Bayesian posterior lower bound that gates rule reuse, so evidence accumulation across deployments matters more than point-estimate accuracy on past traces. Pair this with the Ken Huang Mythos piece from this week's Gmail-starred batch, which describes continuous adversarial validation (Claude Mythos hit 83% first-attempt exploit success and found a 27-year-old OpenBSD bug in pre-release testing). Same architectural diagnosis (memory-as-policy-library, continuous evidence accumulation), opposite stance (defense versus offense).
Cross-paper thread #7: industry value capture moves to the agent CLI layer. Anthropic at $900B with Microsoft pulling Claude Code licenses on the same day is the same event from two angles. Microsoft's reversal is a strategic judgment, not a quality one: if 18 points of model performance lives in the harness (WildClawBench, 05-15), then harness ownership is the moat. Microsoft confirmed this with a procurement decision 48 hours after the WildClawBench paper landed. OpenAI Codex mobile, GitHub Copilot CLI, Grok Build, and Google Gemini CLI all landing in the same week point to the same lesson: the model API is a commodity, the agent harness is where users live, and labs are racing to own the surface.
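One way to see why a posterior lower bound rewards accumulated evidence over point estimates is a Beta-Bernoulli toy model. Everything here is an assumption for illustration rather than LiSA's actual rule: the uniform Beta(1, 1) prior, the 5% quantile, the 0.8 reuse threshold, and the function names.

```python
from math import comb


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for integer a, b, via the binomial-tail identity."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def posterior_lower_bound(successes, failures, alpha=0.05):
    """alpha-quantile of the Beta(1 + successes, 1 + failures) posterior, by bisection."""
    a, b = 1 + successes, 1 + failures
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def may_reuse(successes, failures, min_rate=0.8):
    """Gate: reuse a stored rule only if its lower bound clears min_rate (hypothetical)."""
    return posterior_lower_bound(successes, failures) >= min_rate


print(may_reuse(1, 0))    # False: 100% point estimate, but only one observation
print(may_reuse(90, 10))  # True: 90% point estimate backed by 100 observations
```

A rule that worked once has a perfect point estimate but a lower bound of about 0.22, so it stays gated. A rule with 90 successes in 100 uses clears the gate despite the lower point estimate.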
Media-Live morning slot (2026-05-16): see morning synthesis. The strongest @bayesiansapien retweet batch in a week. Fourteen retweets, three of which directly amplify today's HuggingFace batch: the Nous Research Lighthouse Attention announcement, the "Is Grep All You Need?" paper which finds that grep-style text search inside the right coding-agent harness matches or beats embedding retrieval (a direct fit for the harness-as-load-bearing thread), and a two-paper mechanistic-interpretability cluster arguing that the standard assumption of a unique circuit per LLM task is wrong. The AI handle feed is thin (ClaudeDevs rate-limit reset, NVIDIA brand-marketing Catalyst series, WHFraudTF off-topic political content).
Yesterday afternoon and evening slot recap: the afternoon slot carried three retweets, the most substantive being the agentic-AI-as-AGI-path position paper (arXiv 2605.12966) which formalizes agency as routing across memory, reasoning, tool use, self-improvement, and alignment (directly relevant to today's BEAM + ATESD + FrontierSmith cluster). The evening slot had no @bayesiansapien retweets; @nottombrown surfaced Anthropic CFO Krishna Rao's first podcast (>500% net dollar retention, 90% of internal Anthropic code written by Claude Code, run-rate growth from $9B to $30B in one quarter) which is the direct source for the $900B-valuation news landing today.
Worth Watching
- Lighthouse Attention reproducibility at 1B+ parameters, 60 days. Nous's published numbers are small-scale; the load-bearing question is whether the brief recovery phase (where the wrapper is removed and the model trains as a plain dense-attention model for a short window) produces a checkpoint competitive with full-attention training matched on tokens. Falsifiable: a paper reporting a Lighthouse-trained 1B+ model matching full-attention training on equivalent tokens, with the recovery-phase loss curve attached.
- BEAM + DLR composition, 90 days. DLR (05-15) showed that jointly-trained discrete latent codes have causally distinct roles (ablating a code produces a recognizable behavior change). If those codes drive BEAM's binary mask network, the per-token expert count becomes a function of the model's internal task representation rather than a per-token guess. Falsifiable: a paper that ships this and reports retention above 98% at sparsity above 85%.
- ATESD + SDAR joint formulation, 60 days. ATESD modulates the teacher's information advantage on the teacher side; SDAR gates the student's absorption on the student side. Orthogonal axes. Neither paper references the other. Falsifiable: a paper reporting the joint formulation with a measurable gain over either alone on AIME 24/25 or HMMT 25.
- AssetOpsBench accuracy versus Accomplished gap with SPIN, 30 days. SPIN moves Accomplished from 0.638 to 0.706 today. Whether the public-leaderboard accuracy metric also moves, or whether the -0.13 correlation flips sign under SPIN, is one paper away. Falsifiable: a follow-up reporting both numbers with SPIN on the same scenarios.
- Cross-architecture Lighthouse for fine-tuning, 90 days. Most production deployments need long-context supervised fine-tuning rather than long-context pre-training. The Lighthouse wrapper should transfer naturally because the kernel and the model architecture are unchanged. A reproducible result on a public model is the obvious next experiment.
- Microsoft GitHub Copilot CLI's WildClawBench score, 60 days. Microsoft just made a procurement decision to swap Claude Code for Copilot CLI internally. The natural calibration is to run both inside WildClawBench's native-runtime harness. If Copilot CLI does not close at least half of the 18-point harness-spread gap to Claude Code under WildClawBench, Microsoft's swap was strategic rather than technical.
- LLM-rated underrated from Kurate (Kurate's current weekly cs.AI and cs.LG leaderboards, where papers are ranked by 3-LLM tournaments rather than upvotes): cs.AI #11 "Hodoscope: Unsupervised Monitoring for AI Misbehaviors" (ai_rating 7.2, by Ziqian Zhong and Aditi Raghunathan, an unsupervised method for flagging anomalous model behavior in deployment). cs.AI #13 "Emotion Concepts and their Function in a Large Language Model" (ai_rating 8.2, by William Saunders and Tom Henighan, recurring from last week; the structural absence from HuggingFace is now visible). cs.AI #19 "Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation" (ai_rating 5.5 but Tier 1 by topic; argues unsafe behaviors transfer through distillation without explicit demonstration, directly relevant to today's ATESD). cs.LG #9 "LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking" (ai_rating 6.8, on RLVR or reinforcement learning with verifiable rewards, and the failure mode where the policy games the verifier instead of solving the task). cs.LG #11 "LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit" (ai_rating 8.0, recurring from last week; argues a single shared circuit drives both sycophancy and confabulation; one mechanistic intervention may address two failure modes). cs.LG #19 "Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling" (Tier 2 but directly relevant to the attention-memory scheduling axis Lighthouse opens today). Six papers Kurate-rated above HuggingFace visibility this week; the sycophancy-lying shared-circuit one is the most actionable for interpretability.
- Rising authors from Kurate: no authors crossed threshold this week. Last week also produced none. The threshold (≥3 top-10 appearances in past 4 weeks, score ≥15) may be calibrated too high for the leaderboard cadence. Worth reviewing in connectors/kurate/farmer.py.
- Cross-source confirmation note (HF + Kurate): today's HuggingFace batch and the current Kurate weekly leaderboards have no direct paper overlap (Kurate's window covers April papers; HF's covers May). The cross-source-confirmed Tier 1 promotion rule did not fire this run.
Quick Hits
OmniBoost / OmniClean (arXiv 2605.12034). Omni-modal LLMs (audio + image + text + video) are quietly inflating gains via visual shortcuts in benchmarks. The authors audit 9 omni-modal benchmarks, run visual-only probing, drop visually solvable queries, and build OmniClean (8,551 retained from 16,968 originals). On OmniClean, a three-stage post-training recipe (mixed bi-modal SFT, mixed-modality RLVR, SFT on self-distilled data) lifts a 3B Qwen2.5-Omni to match a 30B Qwen3-Omni-A3B without using a stronger omni-modal teacher. Tier 3 vision; useful as a debiased-eval template for any modality.
WildTableBench (arXiv 2605.01018). 402 high-density real-world table images, 928 questions, 21 frontier multimodal models tested. Only one model crosses 50% accuracy. Structural perception and numerical reasoning are the persistent weaknesses. Continues the eval-ceiling pattern that has held for four consecutive days: every new honest measurement lens lowers the previously-reported ceiling.
FEST (arXiv 2605.15012). Few-shot demonstration-guided RLVR. 128 demonstrations randomly selected from an SFT dataset suffice when combined with on-policy signal and decaying weights. Matches full-dataset SFT-then-RLVR with orders of magnitude less data. Tier 2 with implications for cheaper RLVR pipelines.
LC-MAPF (arXiv 2605.07637). Local communication module for multi-agent pathfinding via learnable multi-round message exchange between neighboring agents. Outperforms IL/RL-only solvers and preserves scalability (typically the bottleneck of communication-based MAPF). Tier 4 robotics.
IntentVLA (arXiv 2605.14712). History-conditioned Vision-Language-Action framework: encode recent visual observations into a compact short-horizon intent representation, condition the action chunk on it. Solves the observation-aliasing problem where frame-conditioned VLAs resample inconsistent intents across replanning steps. Introduces AliasBench (12-task RoboTwin2 benchmark). Tier 4 robotics.
Pace-and-Path Correction (arXiv 2605.11459). Training-free closed-form inference-time operator for chunked-action VLAs. Decomposes into a pace channel (compress along planned direction) and a path channel (orthogonal spatial offset). +28.8% and +25.9% absolute success-rate over foundational VLA models on MoveBench. Tier 4 robotics; the closed-form structure is similar to other training-free wrappers landing this week.
PanoWorld / PhyMotion / Realiz3D / SAT3DGen / VGGT-Edit. 3D and panoramic world models for spatial reconstruction. Tier 4; skip.
ViMU (arXiv 2605.15188). Video metaphorical understanding benchmark. Tier 3 multimodal.
PRISM (arXiv 2605.15182). Prior-rectification and uncertainty-aware structure modeling for depth estimation. Tier 4.
LLM-Based Detection of Manipulative Political Narratives (arXiv 2605.14354). Reasoning-model filter + UMAP + HDBSCAN over 1.2M social-media posts identifies 41 manipulative narrative clusters. Adjacent to the responsible-ai thread. The LLM-as-classifier framing has known fragilities, and today's arXiv enforcement-tightening news is the calibration signal.
Ideology Prediction of German Political Texts (arXiv 2605.14352). DeBERTa-large achieves 0.844 F1 on the political-spectrum projection task; out-of-domain X-Twitter test ACC 0.864. Tier 4, included for the responsible-ai cluster on LLMs in political analysis.
Does Synthetic Layered Design Data Benefit Layered Design Decomposition? (arXiv 2605.15167). Pure synthetic data beats partially-real PrismLayersPro for graphic-design decomposition. Saturation at ~50K samples. Tier 4 graphic design.
PreScam (arXiv 2605.12243). Scam progression benchmark from real-world reports; 177,989 reports filtered to 11,573 conversational scam instances. Reasoning-model labelers struggle on progression versus static scam detection. Tier 3 responsible-ai.
Algorithmic Bridge Weekly Top Picks #121 (Substack). Notable callouts in this week's roundup: "The hottest job in AI pays $630K and it's not building models" (sales engineering and Forward Deployed Engineer pattern continues), Trump-Xi AI safety talks in Beijing, Andrew Ng on no AI jobpocalypse, frontier models fixing benchmarks instead of solving them (continues the eval-ceiling thread), 70% American opposition to AI datacenters near them (Gallup; pairs with Cerebras IPO timing), Claude 4 rebutting Claude 3's case for AI consciousness.
Gary Marcus on US AI policy chaos (Marcus on AI via Gmail-starred). Marcus + Sonnenfeld + Henriques essay in Fortune: roughly 1,200 AI bills introduced, ~150 enacted, no coherent framework. Argues for a structured "which questions get asked, in what order" approach. Pairs with arXiv's enforcement crackdown today: 2026 is the year of attempted AI-policy correction, arriving mostly as patchwork.
Ken Huang on automated security validation (Mythos) (Agentic AI Substack via Gmail-starred). Detailed framing of Claude Mythos (Anthropic's April 2026 vulnerability-discovery system, which hit 83% first-attempt exploit success and found a 27-year-old OpenBSD bug in pre-release testing) and the structural collapse of the time-to-exploit window (771 days in 2018 to sub-hourly in 2024). Argument: continuous automated security validation closes the offense-defense gap that AI-driven attackers widen. Cross-pairs with today's LiSA: same diagnosis (memory-as-policy-library, continuous evidence accumulation), opposite stance (defense versus offense).
Simon Willison: iNaturalist clumper 0.1 (blog). Side-project tooling release; no AI bearing. Skip.
Reddit highlights:
- r/LocalLLaMA: Orthrus-Qwen3-8B claims up to 7.8x tokens-per-forward-pass with provably identical output distribution (thread). Direct practitioner confirmation of the Orthrus paper from 05-14, which trained an autoregressive head and a diffusion head to share one KV cache and achieve consensus that makes the output bit-identical to plain AR generation. 7.8x speedup is the high end; production-grade reproduction is now public.
- r/LocalLLaMA: Qwen3.6-35B-A3B with multi-token prediction, million-token test (thread). Practitioner runs three sessions on an AMD RDNA 4 R9700 with 32GB VRAM. 1.5x tokens-per-second vs prior tests. 300K context with KV cache quantized to Q4 (a 4-bit numeric format that halves memory vs Q8). Practitioner calls multi-token prediction "100% game changer for local LLMs." Confirms the multi-token-prediction trend.
- r/LocalLLaMA: China-modded 4090 48GB practitioner-research call (thread). Substantive call for collective research on gray-market Chinese cards that double the standard 24GB VRAM of an RTX 4090. Connects to the Anthropic 2028 paper framing that China is using "smuggled chips, offshore data centers, distillation attacks." The same hardware is both an export-controls evasion route and the open-source local-inference dream.
- r/MLScaling: Lighthouse Attention discussion thread surfaces Nous's paper before its HuggingFace appearance (thread). Independent surfacing path; the practitioner sub caught the paper before the HuggingFace Daily Papers index. Worth tracking as a forward-signal channel.
- r/MLScaling: Prime Intellect auto-nanoGPT (thread). Autonomous AI research scaled to 14K GPU-hours surpasses human state-of-the-art on the nanoGPT speedrun, but with "lack of novel ideas." Continues the AI-doing-AI-research thread that the @ZabihullahAtal retweet on @bayesiansapien this morning surfaced (the Stanford + OpenAI + DeepMind + Anthropic interview paper on the same topic).
- r/ControlProblem: Sanders + AOC introduce a bill to pause all AI datacenter construction (thread). Score 98. Direct policy pairing with Gary Marcus's chaos-of-1,200-bills piece and the Gallup 70%-opposition number. Falsifiable hypothesis: this bill does not pass, but the political pressure shows up as zoning friction.
Sources ingested today: HF (27 new papers, total 53 for the 2026-05-15 window), RSS (13 new posts dated 2026-05-15: 7 The Decoder, 1 Marcus on AI, 1 Algorithmic Bridge, 1 Ken Huang agentic-ai, 2 Simon Willison, 1 unattributed), Gmail (1 starred set: Marcus, Ken Huang Mythos, AI Breakfast 05-15 with Codex Mobile, Cerebras IPO, Anthropic-Gates, OpenAI-Apple), Twitter morning slot (21 tweets, 14 retweets, 11 articles) plus 2026-05-15 afternoon (8 tweets, 3 retweets) plus 2026-05-15 evening (4 tweets, 0 retweets), Kurate cs.AI + cs.LG weekly leaderboards (no rising authors crossed threshold, no HF+Kurate paper overlap this run because Kurate's window is April while HF's is May), Reddit (8 subs scraped: LocalLLaMA 8, MLScaling 5, LLMDevs 1, ControlProblem 3, CUDA 1, HPC 0, MachineLearning 0, reinforcementlearning 0), parallel Daily-Digest (none for 2026-05-16, latest is still 2026-04-23) | Wiki pages updated: 9 (6 new summaries: Lighthouse, BEAM, ATESD, LiSA, FrontierSmith, SPIN; 3 concept updates: kv-cache.md, knowledge-distillation.md, llm-routing.md)