cere-bro | 2026-05-17

The HuggingFace feed largely re-surfaced yesterday's Tier 1 batch, but the real signal landed elsewhere. The Kurate weekly leaderboard surfaced an ai_rating 9.0/10 paper from Gatsby UCL deriving the first principled scaling theory for Mixture-of-Experts (MoE) models. Sebastian Raschka cataloged the architectural diversity shipping in the May 2026 open-model wave (Gemma 4, DeepSeek V4, Laguna XS.2, ZAYA1, Kimi K2.6). MTP (Multi-Token Prediction) decoding merged into llama.cpp, and r/LocalLLaMA reports a ~2.4x generation-throughput speedup on Qwen3.6-27B in 5-turn chat. The Twitter morning slot is the quietest of the week (two sparse tweets, zero retweets, zero articles).

TL;DR

MoE-muP (Vankadara et al., Gatsby UCL plus Amazon plus Tübingen, arXiv 2605.14200, Kurate cs.LG #13 with ai_rating 9.0/10). First principled scaling theory for Mixture-of-Experts. Extends the Maximal Update Parameterization (muP, the rule that lets dense-model hyperparameters transfer across width) to MoE by deriving the Maximally Scale-Stable Parameterization (MSSP) across all five axes: number of experts M, expert width Ne, routing sparsity K, network width N, depth L. The headline claim is that muP alone is not sufficient inside MoE because the router, expert update statistics, and active-parameter count couple non-commutatively across the new axes. The recipe is closed-form and covers a broader range of optimizers than concurrent work.
Raschka's architecture survey (Gmail-starred Ahead of AI). Catalogs four moves shipping in May's open-model wave: Gemma 4's KV sharing (later layers reuse earlier layers' K and V projections, halving cache size, 2.7 GB saved at 128K context on the 2B model) plus per-layer embeddings (PLE: small per-layer embedding tables that lift effective parameter count without growing the transformer stack); Laguna XS.2's layer-wise attention budgeting; ZAYA1-8B's compressed convolutional attention; DeepSeek V4's mHC (multi-Head Compression) plus compressed attention. Every move is a long-context-efficiency play.
MTP merged in llama.cpp. PR #22673 lands Multi-Token Prediction (a speculative-decoding family where the model emits multiple draft tokens per step). Strix Halo benchmarks on Qwen3.6-27B at 5-turn chat with ~28.5K context: total wall-clock 258.65s drops to 200.55s (-22.46%); generation throughput +136%, prompt processing -18%. At 35B (single-turn, 15K context) the picture flips: total wall-clock is ~11% slower because the prompt-processing regression outweighs the generation gain. Speculative decoding now has a confirmed consumer-hardware envelope.
CurveBench (arXiv 2605.14068). 756 images of nested Jordan curves (closed non-intersecting planar curves); the task is to recover the rooted containment tree. Gemini 3.1 Pro reaches 71.1% on Easy and 19.1% on Hard. RLVR fine-tuning of Qwen3-VL-8B lifts CurveBench-Easy from 2.8% to 33.3%, exceeding GPT-5.4 and Claude Opus 4.5 under the same evaluation protocol.
LIFE survey (arXiv 2605.14892). 200+ paper survey of LLM multi-agent systems organised along four causally linked stages: Lay capability foundation, Integrate collaboration, Find faults via attribution, Evolve via self-improvement. Names error propagation across agents and rounds as the central under-examined risk.
Industry pulse. The May 2026 open-model wave is now visible as one event: Gemma 4 (Google, Apache 2.0, four sizes including 26B-A4B MoE), Kimi K2.6 (moonshotai, long-horizon focused), GLM-5.1, Qwen3.6 (35B-A3B variant open), Laguna XS.2 (poolside, 33B-A3B coding), MiMo-V2.5-Pro (Xiaomi, Apache 2.0), DeepSeek V4 (Pro 1.6T-A49B and Flash 284B-A13B). Fireworks adds full-parameter Kimi K2.6 tuning at 256K context, GLM 5.1 LoRA RL at 200K, Qwen3.6 27B managed fine-tuning at 128K and 256K, and Gemma 4 dense full-param plus LoRA RL at 256K. CAISI's IRT-Elo leaderboard reports a widening open-closed gap; Florian inside Interconnects argues a substantial fraction is harness artifact (running open models without their preferred coding agent).

The Big Picture

The headline of the week was supposed to be the 17-May HuggingFace top, but in the strict ingest-discipline sense most of today's batch is yesterday's batch. The HF feed re-surfaced BEAM (binary expert activation masking, the per-token learned expert-mask MoE paper covered in 05-16's digest), Lighthouse Attention (the training-only kernel-decoupled long-context wrapper from Nous Research covered on 05-16), ATESD (the Beta-policy controller that makes teacher-exposure-ratio learnable in self-distillation, 05-16), LiSA (the posterior-gated safety-rule memory layer, 05-16), FrontierSmith and SPIN (the agentic data and planning wrappers, 05-16), DLR and RouteProfile and Forcing-KV (the routing surface trio, 05-15), and the agent-memory cluster of STALE-Preping-EvolveMem-MemEye-MemLens-BOOKMARKS (05-15). Treat that as continuing thread context; the substantive new signal landed on three other surfaces.

The first surface is Kurate. The cs.LG weekly leaderboard surfaced an ai_rating 9.0/10 paper by Vankadara et al. (Gatsby UCL plus Amazon plus Tübingen) deriving the first principled scaling theory for Mixture-of-Experts. The wiki's running thread on MoE routing (BEAM 05-16 making the K-per-token mask learnable, DLR 05-15 jointly training discrete latent codes and routing policy, CaRE 05-11 routing above existing MoE experts, RouteProfile 05-15 making the candidate-model description structured) has been operating without a theoretical foundation. Vankadara et al. supply it. The closed-form MSSP prescription tells you how to scale the number of experts M, the expert width Ne, the routing sparsity K, the network width N, and the depth L jointly so that pre-training dynamics stay scale-invariant. This is the same economic logic muP (the dense-model scale-transfer rule) delivered for dense pre-training in 2022: tune at 1B, train at 1T, no re-sweep at the target scale. If the MSSP recipe transfers cleanly to a frontier-scale MoE run, the cost-of-discovery for new MoE architectures drops by roughly an order of magnitude.

The second surface is Gmail-starred. Sebastian Raschka's Ahead of AI post (the most substantive Gmail item in two weeks) is the architectural census of the May open-model wave. Gemma 4 ships KV sharing (later layers reuse earlier layers' K and V projections of the same attention type, halving cache size for ~2.7 GB savings on the 2B model at 128K context) plus per-layer embeddings (the "E" in E2B and E4B stands for "effective": small per-layer embedding tables lift effective parameter count without growing the transformer stack's compute). Laguna XS.2 from poolside ships layer-wise attention budgeting, recognising that different layers contribute differently to long-range mixing. ZAYA1-8B replaces a fraction of softmax attention with compressed convolutional attention. DeepSeek V4 ships mHC (multi-Head Compression, aggressive per-head output compression along the residual) plus compressed attention. Every move spends architectural novelty on long-context efficiency, not on raw capability. This is the empirical counterpart to the theoretical claim in Vankadara et al.: the open frontier has been operating on architectural folk knowledge, converged on similar moves anyway, and is now ready for the principled recipe.

The third surface is r/LocalLLaMA and r/CUDA. MTP (Multi-Token Prediction, the speculative-decoding family that emits multiple draft tokens per step from one forward pass) merged into llama.cpp upstream this week. Strix Halo benchmarks on Qwen3.6 confirm what frontier papers predicted on H100 hardware: at 27B, a 5-turn chat with ~28.5K context drops from 258.65s to 200.55s wall-clock and generation throughput jumps 136%, but prompt processing regresses ~18%. At 35B the regression in prompt processing dominates and total wall-clock slips ~11%. As a result, the consumer-hardware envelope of speculative decoding now has a concrete crossover point: roughly 30B, with the outcome depending on whether the workload is multi-turn or single-turn. Cutile-rs released a beta on r/CUDA the same week, claiming (per the author) peak performance on B200 Blackwell and cleaner syntax than Triton. Two pieces of the inference stack moved from research to production in a week.

Deep Dives

MoE-muP: principled scaling for Mixture-of-Experts

The first scaling-law framework that applies to the architectures shipping in 2026.

Source: Kurate cs.LG weekly leaderboard #13 (TrueSkill 1611, win rate 85%, ai_rating 9.0/10)
Links: Paper · Wiki
Tier: 1 (routing, MoE scaling, hyperparameter transfer)

   Dense pre-training scaling (muP, 2022)        MoE pre-training scaling (MSSP, 2026)
   ────────────────────────────────────────      ──────────────────────────────────────
   axes: width N, depth L                        axes: width N, depth L, experts M,
                                                       expert width Ne, sparsity K

   muP rule: re-parametrize so feature           muP alone breaks because:
     dynamics stay invariant under width            - router gradients miscalibrate
     scaling. Optimal hyperparams transfer.         - active-param count drifts
                                                    - non-commutative co-scaling

   tune at 1B -> train at 1T, no re-sweep        MSSP: derive closed-form initialization,
                                                  learning rate, weight decay, routing
                                                  temperature as functions of (M, Ne, K,
                                                  N, L). Covers SGD, Adam, Adafactor.
                                                  Tune at small proxy scale -> train at
                                                  full scale, no re-sweep.

The paper makes three load-bearing claims. First, naively applying width-muP to expert blocks leaves the router undertrained because the router's gradient norm scales differently with M (number of experts) than the expert blocks' gradient norms scale with Ne (per-expert width). The paper identifies which specific muP prescriptions fail and where the failure surfaces in the optimizer-state interaction. Second, the combinatorial co-scaling space of (M, Ne, K, N, L) is non-commutative: scaling experts first then expert width does not produce the same trained model as scaling expert width first then experts, even at matched FLOPs. The authors use Dynamical Mean Field Theory (DMFT, the mean-field tool that lets you analyse training trajectories as scale-invariant dynamics) to characterise the distinct fixed points across regimes. Third, the resulting MSSP prescription is closed-form across a broader set of optimizers than the concurrent Jiang et al. 2026 proposal, and it repairs muP shortcomings specific to MoEs that the concurrent paper does not address.
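For orientation, the dense-model rule that MSSP extends can be sketched in a few lines. This is a minimal illustration of width-only muP for hidden weight matrices under Adam (learning rate shrinking with the width multiplier, initialization standard deviation following 1/sqrt(fan_in)). It is not the MSSP prescription, whose closed-form expressions are not reproduced in this digest. The function name and the base values are hypothetical.

```python
import math

def mup_hidden_hparams(base_lr: float, base_width: int, width: int) -> dict:
    """Width-only muP for a hidden weight matrix trained with Adam.

    Illustrative only: base_lr and base_width stand for whatever was tuned
    on the small proxy model; nothing here is the paper's MSSP recipe.
    """
    width_mult = width / base_width
    return {
        # Adam learning rate for hidden matrices shrinks as 1 / width multiplier.
        "lr": base_lr / width_mult,
        # Initialization std follows 1 / sqrt(fan_in), with fan_in = width.
        "init_std": 1.0 / math.sqrt(width),
    }

# Hypothetical proxy: tuned at width 256, transferred to width 4096.
proxy = mup_hidden_hparams(base_lr=1e-2, base_width=256, width=256)
target = mup_hidden_hparams(base_lr=1e-2, base_width=256, width=4096)
print(proxy, target)
```

Per the paper's framing, MSSP supplies the analogous rules for the axes this sketch ignores: expert count M, expert width Ne, routing sparsity K, and depth L.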

The wiki's running MoE routing thread now has its theoretical foundation. BEAM (summary, the 05-16 paper that replaced fixed top-K MoE routing with a per-token learned binary mask trained end-to-end via straight-through estimator) optimises K within a fixed pre-training scale. MoE-muP makes the pre-training scale itself principled. The orthogonal axes are now visible: BEAM tunes the per-token K decision; DLR (summary, the 05-15 paper that jointly trains discrete latent codes and routing policy with causally distinct ablation effects) tunes how the routing policy is learned; MoE-muP tunes the scaling of M, Ne, K, N, L jointly. The composition that has not been written: a frontier-scale MoE pre-trained under the MSSP prescription, with BEAM-style per-token masks tracking DLR-style learned latent codes.

The empirical evidence the paper was waiting for arrived in the same week. Raschka's architecture survey (Gmail-starred, Wiki) catalogs the May 2026 open-model wave: Gemma 4 (Google, 26B-A4B MoE plus dense variants, KV sharing plus per-layer embeddings), DeepSeek V4 (Pro 1.6T-A49B and Flash 284B-A13B, mHC plus compressed attention), Kimi K2.6 (moonshotai, long-horizon focused), GLM-5.1, Laguna XS.2 (poolside, 33B-A3B coding), MiMo-V2.5-Pro (Xiaomi, Apache 2.0). Every one is or includes a MoE. Every one chose different (M, Ne, K) tradeoffs without a principled recipe. The MoE-muP paper's prescription is the back-fit that would tell you which of those tradeoffs is on the scale-stable frontier and which is fragile to width or depth changes.
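To make the (M, Ne, K) tradeoff concrete, the following back-of-envelope sketch counts total versus active expert parameters per MoE layer. It assumes each expert is a plain two-matrix feed-forward block (an N x Ne up-projection and an Ne x N down-projection) and ignores router, attention, and embedding parameters. The two configurations are hypothetical, not any shipped model's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoELayer:
    n_experts: int      # M
    expert_width: int   # Ne
    top_k: int          # K, experts active per token
    model_width: int    # N

    def params_per_expert(self) -> int:
        # Assumed two-matrix FFN: up (N x Ne) plus down (Ne x N).
        return 2 * self.model_width * self.expert_width

    def total_params(self) -> int:
        return self.n_experts * self.params_per_expert()

    def active_params(self) -> int:
        return self.top_k * self.params_per_expert()

# Two hypothetical layers with identical active parameters per token.
few_wide = MoELayer(n_experts=8, expert_width=8192, top_k=2, model_width=4096)
many_narrow = MoELayer(n_experts=64, expert_width=2048, top_k=8, model_width=4096)

for layer in (few_wide, many_narrow):
    ratio = layer.active_params() / layer.total_params()
    print(layer, layer.active_params(), f"active fraction {ratio:.3f}")
```

Both layers spend the same active parameters per token yet differ twofold in total capacity, which is the kind of matched-compute choice that, per the text above, has so far been made without a scaling rule.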

Why it matters: Frontier MoE training runs cost tens of millions of dollars per sweep iteration. If MoE-muP's MSSP recipe transfers cleanly, the cost-of-discovery for new MoE architectures drops by roughly an order of magnitude. Pair with the Fireworks training-platform updates this week (full-parameter Kimi K2.6 tuning at 256K, GLM 5.1 LoRA RL): the commodified training-platform layer plus a principled scaling recipe means smaller labs can choose MoE hyperparameters in advance rather than by expensive empirical sweep.

Research angle: Four open problems. (1) MoE-muP plus BEAM joint formulation. Falsifiable in one paper: train a backbone family across a 10-100x scale jump with BEAM-style binary masks and MoE-muP scaling, measure both 98%+ retention (BEAM's headline) and hyperparameter transfer (muP's headline). (2) MoE-muP under hybrid attention. Modern MoEs increasingly mix softmax with linear, sliding-window, or compressed-convolutional attention (Raschka's catalog). Whether MSSP holds under hybrid attention is unaddressed in the paper. (3) Back-fitting MSSP against frontier recipes. The simplest empirical falsifier is to compare MSSP's prescribed (M, Ne, K) for a given budget against Kimi K2.6's, DeepSeek V4's, and Gemma 4 26B-A4B's actual choices once those tech reports surface. (4) MSSP for RL hyperparameters. Whether the DMFT machinery extends from pre-training to RLVR hyperparameter transfer is the natural next theorem.

→ Full summary

Raschka's architecture survey: the empirical complement to MoE-muP

Four architectural moves that ship in the May 2026 open-model wave, each spending its novelty on long-context efficiency.

Source: Sebastian Raschka, Ahead of AI (Gmail-starred 2026-05-16)
Links: Original post · Wiki
Tier: 1 (KV cache, attention compression, MoE architecture)

   Gemma 4               Laguna XS.2          ZAYA1-8B             DeepSeek V4
   (Google, Apache)      (poolside, 33B-A3B)  (8B)                 (1.6T-A49B + 284B-A13B)
   ────────────────      ──────────────────   ────────────────     ────────────────────
   KV sharing across     Layer-wise           Compressed           mHC (multi-Head
   layers (later layers  attention            convolutional        Compression) plus
   reuse earlier KV)     budgeting            attention            compressed attention

   E2B: 35 layers,
   first 15 own KV,                                                 mHC: aggressive per-
   final 20 reuse                                                   head output compression
   2.7 GB saved at 128K
   Per-Layer Embeddings
   ("effective" param)

   Common axis: every move targets long-context efficiency, not raw capability

Gemma 4's KV sharing is structurally the cheapest move on the cache axis: no learning, no eviction policy, just architectural reuse where later layers consume the K and V projections of earlier non-shared layers of the same attention type (sliding-window pairs with sliding-window, full attention pairs with full attention). For E2B with 35 layers, only the first 15 compute their own KV; the final 20 reuse. The capacity trade is real (the cross-layer attention paper argues it is small for the model sizes tested) but the memory savings are large at long context: ~2.7 GB at 128K on the 2B model, ~6 GB on the 4B model. Per-Layer Embeddings are the orthogonal move: small per-layer embedding tables that let the model carry token-specific information without growing the main transformer-stack compute. The "E" in E2B and E4B stands for "effective": E2B reports 2.3B effective parameters but 5.1B total counting embeddings; the compute is set by the smaller number.
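The cache arithmetic behind cross-layer KV sharing can be sketched as follows. Only the 35-layer / 15-owning split comes from the text above. The KV-head count, head dimension, and 2-byte storage are hypothetical placeholders, and the sketch treats every layer as full attention (it ignores sliding-window layers), so the absolute figures will not match the reported ~2.7 GB.

```python
def kv_cache_bytes(layers_with_own_kv: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache K and V for the layers that store their own KV."""
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value  # K and V
    return layers_with_own_kv * seq_len * per_token_per_layer

TOTAL_LAYERS = 35      # from the text (E2B)
OWNING_LAYERS = 15     # from the text: first 15 layers compute their own KV
KV_HEADS = 2           # hypothetical
HEAD_DIM = 256         # hypothetical
SEQ_LEN = 128 * 1024   # 128K context

unshared = kv_cache_bytes(TOTAL_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
shared = kv_cache_bytes(OWNING_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
saved_fraction = 1 - shared / unshared

print(f"unshared: {unshared / 2**30:.2f} GiB")
print(f"shared:   {shared / 2**30:.2f} GiB")
print(f"saved:    {saved_fraction:.1%}")
```

Whatever the per-layer dimensions, the saved fraction in this simplified model depends only on the layer split: 20 of 35 layers, about 57%.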

DeepSeek V4's mHC is the more aggressive cousin. Where Gemma 4 shares K and V across layers, DeepSeek V4 compresses the per-head outputs along the residual stream, in addition to the DSA (DeepSeek Sparse Attention) selector that the wiki has tracked since MISA (05-11). The wiki has a Forcing-KV (05-15) summary showing that attention heads in video diffusion models cluster into static and dynamic functional roles, where static roles tolerate aggressive compression and dynamic roles do not. DeepSeek V4's mHC is suggestive evidence that the same head-role separation exists in text-LLM attention; what is missing is a Forcing-KV-style paper that quantifies the static-dynamic split in text-LLM heads and demonstrates that mHC's quality drop concentrates on dynamic heads.

The wiki's KV cache concept page now has three orthogonal axes of cache compression in active research: learned eviction policy (Make Each Token Count 05-12, the paper that scored each cached entry with a small projection and showed selective retention can surpass the full cache), head-role compression (Forcing-KV 05-15, applied to video diffusion), and architectural sharing (Gemma 4 today). They compose: an architecture that shares KV across some layers, evicts within each layer using a learned policy, and compresses by head role within each kept entry would compound to roughly one half times one half times one half, an eightfold memory reduction, at a modest quality drop.
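The composition estimate is a product of retained fractions. A minimal sketch, assuming the three mechanisms act independently and each retains about half of what it is applied to (the digest's round numbers, not measured values):

```python
from math import prod

# Retained fraction of the KV cache after each mechanism (assumed independent).
retained = {
    "architectural sharing": 0.5,   # e.g. cross-layer KV reuse
    "learned eviction": 0.5,        # assumed round number
    "head-role compression": 0.5,   # assumed round number
}

combined = prod(retained.values())
print(f"retained {combined:.3f} of the cache, a {1 / combined:.0f}x reduction")
```

Substituting the E2B layer split quoted earlier (15 of 35 layers owning KV, a retained fraction of about 0.43) for the first factor would give roughly 9x instead of 8x under the same independence assumption.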

Why it matters: This is the most concentrated month of open-weight architectural change since Mixtral 8x7B. Six frontier-tier open MoEs landed in roughly two weeks (Open Artifacts #21, wiki summary). The architectural diversity is now wide enough that the MoE-muP scaling theory landing the same week becomes immediately useful: it lets the next lab choose the (M, Ne, K) tradeoffs in advance rather than picking by intuition. The CAISI (Center for AI Standards and Innovation) IRT-Elo gap analysis published the same week reports a widening open-closed gap, but Florian inside Interconnects argues a substantial fraction is harness artifact (the open models are evaluated in a strict bash-plus-token-budget setup, not in their preferred coding agent). The WildClawBench 18-point harness spread (05-15) is the canonical example of why a strict harness-naive Elo can underestimate real-world capability.

Research angle: (1) Forcing-KV-style head-role analysis for DeepSeek V4 mHC. Whether mHC compresses static heads (and so survives) or compresses dynamic heads (and so degrades on specific tasks). Falsifiable. (2) KV-sharing fraction under MoE-muP. Gemma 4 chooses roughly half. MSSP would predict an optimum. The first easy empirical falsification of either claim. (3) CAISI re-evaluation with preferred-harness control. Expect 5-10 Elo points of compression in the open-closed gap, based on WildClawBench's 18-point harness spread.

→ Full summary

MTP merges into llama.cpp: the consumer-hardware envelope of speculative decoding

Multi-Token Prediction in llama.cpp upstream. Strix Halo confirms 27B wins, 35B is mixed.

Source: r/LocalLLaMA, multiple posts on 2026-05-16/17
Links: PR #22673 · Strix Halo benchmark thread · Wiki
Tier: 1 (speculative decoding, on-device inference)

MTP, Multi-Token Prediction, is the speculative-decoding family where the model emits multiple draft tokens per forward pass and verifies them in a single subsequent pass. The PR that landed it in llama.cpp is the moment frontier-paper claims about speculative-decoding speedup become reproducible on consumer hardware. Strix Halo (AMD Ryzen AI Max+ 395 with integrated Radeon 8060S, 96 GB of unified RAM available as VRAM) benchmarks on Qwen3.6 single-file canvases tell the practical story. At 27B in 5-turn chat with ~28.5K context, wall-clock drops from 258.65s to 200.55s (-22.46% total time, -26.51% on turns 2-5), generation throughput jumps from 7.61 to 17.98 tokens per second (+136%), and prompt-processing throughput drops from 254.20 to 207.87 tokens per second (-18%). At 35B in single-turn 15K context the picture flips: generation rises ~17% but prompt processing drops ~16%, and the total wall-clock regresses ~11%. The crossover point sits roughly at the 30B mark and depends sensitively on multi-turn vs single-turn workload.
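The reported percentages can be re-derived from the raw 27B figures quoted above. The helper name is ours.

```python
def pct_change(before: float, after: float) -> float:
    """Signed percentage change from before to after."""
    return (after - before) / before * 100.0

wall_clock = pct_change(258.65, 200.55)   # seconds, 5-turn chat
generation = pct_change(7.61, 17.98)      # generation throughput
prompt = pct_change(254.20, 207.87)       # prompt-processing throughput

print(f"wall-clock: {wall_clock:+.2f}%")
print(f"generation: {generation:+.2f}%")
print(f"prompt:     {prompt:+.2f}%")
```

A 136% throughput gain corresponds to a roughly 2.4x generation speedup (17.98 / 7.61).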

This is the consumer-hardware confirmation of what the wiki's speculative-decoding concept page has been tracking. Orthrus (05-14, the dual-view diffusion paper that runs an autoregressive head and a diffusion head on the same frozen LLM and the same shared KV cache, achieving 7.8x speedup with bit-identical output) is the high-end of this curve on H100. MTP in llama.cpp is the low-end. The same week saw Qwen3.6-35B-A3B with the little-coder harness hit 24.6% on Terminal-Bench 2.0, exceeding Gemini 2.5 Pro on Gemini CLI (19.6%) and Qwen3-Coder-480B on Terminus 2 (23.9%). Sub-10B models (Qwen3.5-9B at 9.2%) now register measurably on a hard agentic benchmark, rather than being assumed unworthy of a slot. The harness-as-load-bearing thread from WildClawBench (05-15, the agent benchmark that measured an 18-point spread between the worst and best agent harness running the same model on the same 60 long-horizon tasks) is now playing out at the consumer-hardware end of the curve.

Pair with Cutile-rs (the Rust-based DSL for CUDA kernels with stable B200 support that hit beta on r/CUDA the same week). Two pieces of the inference stack moved from research to production in seven days.

Why it matters: The composition of Lighthouse Attention (05-16, training-only kernel-decoupled long-context wrapper) plus MTP (today, consumer-hardware speculative decoding) plus Forcing-KV or Make Each Token Count (cache compression) is the concrete consumer-stack that delivers the "5-10x throughput on the same hardware in 2026 over 2025 with no model change" projection from 05-16's Big Picture. Two of the three pieces are now reproducibly running on a sub-$3K Strix Halo workstation.

Research angle: (1) MTP win/loss conditions formalized. The 27B-wins-35B-mixed split depends on draft model, workload, hardware. A community-curated rule-of-thumb is 30-60 days away. (2) Selective MTP per-turn inside agentic harnesses. When the harness emits short tool-call outputs frequently, prompt-processing throughput dominates and MTP's regression there hurts. Whether little-coder or Claude-Code-style harnesses can flip MTP on for completion turns and off for tool-call turns is an integration problem worth solving.
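A first-order way to reason about when MTP helps is to model wall-clock time as prompt tokens divided by prompt-processing throughput plus generated tokens divided by generation throughput. The sketch below plugs in the 27B Strix Halo throughputs quoted earlier. The additive model, the reading of both figures as tokens per second, and the example token counts are assumptions, and the model ignores KV-cache reuse across turns.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Throughput:
    prompt_tps: float      # prompt-processing tokens per second
    generation_tps: float  # generation tokens per second

    def wall_clock(self, prompt_tokens: int, generated_tokens: int) -> float:
        # Additive model: prefill time plus decode time.
        return (prompt_tokens / self.prompt_tps
                + generated_tokens / self.generation_tps)

BASELINE = Throughput(prompt_tps=254.20, generation_tps=7.61)
WITH_MTP = Throughput(prompt_tps=207.87, generation_tps=17.98)

def break_even_ratio(base: Throughput, mtp: Throughput) -> float:
    """Prompt-to-generated token ratio at which MTP stops paying off."""
    decode_saving_per_token = 1 / base.generation_tps - 1 / mtp.generation_tps
    prefill_cost_per_token = 1 / mtp.prompt_tps - 1 / base.prompt_tps
    return decode_saving_per_token / prefill_cost_per_token

ratio = break_even_ratio(BASELINE, WITH_MTP)
print(f"MTP wins while prompt/generated tokens < {ratio:.0f}")

# Hypothetical turns: a long completion versus a short tool call.
for name, p, g in [("completion turn", 4000, 800), ("tool-call turn", 6000, 40)]:
    faster = WITH_MTP.wall_clock(p, g) < BASELINE.wall_clock(p, g)
    print(name, "-> MTP on" if faster else "-> MTP off")
```

Under these assumptions the break-even sits near 86 prompt tokens per generated token, so a harness-level toggle would only matter for turns that are almost entirely prefill.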

→ Full summary

CurveBench: the visual-reasoning ceiling lowers again

Gemini 3.1 Pro reaches 71.1% on Easy and 19.1% on Hard. RLVR fine-tuning of an 8B open model lifts Easy from 2.8% to 33.3%, exceeding GPT-5.4 and Claude Opus 4.5.

Source: HuggingFace Daily Papers 2026-05-17
Links: Paper · Wiki
Tier: 2 (visual reasoning benchmark, RLVR fine-tuning evidence)

CurveBench is 756 images of nested non-intersecting Jordan curves (closed planar curves that do not cross each other), annotated with the rooted tree that encodes which curves contain which. The task is structured prediction: recover the full containment tree from the image. Easy is humans-at-a-glance trivial. Gemini 3.1 Pro reaches 71.1% on Easy and 19.1% on Hard. RLVR fine-tuning of Qwen3-VL-8B, with verifiable rewards on the tree-generation task, lifts CurveBench-Easy from 2.8% (Qwen3-VL-8B-Thinking baseline) to 33.3%, exceeding GPT-5.4 and Claude Opus 4.5 under the same evaluation protocol.
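The target structure can be illustrated with a toy version of the task. The sketch below uses circles as stand-ins for Jordan curves (an assumption made for brevity; the benchmark uses images of arbitrary closed curves), builds the rooted containment tree by assigning each curve the tightest curve that encloses it, and scores a prediction with an exact-match reward. The exact-match reward is our guess at what a verifiable reward could look like, not the paper's stated definition.

```python
import math
from typing import Dict, List, Optional, Tuple

Circle = Tuple[float, float, float]  # (center_x, center_y, radius)

def contains(outer: Circle, inner: Circle) -> bool:
    """True if inner lies strictly inside outer (no touching or crossing)."""
    ox, oy, o_r = outer
    ix, iy, i_r = inner
    return math.hypot(ox - ix, oy - iy) + i_r < o_r

def containment_tree(curves: List[Circle]) -> Dict[int, Optional[int]]:
    """Map each curve index to its parent index; None means child of the root."""
    parents: Dict[int, Optional[int]] = {}
    for i, curve in enumerate(curves):
        enclosing = [j for j, other in enumerate(curves)
                     if j != i and contains(other, curve)]
        # The parent is the tightest enclosing curve, i.e. the smallest radius.
        parents[i] = (min(enclosing, key=lambda j: curves[j][2])
                      if enclosing else None)
    return parents

def exact_match_reward(predicted: Dict[int, Optional[int]],
                       truth: Dict[int, Optional[int]]) -> float:
    """Binary verifiable reward: 1.0 only if the whole tree is correct."""
    return 1.0 if predicted == truth else 0.0

curves = [
    (0.0, 0.0, 10.0),   # 0: outermost
    (-4.0, 0.0, 3.0),   # 1: inside 0
    (-4.0, 0.0, 1.0),   # 2: inside 1
    (5.0, 0.0, 2.0),    # 3: inside 0, sibling of 1
    (30.0, 0.0, 4.0),   # 4: separate top-level curve
]
tree = containment_tree(curves)
print(tree)
print(exact_match_reward(tree, {0: None, 1: 0, 2: 1, 3: 0, 4: None}))
```

An all-or-nothing reward like this one gives no partial credit for a tree that is wrong in a single edge, which is one plausible reason the Hard split is so much lower than Easy; that reading is ours, not the paper's.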

This is the third VLM benchmark in a month to lower the previously-reported frontier-VLM ceiling on a visually simple task. WildTableBench (05-15, the 402-image table-reading benchmark where only one of 21 frontier multimodal models crosses 50%) made the case for tables. MemEye and MemLens (05-15, multi-session multimodal benchmarks capped below 30%) made the case for memory. CurveBench makes the case for topological structure. The pattern threshold of three is now crossed: the structural-representation gap in VLMs is real, the failure mode is not perception (the curves are visible), and the obvious next research move is a representational intervention that improves all three at once.

RLVR tuning closes a large share of the model-capability gap, though not all of it. Qwen3-VL-8B-Thinking starts at 2.8% on CurveBench-Easy and ends at 33.3% after RLVR; Gemini 3.1 Pro sits at 71.1%. The 30.5-point RLVR jump on a single 8B model closes roughly 45% of the original 68.3-point gap to the closed frontier model, leaving 37.8 points. This is the visual-reasoning analog of SU-01's 200-RL-step math-olympiad recipe (summary, the Shanghai AI Lab paper that hit gold-medal IMO 2025 plus USAMO 2026 plus IPhO 2024-2025 with a 30B-A3B model using a reverse-perplexity SFT curriculum plus 200 RL steps).

Why it matters: Any VLM-using product that depends on accurate structural parsing of visual inputs is shipping at lower reliability than benchmark numbers suggest. The pattern of "humans easy, frontier VLMs at 20-70%" is now reproducible on three benchmarks in one month. Practitioners should treat VLM structural-parsing accuracy as a per-task RLVR problem, not a per-model capability assumption.

Research angle: (1) Cross-benchmark transfer of CurveBench RLVR. Does the Qwen3-VL-8B RLVR recipe that lifts CurveBench-Easy also lift WildTableBench or MemEye? Falsifiable in one paper. (2) ATLAS (05-15, the functional-token paper where a single discrete token serves both as agentic operation and latent visual reasoning unit) plus CurveBench. The natural eval for whether ATLAS-style latent visual reasoning generalises to topological structure.

→ Full summary

LIFE: the foundation reference for LLM multi-agent systems

200+ paper survey along four causally linked stages. Error propagation across agents is the central under-examined risk.

Source: HuggingFace Daily Papers 2026-05-17 (also @dair_ai retweet via @bayesiansapien, 05-16)
Links: Paper · Wiki
Tier: 2 (multi-agent systems, survey, foundation reference)

The LIFE survey organises 200+ multi-agent-system papers along four causally linked stages: Lay the capability foundation (individual agent capabilities), Integrate agents through collaboration (orchestration patterns), Find faults through attribution (failure diagnosis, in the AgentLens sense of distinguishing right-answer-right-process from Lucky Pass), Evolve through autonomous self-improvement (the EvolveMem, EvoEnv, Orchard, FrontierSmith line). The framing's load-bearing claim is causal dependency: a system that has not crossed Stage 3 (failure attribution) cannot run Stage 4 (autonomous self-improvement) without amplifying its existing failure modes. The under-examined risk the survey names is error propagation across agents and interaction rounds: in tightly coupled multi-agent systems, the agent where a failure surfaces is rarely the agent that caused it.

The wiki's multi-agent-systems concept page has tracked individual building blocks without an organising frame. LIFE supplies the frame. Three uses for the wiki: (a) LIFE's Stage 4 unifies six lines of work the wiki has been tracking separately. EvolveMem (05-15, self-evolving retrieval configuration via AutoResearch on the agent's own architecture), Orchard (05-15, credit-assignment SFT learning from productive segments of unresolved trajectories), SDAR (05-15, sigmoid-gated on-policy self-distillation inside multi-turn RL), EvoEnv (05-15, verifiable RL environments with solve-verify asymmetry as the structural invariant), FrontierSmith (05-16, the open-ended coding-problem generator with idea-divergence filter), and Sylph AI (05-16 social-stream, the Worker-Evaluator-Evolution loop that automates harness construction) are all Stage 4 in the LIFE taxonomy. (b) LIFE Stage 3 names the AgentLens-style failure attribution intervention (AgentLens, 05-14, the process-aware labeling system that found 10.7% of passing SWE-bench Verified trajectories are Lucky Passes where the right answer fell out for the wrong reasons). (c) WildClawBench's 18-point harness spread (05-15) is the empirical surface where LIFE Stage 2 dependencies surface; the LIFE framing predicts that an 18-point spread implies under-developed Stage 2 standardisation across the field.

Why it matters: Survey papers are usually disposable. LIFE earns its place because the field has accumulated enough multi-agent work that the structural dependencies among the four stages are starting to bite empirically. A shared vocabulary for talking about cross-stage dependencies is a precondition for the closed-loop research the survey calls for. The wiki should adopt the LIFE taxonomy in multi-agent-systems.md and agent-memory.md.

Research angle: (1) Closed-loop LIFE benchmark. No public benchmark yet runs all four stages end-to-end as one measured loop. The cross-stage agenda the survey calls for is unbuilt. Falsifiable: a benchmark plus frontier-model evaluation that reports each stage's contribution to the next, within 90 days. (2) Lucky-Pass rate by LIFE stage. AgentLens reported 10.7% Lucky-Pass on single-agent SWE-bench Verified. LIFE predicts that rate compounds with multi-agent collaboration. Falsifiable.
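The compounding claim in (2) can be made concrete under a deliberately simple assumption: each agent's contribution is independently a Lucky Pass at the 10.7% single-agent rate reported by AgentLens. Independence and the reuse of the single-agent rate are our assumptions, not claims from the survey; real multi-agent failures are likely correlated.

```python
def p_any_lucky_pass(per_agent_rate: float, n_agents: int) -> float:
    """Probability that at least one of n independent contributions is a Lucky Pass."""
    return 1.0 - (1.0 - per_agent_rate) ** n_agents

SINGLE_AGENT_RATE = 0.107  # AgentLens figure quoted above

for n in (1, 2, 3, 5):
    print(n, f"{p_any_lucky_pass(SINGLE_AGENT_RATE, n):.3f}")
```

Even this optimistic model gives a useful null hypothesis: a measured multi-agent Lucky-Pass rate well above the independent-compounding curve would point to error propagation rather than mere accumulation.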

→ Full summary

Industry Pulse

The May 2026 open-model wave consolidates. Interconnects' Open Artifacts #21 (Gmail-starred) catalogs six frontier-tier open MoEs in roughly two weeks: Gemma 4 (Google, Apache 2.0, 4B / 9B / 31B dense plus 26B-A4B MoE), Kimi K2.6 (moonshotai, long-horizon focused), GLM-5.1, Qwen3.6 (35B-A3B variant open), Laguna XS.2 (poolside, 33B-A3B coding-focused), MiMo-V2.5-Pro (Xiaomi, Apache 2.0), DeepSeek V4 (Pro 1.6T-A49B and Flash 284B-A13B; Flash is the practitioner pick). The CAISI (Center for AI Standards and Innovation) Item-Response-Theory Elo analysis reports a widening open-closed gap; Florian inside Interconnects argues a substantial fraction is harness artifact (open models evaluated without their preferred coding agent). Epoch AI's ECI (Epoch Capabilities Index), which uses a different IRT methodology over different benchmarks, reports the gap holding steady at 3-7 months since DeepSeek R1. Two IRT-based methodologies producing different headline numbers is the signal that methodology drives conclusion. The wiki's WildClawBench 18-point harness-spread thread (05-15) is the direct empirical surface for this dispute.
Fireworks AI training-platform updates (Gmail-starred). Kimi K2.6 full-parameter tuning with 256K context now available; GLM 5.1 LoRA RL live via Training API with SFT/DPO/full-RL on 200K context; Qwen3.6 27B fully enabled via Managed Fine-Tuning with 128K and 256K context; Gemma 4 Dense Full-Param plus LoRA RL with SFT/DPO/RL on 256K. The training-platform layer is keeping pace with the model-release cadence at 1-2 week lag, which means smaller labs can train against the May wave without owning a training stack.
MTP merged into llama.cpp upstream (PR #22673). Speculative decoding now runs on consumer hardware. Strix Halo benchmark on Qwen3.6-27B at 5-turn chat shows 22.46% wall-clock improvement; at 35B the gain inverts. The crossover sits around 30B and depends on multi-turn vs single-turn workload.
Cutile-rs beta on r/CUDA (discussion #146). Rust-based DSL for CUDA kernels with B200 Blackwell support. Author claims peak performance and cleaner syntax than Triton. The kernel-authoring layer of the AI stack is diversifying away from Python plus Triton.
Anthropic researcher @_sholtodouglas opens DMs for "when do you reach for other models instead of Claude" (@_sholtodouglas tweet). Operational signal that Anthropic is in next-model-design feedback mode. Combined with the Anthropic $900B valuation context (05-15), the $200M Gates Foundation partnership (05-15), and the 2028 US-China policy paper (05-15), Anthropic is in three coordinated motions at once: valuation, civic-infrastructure framing, technical feedback gathering.

Connecting the Dots

   Theory side (Kurate)                    Empirical side (Gmail, HF, Reddit)

   MoE-muP (Vankadara et al.)              Open-model wave (Gemma 4, DeepSeek V4,
   first principled MoE scaling rule       Laguna XS.2, Kimi K2.6, GLM-5.1,
   ─────► closed-form MSSP across           MiMo-V2.5-Pro, ZAYA1-8B)
          M, Ne, K, N, L                   Raschka catalog: KV sharing, mHC,
          covers SGD/Adam/Adafactor        compressed attention, layer budgeting
                                            CAISI gap analysis: methodology-driven

                          ▲                            ▲
                          │                            │
                          └────────────┬───────────────┘
                                       ▼
                       MSSP back-fit against published recipes
                       is one paper away. First easy falsifier:
                       Gemma 4 KV-sharing fraction vs MSSP prediction.

   Deployment substrate continuation       Eval-ceiling pattern crystallizes (3rd)

   Lighthouse Attention (05-16,            CurveBench (today): Gemini 3.1 Pro
   training-only kernel-decoupled          71.1% Easy, 19.1% Hard. RLVR on
   long-context wrapper)                   Qwen3-VL-8B: 2.8% -> 33.3%, beats GPT-5.4
              │                            and Claude Opus 4.5
              ▼                            WildTableBench (05-15): 1 of 21 above 50%
   MTP in llama.cpp (today,                MemEye / MemLens (05-15): below 30%
   speculative decoding on                 Pattern: VLM structural-representation gap
   consumer hardware)                      is real, RLVR fine-tuning closes most of it
              │
              ▼
   Cutile-rs (today, Rust DSL              Stage 4 multi-agent self-evolution unified
   for B200 kernels)                       LIFE survey (today): EvolveMem, Orchard,
              │                            SDAR, EvoEnv, FrontierSmith, Sylph AI
              ▼                            all are Stage 4. The cluster has its name.
   "5-10x throughput on same
   hardware in 2026 over 2025"
   piece-by-piece confirmed

Cross-paper thread #1: the MoE routing surface now has both empirical layers and a theoretical foundation. Vankadara et al.'s MoE-muP paper (today, Kurate cs.LG #13 with ai_rating 9.0/10, the first principled scaling theory for Mixture-of-Experts deriving closed-form prescriptions for initialization, learning rate, weight decay, and routing temperature across the five axes of expert count M, expert width Ne, routing sparsity K, network width N, and depth L) is the theoretical foundation under the empirical wave. BEAM (05-16, the paper that replaced fixed top-K MoE routing with a per-token learned binary mask trained end-to-end via straight-through estimator, with 98%+ retention at up to 85% MoE FLOP reduction) optimises the K dimension within a fixed scale. DLR (05-15, the paper that jointly trains discrete latent codes and routing policy with causally distinct ablation effects) tunes how the routing policy is learned. CaRE (05-11, the paper that adds a task-level router above existing MoE experts) addresses the meta-routing layer. RouteProfile (05-15, the paper showing structured trainable candidate-model profiles beat flat domain-level descriptions for generalisation) addresses the candidate-description axis. Four orthogonal routing layers with empirical wins, plus one theoretical recipe for how to scale the MoE architecture under all of them. The joint composition has not been written; the natural test is a single backbone family with BEAM-style per-token masks tracking DLR-style learned codes scaled per the MoE-muP MSSP recipe.
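
For readers who want the K axis made concrete, the sketch below contrasts fixed top-K routing with a per-token variable expert count. It is an illustration only: the thresholded mask is a hand-written stand-in, not BEAM's learned mask or its straight-through training, and none of the MoE-muP scaling rules are reproduced here. Function names and the threshold value are assumptions.

```python
import numpy as np


def top_k_route(logits: np.ndarray, k: int) -> np.ndarray:
    """Fixed top-K routing: softmax over each token's k largest logits, zero elsewhere."""
    top_idx = np.argpartition(logits, -k, axis=1)[:, -k:]
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=1)
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)


def threshold_route(logits: np.ndarray, threshold: float) -> np.ndarray:
    """Variable expert count per token: keep experts whose softmax prob >= threshold.

    The best expert is always kept so every token is routed somewhere.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs = probs / probs.sum(axis=1, keepdims=True)
    mask = probs >= threshold
    best = probs.argmax(axis=1)
    mask[np.arange(len(best)), best] = True
    gated = np.where(mask, probs, 0.0)
    return gated / gated.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 8))  # 4 tokens, 8 experts (toy sizes)
    fixed = top_k_route(logits, k=2)
    variable = threshold_route(logits, threshold=0.2)
    print("active experts per token, top-2:    ", (fixed > 0).sum(axis=1))
    print("active experts per token, threshold:", (variable > 0).sum(axis=1))
```

With fixed top-K every token pays for exactly K experts; with a per-token mask the active-expert count varies by token, which is where a FLOP reduction of the kind BEAM reports would come from.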

Cross-paper thread #2: the empirical-architecture wave validates and extends Tier 1 cache-compression work. Gemma 4's KV sharing (later layers reuse earlier non-shared K and V projections, halving cache size, 2.7 GB saved at 128K) is the architectural cousin of Make Each Token Count (05-12, the learned-eviction policy paper that scored each cached entry with a small projection and showed selective retention can surpass the full cache). DeepSeek V4's mHC (multi-Head Compression, aggressive per-head output compression along the residual) is the text-LLM cousin of Forcing-KV (05-15, the head-role-conditioned KV cache compression for video diffusion that found static heads tolerate aggressive pruning while dynamic heads do not, delivering 30% memory reduction at 29+ fps on H200). Three orthogonal axes of cache compression are now in active research: learned eviction (Make Each Token Count), head-role compression (Forcing-KV plus DeepSeek mHC), architectural sharing (Gemma 4). They compose multiplicatively. The missing Tier 1 paper is a Forcing-KV-style head-role characterisation for text-LLM heads, which would confirm or refute the assumption that mHC succeeds because it compresses static heads.
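
A back-of-envelope sketch of what "compose multiplicatively" would mean for cache size. The model configuration is hypothetical (it is not Gemma 4's or DeepSeek V4's published config), the 0.7 head-compression factor borrows Forcing-KV's 30% figure from video diffusion purely for illustration, and the 0.8 eviction factor is made up. The multiplication also assumes the three mechanisms act independently, which has not been demonstrated.

```python
from math import prod


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of a dense KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def composed_retention(*fractions: float) -> float:
    """Fraction of the cache kept if each technique independently keeps `fraction`."""
    return prod(fractions)


if __name__ == "__main__":
    # Hypothetical config, fp16/bf16 cache, 128K context.
    base = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=128 * 1024)
    keep = composed_retention(
        0.5,  # architectural sharing: cache halved (the Gemma 4 claim)
        0.7,  # head-role compression: illustrative 30% reduction
        0.8,  # learned eviction: illustrative, assumed value
    )
    gib = 2 ** 30
    print(f"baseline cache: {base / gib:.2f} GiB")
    print(f"kept fraction:  {keep:.2f}")
    print(f"composed cache: {base * keep / gib:.2f} GiB")
```

If the mechanisms overlap (for example, eviction mostly removing entries that head compression would have shrunk anyway), the real saving would be smaller than the product suggests.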

Cross-paper thread #3: speculative decoding crosses the consumer-hardware threshold. Orthrus (05-14, the dual-view diffusion paper that runs an autoregressive head and a diffusion head on the same frozen LLM sharing one KV cache, achieving 7.8x speedup with bit-identical output) is the high end of speculative decoding, on H100. MTP in llama.cpp (today, Multi-Token Prediction merged in PR #22673) is the low end, on Strix Halo. Cutile-rs (today, the Rust DSL for CUDA kernels with B200 Blackwell support) is the kernel-authoring layer that could make both reproducible. Three pieces of the inference stack moved from research to production in one week, all on the Tier 1 efficiency axis. The 05-16 Big Picture projection, that Lighthouse Attention plus speculative decoding plus cache compression compose to 5-10x throughput on the same hardware in 2026 over 2025, is now piece-by-piece reproducible on a sub-$3K consumer workstation.
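
Why a speculative-decoding gain can invert is easier to see with the standard expected-acceptance model from the speculative decoding literature: with per-token acceptance rate a and g drafted tokens, one verification step yields (1 - a^(g+1)) / (1 - a) tokens in expectation. The sketch below adds a `verify_cost` knob (cost of one verification pass relative to a plain single-token pass), which is an assumption introduced here to model compute-bound hardware. All numbers are illustrative, not the Strix Halo measurements.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verification step, assuming i.i.d. acceptance."""
    if accept_rate >= 1.0:
        return n_draft + 1.0
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def speculative_speedup(accept_rate: float, n_draft: int,
                        draft_cost: float, verify_cost: float = 1.0) -> float:
    """Speedup over plain decoding.

    draft_cost:  cost of drafting one token, relative to one target forward pass.
    verify_cost: cost of one verification pass, relative to one target forward
                 pass (1.0 when decoding is memory-bound; assumed > 1.0 otherwise).
    """
    step_cost = n_draft * draft_cost + verify_cost
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost


if __name__ == "__main__":
    # Illustrative regimes: values above 1.0 are a win, below 1.0 a loss.
    print(f"{speculative_speedup(0.8, 3, draft_cost=0.05):.2f}")
    print(f"{speculative_speedup(0.3, 3, draft_cost=0.05, verify_cost=1.6):.2f}")
```

Under this model the crossover depends on acceptance rate (workload), draft cost, and how expensive batched verification is on the given hardware, which is consistent with a win at one model size and a loss at another.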

Cross-paper thread #4: VLM structural-representation gap is the third confirmed pattern. CurveBench (today, the nested-Jordan-curves containment-tree benchmark where Gemini 3.1 Pro reaches 71.1% Easy and 19.1% Hard) joins WildTableBench (05-15, the 402-image table-reading benchmark where only one of 21 frontier multimodal models crosses 50%) and the MemEye plus MemLens cluster (05-15, multi-session multimodal benchmarks capped below 30%). Three benchmarks in one month report the same diagnosis: the VLM failure mode on visually simple tasks is structural representation, not perception or world knowledge. The pattern threshold of three is now crossed. The obvious next research move is a representational intervention that improves all three at once; the empirical evidence that the gap is RLVR-learnable (Qwen3-VL-8B from 2.8% to 33.3% on CurveBench-Easy) suggests that targeted post-training, not architectural change, is the lower-cost intervention.

Cross-paper thread #5: Stage 4 self-evolution has its taxonomy. The LIFE survey (today, the 200+ paper multi-agent-systems survey along four causally linked stages: Lay capability foundation, Integrate via collaboration, Find faults via attribution, Evolve via self-improvement) supplies the organising frame for five clusters the wiki had been tracking separately. EvolveMem (05-15), Orchard (05-15), SDAR (05-15), EvoEnv (05-15), FrontierSmith (05-16), and Sylph AI (05-16 social-stream) are all Stage 4 in the LIFE taxonomy. The diversity within Stage 4 reflects which substrate the system evolves: data (FrontierSmith), environment (EvoEnv), retrieval configuration (EvolveMem), training procedure (Orchard, SDAR), harness (Sylph AI). LIFE makes the six-item cluster legible as one research line. The wiki should adopt the LIFE taxonomy in multi-agent-systems.md and treat error propagation across agents and interaction rounds as the central under-examined Stage 3 risk.

Cross-paper thread #6: industry value capture continues to move to the agent harness, now with the open-model substrate confirmed. Microsoft pulling Claude Code licenses on 05-15 was a procurement decision 48 hours after WildClawBench's 18-point harness spread paper. This week the open-model wave (Gemma 4, Kimi K2.6, DeepSeek V4, Laguna XS.2) ships frontier-tier weights with Apache or permissive licensing. Combined, the picture is: the model API is a commodity, the agent harness is where the spend lands, and the open-model substrate now exists for every lab that wants to compete on harness without competing on model. Anthropic's @_sholtodouglas opening DMs for "when do you reach for other models instead of Claude" is the next-model-design feedback signal that Anthropic recognises the substrate has shifted.

Worth Watching

MoE-muP MSSP recipe back-fit against frontier MoEs. 60-90 days. The first easy empirical falsifier: take Gemma 4 26B-A4B's published KV-sharing fraction, expert count, and expert width, and check whether MSSP would have predicted those choices at that compute budget. If MSSP predicts within 10% of the empirical choices for two or more of Gemma 4, Kimi K2.6, DeepSeek V4, the recipe is real. If MSSP is systematically off, the gap is informative about which dynamics the MSSP derivation underweights.
Forcing-KV-style head-role analysis for DeepSeek V4 mHC. 60 days. Whether mHC compresses static heads (and so survives) or compresses dynamic heads (and so degrades on specific tasks) is the natural diagnostic. Falsifiable: a paper running an ablation that maps DeepSeek V4 heads to static and dynamic roles and shows mHC's quality drop concentrates on dynamic-role compression.
CAISI re-evaluation with preferred-harness control. 60 days. Florian inside Interconnects argues the open-closed Elo gap is partly a harness artifact. WildClawBench's 18-point harness spread (05-15) is the right tool for the re-evaluation. Expect 5-10 Elo points of compression in the gap, roughly half the reported difference.
MTP in llama.cpp win/loss crossover formalized. 30-60 days. The 27B-wins-35B-mixes split observed on Strix Halo depends on draft model, workload, and hardware. A community-curated rule-of-thumb on r/LocalLLaMA is the most likely deliverable, followed by selective-per-turn MTP integration in agentic harnesses (where the harness flips MTP on for completion turns and off for short tool-call turns).
Cross-benchmark VLM RLVR transfer. 60 days. Does the Qwen3-VL-8B RLVR recipe that lifts CurveBench-Easy from 2.8% to 33.3% also lift WildTableBench or MemEye? If yes, the structural representation that RLVR is implicitly training is general and a single 100K-sample RLVR run can address all three benchmarks. If no, each benchmark needs targeted post-training and the deployment story for VLMs gets fragmented.
LIFE-style closed-loop multi-agent benchmark. 90 days. No public benchmark yet runs LIFE's four stages end-to-end as one measured loop. Falsifiable: a benchmark plus frontier-model evaluation that reports per-stage contribution to the next.
LLM-rated underrated from Kurate (current week). cs.AI #11 "Hodoscope: Unsupervised Monitoring for AI Misbehaviors" by Ziqian Zhong, Shashwat Saxena, and Aditi Raghunathan (ai_rating 7.2/10, unsupervised anomaly detection on model behavior signatures). Adjacent to today's LLM-based detection of manipulative political narratives paper (same pipeline shape: unsupervised clustering on LLM-labeled signal). cs.AI #12 "Emotion Concepts and their Function in a Large Language Model" (ai_rating 8.2/10, recurring from last week) with William Saunders and Tom Henighan; the structural absence from HuggingFace is now visible across two weeks. cs.AI #14 "Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models" (ai_rating 7.0/10) on alignment-faking surface diagnostics. cs.LG #10 "LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking" (ai_rating 6.8/10) is directly relevant to RLVR pipelines, including today's CurveBench fine-tuning recipe. cs.LG #11 "The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime" (ai_rating 7.8/10) on the limits of AI auditing when errors are sparse. cs.LG #12 "LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit" (ai_rating 8.0/10, recurring) argues a single shared circuit drives both sycophancy and confabulation; it pairs against the mechanistic-interpretability cluster from 05-16 morning that argued circuits are non-unique (the "All Circuits Lead to Rome" thread). Two papers in tension: this one says one circuit; the 05-16 cluster says many circuits. Worth resolving. cs.LG #13 "How to Scale Mixture-of-Experts" is today's Tier 1 Deep Dive (ai_rating 9.0/10).
Rising authors from Kurate. No authors crossed the threshold this week. Threshold review remains a connectors/kurate/farmer.py calibration question; the past three runs have produced zero crossings.
Cross-source confirmation (HF + Kurate). Today's HuggingFace top and the current Kurate cs.AI and cs.LG weeklies have no direct overlap. The cross-source-confirmed Tier 1 promotion rule did not fire this run.
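
The pass criterion in the MSSP back-fit item above ("within 10% for two or more models") can be written down as a small check. This is a sketch of the bookkeeping only: the model names and every number below are placeholders, not MSSP predictions or published model configurations, and treating a model as passing only when all of its compared quantities are within tolerance is an assumption about how the criterion would be applied.

```python
from typing import Dict, List, Tuple

Pairs = List[Tuple[float, float]]  # (predicted, observed) for each compared quantity


def within_tolerance(predicted: float, observed: float, rel_tol: float = 0.10) -> bool:
    """True if the prediction is within rel_tol of the observed value."""
    return abs(predicted - observed) <= rel_tol * abs(observed)


def count_passing_models(backfit: Dict[str, Pairs], rel_tol: float = 0.10) -> int:
    """A model passes only if it has data and every quantity is within tolerance."""
    return sum(
        bool(pairs) and all(within_tolerance(p, o, rel_tol) for p, o in pairs)
        for pairs in backfit.values()
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    backfit = {
        "model_a": [(64, 60), (2048, 2048)],
        "model_b": [(128, 96), (1024, 1024)],
        "model_c": [(8, 8), (4096, 3900)],
    }
    passing = count_passing_models(backfit)
    print(f"models within tolerance: {passing} of {len(backfit)}")
    print("criterion met" if passing >= 2 else "criterion not met")
```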

Quick Hits

RAVEN with CM-GRPO (arXiv 2605.15190). Real-time autoregressive video extrapolation with consistency-model GRPO (Group Relative Policy Optimization, the lightweight RL recipe most reasoning post-training pipelines now use). The training-time framework repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states so training attention aligns with inference-time extrapolation. CM-GRPO reformulates a consistency-sampling step as a conditional Gaussian transition and applies online RL directly to that kernel, avoiding the Euler-Maruyama auxiliary process used in prior flow-model RL formulations. Tier 3 video, but the CM-GRPO formulation is portable to other consistency-distilled few-step generation pipelines.
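
For context, the "group relative" part of GRPO is small enough to sketch: each rollout's reward is normalised against the other rollouts sampled for the same prompt, so no learned value model is needed. This shows only that baseline step as it is commonly described; it does not implement CM-GRPO's Gaussian-transition reformulation, and the function name and `eps` default are illustrative.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Advantage of each rollout relative to its own sampling group.

    rewards: scalar rewards for rollouts sampled from the same prompt.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


if __name__ == "__main__":
    # Toy group of four rollouts with binary rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group with identical rewards carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```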

CLVR (Closed-Loop Visual Reasoning) (arXiv 2605.14876). USTC team proposes a closed-loop text-to-image generation framework with three pieces: an automated data engine with step-level visual verification, Proxy Prompt Reinforcement Learning (PPRL) which distills interleaved multimodal histories into explicit reward signals for long-context optimization stability, and Delta-Space Weight Merge (DSWM) which fuses alignment weights with off-the-shelf distillation priors to drop per-step inference cost to 4 NFEs. Tier 3, but the closed-loop verified-reasoning structure is the visual-side analog of Stage 3 plus Stage 4 in the LIFE taxonomy.

Warp-as-History (arXiv 2605.15182). A frozen video generation model gains zero-shot camera-trajectory control by feeding camera-warped pseudo-history through the model's visual-history pathway. The zero-shot path needs no training, no architectural modification, and no test-time optimization. Separately, a small offline LoRA fine-tune on a single camera-annotated video generalises to unseen videos. Tier 3 vision; structurally interesting as another example of training-free wrapper-driven capability extraction from frozen models.

Re-surfaced from earlier batches. The HF feed for 2026-05-17 carries 53 entries, of which the substantive content for BEAM, Lighthouse Attention, ATESD, LiSA, FrontierSmith, SPIN, DLR, RouteProfile, Forcing-KV, WildClawBench, Orchard, SDAR, EvoEnv, SU-01, Darwin Family, ATLAS, RewardHarness, OmniBoost, IntentVLA, FEST, PRISM, DiffusionOPD, MemEye, MemLens, STALE, Preping, BOOKMARKS, EvolveMem, WildTableBench, Causal Forcing++, SANA-WM was already covered in the 2026-05-15 and 2026-05-16 digests. Treat those entries as continuing-thread context, not new ingestions.

Ideology Prediction of German Political Texts (arXiv 2605.14352). Transformer-based regression on a left-to-right political-orientation scalar. DeBERTa-large F1=0.844 in-domain, ACC=0.864 on Twitter out-of-domain; Gemma2-2B MAE=0.172 on newspaper out-of-domain. Companion to today's LLM-based detection of manipulative political narratives paper; two papers today formalise automated political-content classification with transformer stacks. Tier 2 responsible-ai.

Pace-and-Path Correction for VLA models (arXiv 2605.11459). Training-free, closed-form inference-time operator that wraps any chunked-action Vision-Language-Action model to handle non-stationary dynamics that single-frame-conditioned VLAs fail on. Decomposes into a pace channel (compresses execution along the planned direction) and a path channel (orthogonal spatial offset). Tier 4 robotics; structurally interesting as a clean closed-form wrapper.
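
The geometric idea behind a pace channel and a path channel is an orthogonal split of a correction vector relative to the planned motion direction. The sketch below shows only that split; it is not the paper's closed-form operator, and the function and variable names are hypothetical.

```python
import numpy as np


def split_along_and_across(correction: np.ndarray, planned_direction: np.ndarray):
    """Split `correction` into a part along the planned direction and an orthogonal part."""
    norm = np.linalg.norm(planned_direction)
    if norm == 0.0:
        raise ValueError("planned_direction must be non-zero")
    unit = planned_direction / norm
    along = np.dot(correction, unit) * unit   # pace-like component
    across = correction - along               # path-like component
    return along, across


if __name__ == "__main__":
    along, across = split_along_and_across(
        np.array([3.0, 4.0, 0.0]), np.array([2.0, 0.0, 0.0])
    )
    print("along planned direction:", along)
    print("orthogonal offset:      ", across)
```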

PanoWorld / PhyMotion / Realiz3D / SAT3DGen / VGGT-Edit. 3D and panoramic world models. Tier 4; skip.

FutureSim and Nexus. Abstracts still thin on the HF feed entries. FutureSim (arXiv 2605.15188) replays world events to evaluate adaptive agents; Nexus (arXiv 2605.14389) is an agentic framework for time-series forecasting. Skip until detailed methodology surfaces.

Topology-Preserving Neural Operator via Hodge Decomposition (arXiv 2605.13834). Hodge-orthogonality-based decomposition for neural operators on geometric meshes. Tier 4 (physics-informed ML), notable as one of the few hardcore mathematical results in today's HF feed.

Aligning Latent Geometry for Spherical Flow Matching (arXiv 2605.15193). Latent flow matching for image generation usually transports Gaussian noise to VAE latents along linear paths; the paper finds both endpoints concentrate in thin spherical shells and proposes geodesic (SLERP) paths. Tier 3 image generation, useful for image generators that already use latent flow matching.
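
A small numerical sketch of why the path shape matters when both endpoints sit on a thin shell: the straight-line midpoint of two points on a sphere falls inside the sphere, while the SLERP midpoint stays on it. This uses the textbook SLERP formula and assumes both endpoints have equal norm; the paper's actual path construction may differ, and the dimension and radius are arbitrary choices.

```python
import numpy as np


def slerp(x0: np.ndarray, x1: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between x0 and x1 (assumed equal norm)."""
    u0 = x0 / np.linalg.norm(x0)
    u1 = x1 / np.linalg.norm(x1)
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if omega < 1e-8:
        # Nearly parallel endpoints: fall back to linear interpolation.
        return (1.0 - t) * x0 + t * x1
    s = np.sin(omega)  # note: ill-conditioned for nearly antipodal endpoints
    return (np.sin((1.0 - t) * omega) / s) * x0 + (np.sin(t * omega) / s) * x1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 256
    radius = np.sqrt(dim)
    a = rng.normal(size=dim)
    b = rng.normal(size=dim)
    a *= radius / np.linalg.norm(a)
    b *= radius / np.linalg.norm(b)
    linear_mid = 0.5 * a + 0.5 * b
    slerp_mid = slerp(a, b, 0.5)
    print(f"shell radius:        {radius:.2f}")
    print(f"linear midpoint norm {np.linalg.norm(linear_mid):.2f}")
    print(f"slerp midpoint norm  {np.linalg.norm(slerp_mid):.2f}")
```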

Sources ingested today: HF (53 papers, of which 6 genuinely new beyond yesterday's batch and the rest re-surfaced from 05-15 and 05-16), Gmail (3 starred: Interconnects Open Artifacts #21, Fireworks training-platform update, Sebastian Raschka architecture survey), RSS (no new files; latest is 05-15), Twitter morning slot (2 sparse AI-handle tweets, 0 retweets, 0 articles), Kurate cs.AI plus cs.LG weekly leaderboards (no rising authors, no HF cross-source confirmation, but cs.LG #13 surfaces today's Tier 1 Deep Dive), Reddit (8 subs scraped; r/LocalLLaMA and r/CUDA produced substantive content, others empty), parallel Daily-Digest (no file for 2026-05-17 in /Users/amitsinghbhatti/Documents/Claude/Projects/Daily-Digest/) | Wiki pages updated: 7 summaries (MoE-muP, Raschka architecture survey, MTP in llama.cpp, Open Artifacts #21, LIFE survey, CurveBench, LLM-based political-narrative detection); concept pages will be updated in a follow-up pass since today's content extends the kv-cache and llm-routing pages most heavily.