May 18, 2026 · daily digest

cere-bro | 2026-05-18

Today's batch is the deployment-and-design Monday after a theory-heavy weekend. HodgeCover supplies the mathematically precise reason every prior learning-free MoE compressor caps out at moderate compression. AIRA shows LLM agents can autonomously discover Transformer and Transformer-Mamba hybrid architectures, the best of which scales 54% faster than Llama 3.2. Two RLVR papers, CIPO and NudgeRL, attack the same sparse-reward weakness from opposite sides. The Twitter morning slot is effectively empty for the second consecutive Monday, but Reddit and yesterday's Gmail-starred DAIR.AI weekly fill the gap with confirmation of last week's threads.


TL;DR


The Big Picture

The wiki's running MoE thread now has both forward and inverse directions in one week. On 2026-05-17 the Kurate cs.LG #13 paper (MoE-muP from Vankadara et al. at Gatsby UCL, the first principled scaling theory for Mixture-of-Experts deriving closed-form prescriptions for initialization, learning rate, weight decay, and routing temperature across the five MoE axes: number of experts M, expert width Ne, routing sparsity K, network width N, depth L) gave the forward recipe: how to scale a new MoE without sweeping. Today, HodgeCover gives the inverse: how to compress an existing MoE without retraining. The mathematical machinery is different (Dynamical Mean Field Theory for MoE-muP, simplicial Laplacians and Hodge decomposition for HodgeCover) but the design surface is shared. Both papers operate on the five-axis MoE space the wiki's llm-routing concept page and the BEAM / DLR / CaRE / RouteProfile cluster have been mapping for the last fortnight. The composition that has not yet been written: a frontier MoE pre-trained under MoE-muP's MSSP recipe, deployed under BEAM's per-token binary mask (2026-05-16, the paper that replaced fixed top-K MoE routing with a per-token learned binary mask trained end-to-end via straight-through estimator, achieving 98%+ retention at up to 85% FLOP reduction), then post-training compressed under HodgeCover's harmonic-coverage objective. Three orthogonal MoE knobs, three principled methods, one stack.

The second thread is RLVR moving from "what does the reward signal say" to "what should the policy do with each piece of that signal." CIPO and NudgeRL on the same day approach this from opposite directions. NudgeRL changes what gets explored by conditioning each rollout on a lightweight strategy-level context. CIPO changes how the failed rollouts get reused by converting them into correction-oriented supervision. Both attack the central RLVR weakness named by the 2026-04-21 paper on RLVR weak supervision (the paper that argued RLVR mostly redistributes probability mass over already-discovered correct answers rather than expanding intrinsic reasoning capacity). CIPO's pass@K-over-pass@1 gain is the cleanest counter-evidence to that critique to date. Whether the pass@K gain replicates on independent benchmarks like AIME 2026 is the load-bearing test for the next 30-60 days.

The third thread is the deployment-calibration gap surfacing in two different domains the same day. DiagnosticIQ on the industrial-maintenance side reports the frontier-LLM cluster within one Macro point, with brittleness as the discriminating axis: 49-63% of frontier-model answers persist under condition inversion. PAGER on the GUI side reports 88% action-type accuracy versus under 6% task success on precision-sensitive geometric tasks. Both papers identify that headline benchmark accuracy and deployment-relevant capability have decoupled along a structural axis. This is the same pattern WildClawBench named on 2026-05-15 (the agent benchmark that measured an 18-point spread between best and worst agent harness running the same model on the same 60 long-horizon tasks) and CurveBench named on 2026-05-17 (the nested-Jordan-curves benchmark where Gemini 3.1 Pro reaches 71.1% Easy and 19.1% Hard, with RLVR lifting an 8B open model from 2.8% to 33.3% on Easy). Four benchmarks in a week reporting the same structural decoupling is past the pattern threshold. The 2026-05-17 digest predicted the field would converge on representational interventions; PAGER and DiagnosticIQ confirm that the gap is post-training-learnable, but they suggest the calibration is per-domain and may not transfer.


Deep Dives


HodgeCover: harmonic-kernel obstructions in learning-free MoE compression

Every prior learning-free MoE compressor scores experts on pairwise compatibility, so every prior method is structurally blind to triples where pairwise mergeable experts form an irreducible cycle when merged together.

Source: HuggingFace Daily Papers 2026-05-18 Links: Paper · Wiki Tier: 1 (MoE compression, inference efficiency)

   Prior learning-free MoE compressors        HodgeCover (today)
   ──────────────────────────────────         ─────────────────────────
   Score experts on pairwise edges            Build 2-complex on experts:
   (KL merge barrier between i and j)           vertices = experts
   Greedy: keep low-barrier edges               edges = pairwise KL barriers
                                                faces = triplet barriers
   Blind spot: triangle of pairwise-          Hodge-decompose the edge-barrier
   mergeable experts can be a non-              signal into gradient (hierarchical),
   mergeable triple. The obstruction            curl (cyclic), and harmonic
   is invisible to any pairwise score.          (structural obstruction) components

                                              Greedy coverage of harmonic-critical
                                              edges and triangles is the selection
                                              objective. Hybrid variant adds weight
                                              pruning on survivors.

The paper makes one mathematically precise and one structurally consequential claim. The mathematical claim: the obstruction blocking aggressive learning-free MoE compression is exactly the harmonic kernel of the simplicial Laplacian on a 2-complex whose vertices are experts, whose edges carry KL merge barriers between expert pairs, and whose faces carry triplet merge barriers. Hodge decomposition splits the edge-barrier signal into three orthogonal components, and the harmonic component is the part no pairwise method can see. The structural claim: turning this diagnostic into a greedy coverage objective produces a compressor that ties prior methods at moderate compression and wins at aggressive compression. The wiki's predicted margin is 5-15%; the paper reports its headline numbers on three open-weight sparse MoE backbones.
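To make the three-way split concrete, here is a minimal pure-Python sketch of a Hodge decomposition on a toy expert complex. The four experts, the signed edge values, and the helper names (`dot`, `project`, `hodge_decompose`) are illustrative assumptions. The digest does not say how the paper orients or normalises its KL barriers, so this should be read as a sketch of the general technique and not as the authors' implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(vectors, f, tol=1e-9):
    """Orthogonally project f onto span(vectors) using Gram-Schmidt."""
    basis = []
    for v in vectors:
        w = [float(x) for x in v]
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip linearly dependent vectors
            basis.append([wi / norm for wi in w])
    out = [0.0] * len(f)
    for b in basis:
        c = dot(f, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

def hodge_decompose(n_vertices, edges, triangles, f):
    """Split edge signal f into (gradient, curl, harmonic) parts.

    edges: list of (u, v) with u < v, oriented u -> v.
    triangles: list of (a, b, c) with a < b < c (filled 2-simplices).
    """
    index = {e: i for i, e in enumerate(edges)}
    # Rows of the vertex-edge boundary map span the gradient subspace.
    grad_rows = []
    for vtx in range(n_vertices):
        row = [0.0] * len(edges)
        for (u, v), i in index.items():
            if u == vtx:
                row[i] = -1.0
            if v == vtx:
                row[i] = 1.0
        grad_rows.append(row)
    # Columns of the edge-triangle boundary map span the curl subspace.
    curl_cols = []
    for a, b, c in triangles:
        col = [0.0] * len(edges)
        col[index[(a, b)]] = 1.0
        col[index[(b, c)]] = 1.0
        col[index[(a, c)]] = -1.0
        curl_cols.append(col)
    grad = project(grad_rows, f)
    curl = project(curl_cols, f)
    # Whatever neither subspace explains is the harmonic part.
    harm = [fi - g - c for fi, g, c in zip(f, grad, curl)]
    return grad, curl, harm

# Toy complex: 4 experts, 5 edges, hypothetical signed edge values.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3)]
f = [0.2, 0.9, 0.1, -0.8, 1.0]
# Triangle (0,1,2) is filled, cycle 0-2-3 is left open: one harmonic direction.
grad, curl, harm = hodge_decompose(4, edges, [(0, 1, 2)], f)
# Fill both triangles and the harmonic component vanishes.
_, _, harm_filled = hodge_decompose(4, edges, [(0, 1, 2), (0, 2, 3)], f)
```

In the toy complex the open cycle 0-2-3 carries a nonzero harmonic component. Filling that triangle removes it, which is the sense in which the harmonic part marks a structural hole that no single edge score reveals.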

The wiki's MoE compression and routing thread now has its theoretical complement to BEAM (2026-05-16, the paper that replaced fixed top-K MoE routing with a per-token learned binary mask trained end-to-end via straight-through estimator and reported 98%+ retention at up to 85% MoE FLOP reduction with a custom vLLM kernel that exploits the binary structure). BEAM optimises the active-experts-per-token count. HodgeCover optimises the resident-experts-per-layer count. They attack orthogonal axes of MoE serving cost: BEAM reduces FLOPs, HodgeCover reduces parameter memory.

Together with MoE-muP (2026-05-17, Kurate cs.LG #13, the Vankadara et al. paper deriving the first principled scaling theory for Mixture-of-Experts as the Maximally Scale-Stable Parameterization across all five MoE axes, ai_rating 9.0/10), the three papers cover the forward direction (MoE-muP: how to choose M, Ne, K, N, L for a new MoE), the deployment direction (BEAM: how to activate experts per token), and the post-training compression direction (HodgeCover: which experts to keep). The empirical wave that needs these results is visible: Gemma 4 26B-A4B, DeepSeek V4 Pro 1.6T-A49B and Flash 284B-13B, Kimi K2.6, Qwen3.6 35B-A3B, Laguna XS.2 33B-A3B, MiMo-V2.5-Pro all ship MoE under different (M, Ne, K) tradeoffs without a principled recipe. Raschka's 2026-05-17 architecture survey (Gmail-starred, the post cataloguing Gemma 4's KV sharing plus per-layer embeddings, Laguna XS.2's layer-wise attention budgeting, ZAYA1-8B's compressed convolutional attention, and DeepSeek V4's mHC) is the empirical census. Today is the post-training compression piece.

Why it matters: The aggressive-compression frontier matters because that is where deployment under hardware constraints lives. At moderate compression every reasonable scoring method ties. At aggressive compression the pairwise-blind methods break and HodgeCover does not. The 5-15% expected gain compounds with BEAM's per-token compression and MoE-muP's pre-training scaling.

Research angle: Three open problems. (1) Apply HodgeCover to the open-model wave. Falsifiable: run HodgeCover on Gemma 4 26B-A4B, DeepSeek V4 Flash, Kimi K2.6, Qwen3.6 35B-A3B. If the aggressive-frontier win is over 5% on three of four, the harmonic obstruction is systemic and frontier MoEs have been leaving large memory savings on the table. (2) Higher-order obstructions. The paper analyses the 2-complex (triangles). Whether 3-complexes (tetrahedra) reveal additional obstructions at very wide MoE (M >= 64) is the natural extension. (3) Hodge-aware routing. Make BEAM's per-token activation aware of harmonic-critical triples. The composition would be the first MoE serving stack with structural awareness at both the per-token activation step and the per-layer expert-set step.

Full summary


AIRA-Compose and AIRA-Design: LLM agents discover scaling-frontier architectures

Eleven LLM agents on a 24-hour compute budget discover Transformer and Transformer-Mamba hybrid architectures; the best Transformer design (AIRAformer-C) scales 54% faster than Llama 3.2 on the compute-optimal frontier.

Source: HuggingFace Daily Papers 2026-05-18 Links: Paper · Wiki Tier: 2 (architecture search, agentic systems, hybrid SSM/Transformer)

   AIRA-Compose                                AIRA-Design
   ────────────────────────                    ──────────────────────────
   11 agents                                   Up to 20 agents
   24-hour compute budget                      Write attention mechanism code
   Combinatorial design space:                 Write training scripts
     Attention, MLP, Mamba                     Invent operators, not just compose

   Two-stage search:                           Direct mechanistic synthesis
     1. Million-param candidates
     2. Top designs extrapolated
        to 350M, 1B, 3B

   Output: 14 novel architectures
     AIRAformers (Transformer-based)
     AIRAhybrids (Transformer-Mamba)

   At 1B pre-trained, fixed tokens:
     AIRAformer-D: +2.4% over Llama 3.2
     AIRAhybrid-D: +3.8% over Llama 3.2

   Compute-optimal scaling slope:
     AIRAformer-C: 54% / 71% faster than Llama 3.2 / best Composer Transformer
     AIRAhybrid-C: 23% / 37% faster than modified Nemotron-2 / Composer hybrid

The headline is not the 2.4-3.8% accuracy gain. The headline is the scaling-frontier slope. A 54% faster compute-optimal slope at 1B compounds across the roughly 100x compute multiplier between research scale and frontier scale. If the slope generalises, a frontier-scale AIRAformer pre-train would land in the regime that today requires roughly 1.5x more compute on the Llama family.
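The digest does not define how the scaling slope is measured. One common reading, assumed here, is the exponent of a power-law fit of loss against compute, estimated by least squares in log-log space. The sketch below uses synthetic numbers chosen so that the candidate's exponent is 54% larger in magnitude than the baseline's; none of the values come from the paper, and `fit_loglog_slope` is a hypothetical helper.

```python
import math

def fit_loglog_slope(compute, loss):
    """Least-squares slope of log(loss) against log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Synthetic compute budgets (FLOPs) and losses; NOT numbers from the paper.
compute = [1e18, 1e19, 1e20, 1e21]
baseline = [3.0 * (c / 1e18) ** -0.050 for c in compute]
candidate = [3.0 * (c / 1e18) ** -0.077 for c in compute]

slope_base = fit_loglog_slope(compute, baseline)
slope_cand = fit_loglog_slope(compute, candidate)
# Ratio of exponents minus one: 0.54 by construction of the synthetic data.
relative_gain = slope_cand / slope_base - 1.0
```

Under this reading a steeper exponent widens the loss gap with every additional order of magnitude of compute, which is why the slope matters more than a fixed-scale accuracy delta.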

The agentic-discovery framing is the second contribution. The wiki's prior 2026-04-28 entry on the Hope architecture (the nested-learning architecture paper that beat Transformer at small scale through a hand-designed mechanism) was the most recent novel architecture in the wiki to beat Transformer. AIRA is the first wiki entry where the novel architecture beating Transformer was designed by LLM agents, not humans. This composes with the multi-agent self-evolution thread the LIFE survey (2026-05-17, the 200+ paper multi-agent survey organising work along Lay-Integrate-Find-Evolve stages) named as Stage 4. AIRA-Design is a Stage 4 system applied to the substrate of architecture itself: the agents evolve the model that will eventually be them.

Pair this against Raschka's 2026-05-17 catalog of human-designed moves shipping in the May open-model wave (Gemma 4's KV sharing, Laguna XS.2's layer-wise attention budgeting, ZAYA1-8B's compressed convolutional attention, DeepSeek V4's mHC). Every move in his catalog is convergent: independent teams arrived at similar architectures by hand. AIRA proposes that the same convergence is reachable by automated search at non-trivial scale. If the 54% slope is real at 8B and 30B, the cost of architectural discovery drops by roughly an order of magnitude for any lab with agentic compute.

Why it matters: Architecture search has been on the research backbench for two years because hand-designed architectures kept winning. If AIRA's scaling-frontier slope replicates at 8B and 30B, the backbench moves back into the central rotation. The economic logic mirrors MoE-muP for MoEs: principled discovery is cheaper than an expensive empirical sweep.

Full summary


CIPO and NudgeRL: the two ends of RLVR sparse-reward

Two RLVR papers, same day, opposite ends. NudgeRL changes what gets explored. CIPO changes how the failures from exploration get reused.

Source: HuggingFace Daily Papers 2026-05-18 Links: CIPO Paper · CIPO Wiki · NudgeRL Paper · NudgeRL Wiki Tier: 2 (RLVR, reasoning)

   Standard RLVR weakness                     CIPO                          NudgeRL
   ──────────────────────                     ────────────────              ──────────────────
   Sparse binary reward.                      Failed rollouts mined         Each rollout conditioned
   Failed rollouts: gradient signal           and converted into            on lightweight strategy-
   is uniformly negative, no                  correction-oriented           level context (e.g. "try
   credit assignment to which                 supervision. Model            case-by-case", "find an
   step caused failure.                       attends failed prefix         invariant"). Reward signal
                                              and emits correction          decomposes into inter- and
                                              continuation. No oracle,      intra-context components.
                                              no critic, no PRM.            Distillation pushes useful
                                                                            strategies into base policy.

   Brute-force fix: bigger                    Pass@K gain > pass@1          Matches vanilla GRPO at 8x
   rollout budgets.                           gain. Capacity expansion,     larger rollout budget. Strategy
                                              not probability redistr.      pool is fixed lightweight.

CIPO is the cleanest counter-evidence in the wiki to the 2026-04-21 critique of RLVR (the wiki's standing reference, the paper that argued RLVR mostly redistributes probability mass over already-discovered correct answers rather than expanding the model's intrinsic reasoning capacity). CIPO's pass@K-over-pass@1 gain is the specific empirical pattern that critique predicted should be absent. Whether the pattern survives independent replication on AIME 2026 is the load-bearing test.
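Why a pass@K gain larger than the pass@1 gain points to capacity expansion can be seen with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. The sketch below is an illustration with made-up counts; the digest does not say which estimator or sample sizes CIPO uses, and `gains` is a hypothetical helper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def gains(n, c_before, c_after, k):
    """Return (pass@1 gain, pass@k gain) for one problem."""
    g1 = pass_at_k(n, c_after, 1) - pass_at_k(n, c_before, 1)
    gk = pass_at_k(n, c_after, k) - pass_at_k(n, c_before, k)
    return g1, gk

# Hypothetical counts out of 16 samples per problem.
# Redistribution: an already-solvable problem becomes more reliable.
sharpen = gains(16, 4, 12, 8)
# Expansion: a previously unsolved problem gets its first success.
expand = gains(16, 0, 1, 8)
```

In the redistribution case pass@1 moves a lot while pass@8 barely moves, because the problem was already within reach of eight samples. In the expansion case the order flips: pass@1 rises by 0.0625 and pass@8 rises by 0.5.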

NudgeRL's 8x rollout-budget equivalence is the efficiency claim. At matched final accuracy, strategy-conditioned exploration needs roughly 8x fewer rollouts than brute-force rollout scaling. The strategy pool is hand-crafted lightweight scaffolds, not oracle hints. The deployed inference-time policy is the distilled base, not the strategy-conditioned model.
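The digest does not give the formula for the inter- and intra-context reward components. One natural reading, assumed here, is an ANOVA-style split: the inter-context term is how much a strategy's mean reward differs from the global mean, and the intra-context term is how much a rollout differs from its own strategy's mean. The function name and the "no-hint" context are hypothetical.

```python
from collections import defaultdict

def decompose_rewards(rollouts):
    """rollouts: list of (context, reward).

    Returns a list of (context, inter, intra) where
    inter + intra == reward - global_mean for every rollout.
    """
    by_ctx = defaultdict(list)
    for ctx, r in rollouts:
        by_ctx[ctx].append(r)
    global_mean = sum(r for _, r in rollouts) / len(rollouts)
    ctx_mean = {ctx: sum(rs) / len(rs) for ctx, rs in by_ctx.items()}
    parts = []
    for ctx, r in rollouts:
        inter = ctx_mean[ctx] - global_mean  # how useful the strategy is
        intra = r - ctx_mean[ctx]            # how good this rollout is within it
        parts.append((ctx, inter, intra))
    return parts

# Binary verifier rewards for six rollouts under three strategy contexts.
rollouts = [
    ("case-by-case", 1.0), ("case-by-case", 0.0),
    ("find-invariant", 1.0), ("find-invariant", 1.0),
    ("no-hint", 0.0), ("no-hint", 0.0),
]
parts = decompose_rewards(rollouts)
```

Under this assumption a context whose rollouts all succeed still carries signal through the inter-context term, even though its intra-context term is zero for every rollout.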

The natural composition: NudgeRL diversifies the rollouts, CIPO recycles the failures. The first paper to compose them is one experiment away. Both inherit the standard RLVR reward-hacking risk that the wiki has been tracking via the Kurate cs.LG #10 paper from this week (LLMs Gaming Verifiers, ai_rating 6.8/10, showing RLVR pipelines can be reward-hacked when the policy learns to game the verifier rather than solve the task). Whether more aggressive use of verifier signal (CIPO) and more diverse exploration (NudgeRL) amplifies or dampens reward hacking is the next falsifier.

CIPO full summary · NudgeRL full summary


Solvita: agentic evolution for code with an adversarial Hacker

Four-agent code framework where the Hacker constructs adversarial tests that update routing weights. Nearly doubles single-pass accuracy.

Source: HuggingFace Daily Papers 2026-05-18 Links: Paper · Wiki Tier: 2 (agentic systems, self-evolution, code)

          Planner ◄──┐                         Outcome signals:
            │        │                         ─────────────────
            ▼        │                         Pass/fail verdict ───► Planner, Solver
          Solver ────┼──► Verifier             Test cert. quality ───► Oracle
            │        │                         Adversarial vuln. ───► Hacker, Solver
            ▼        │
          Oracle ────┘
            │
            ▼
          Hacker (constructs adversarial tests)
            │
            └─────► RL updates on each agent's graph-structured knowledge network

The Hacker is the structural innovation. Most prior multi-agent code systems used a critic or verifier that grades a proposed solution post hoc. Solvita's Hacker actively constructs adversarial test cases that target the Solver's likely failure modes, and the resulting attack patterns are stored as RL signal on the Hacker's network. The base LLM is frozen throughout; all learning happens in the per-agent graph-structured knowledge networks.

This is the most concrete Stage 4 system in the wiki under the LIFE survey taxonomy (the 200+ paper multi-agent survey from 2026-05-17 that organised the field along Lay capability foundation, Integrate via collaboration, Find faults via attribution, Evolve via self-improvement). Solvita integrates four specialised agents (Stage 2), uses the Hacker to attribute failures (Stage 3), and evolves the routing graphs from those failures (Stage 4). The closed loop is functional and not theoretical.

Compare against Conductor (2026-05-11, Sakana AI's ICLR 2026 paper that trained a 7B RL orchestrator to invoke frontier models GPT-5, Claude Sonnet 4, Gemini 2.5 Pro at roughly 3 calls per question and beat every individual model on GPQA-D / LiveCodeBench / AIME25). Conductor and Solvita are both routing systems, both trained by RL, both target frontier-model performance with small-model orchestration. Conductor routes between heterogeneous frontier models; Solvita routes between homogeneous agent roles. Conductor's 3-calls-per-question budget is roughly equivalent to Solvita's four-agent overhead. Whether per-role specialisation (Solvita) or per-model routing (Conductor) is the better economic structure is the cross-paper question.

Full summary


PAGER and DiagnosticIQ: deployment-calibration is now a structural axis, not a model axis

88% action-type accuracy versus under 6% task success in GUI agents. Top three LLMs within one Macro point on industrial-rule benchmarks but 49-63% original-answer-rate under condition inversion. Two benchmarks today, same pattern.

Source: HuggingFace Daily Papers 2026-05-18 Links: PAGER Paper · PAGER Wiki · DiagnosticIQ Paper · DiagnosticIQ Wiki Tier: 2 (deployment, responsible AI, agentic systems)

PAGER reframes the GUI-agent metric. The dominant region-tolerant paradigm scores agents on action-type accuracy: did the click land inside the right component? On precision-sensitive geometric tasks, where actions must land on specific points in continuous canvas space, this is the wrong metric. PAGE Bench (4,906 problems, 224K process-supervised pixel-level actions) reports that general multimodal models exceed 88% action-type accuracy but stay under 6% task success. The Semantic-Execution Gap is between knowing what action to take and executing it precisely enough that downstream geometry-dependent steps still work. PAGER narrows the gap, scoring 4.1x the strongest baseline, via dependency-structured planning plus precision-aligned RL with state-conditioned geometric feedback.
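The difference between the two metrics can be sketched with a toy scorer. The step format, the pixel tolerance `tol_px`, and the rule that a task succeeds only if every step is precise are assumptions for illustration, not PAGE Bench's actual scoring protocol.

```python
import math

def score_episode(predicted, target, tol_px=2.0):
    """Return (action_type_accuracy, task_success) for one episode.

    Each step is (action_type, x, y). tol_px is a hypothetical tolerance.
    """
    type_hits = 0
    all_precise = True
    for (p_type, px, py), (t_type, tx, ty) in zip(predicted, target):
        type_ok = p_type == t_type
        type_hits += int(type_ok)
        # A step only counts as precise if the type matches AND the point is close.
        precise = type_ok and math.hypot(px - tx, py - ty) <= tol_px
        all_precise = all_precise and precise
    return type_hits / len(target), all_precise

# Hypothetical three-step geometric construction.
target = [("click", 100, 100), ("drag", 150, 100), ("click", 150, 180)]
predicted = [("click", 101, 100), ("drag", 158, 104), ("click", 150, 181)]
type_acc, success = score_episode(predicted, target)
```

Here every action type is right, so action-type accuracy is 1.0, but the second step lands about 9 pixels off and the episode fails. One imprecise step is enough to break a geometry-dependent chain.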

DiagnosticIQ runs the same diagnostic at a different layer. On 6,690 industrial-maintenance multiple-choice questions, the top three frontier LLMs land within one Macro point. The capability axis has flattened. The brittleness axis has not. Every model loses 13-60% relative accuracy under distractor expansion (DiagnosticIQ Pro). Under condition inversion (DiagnosticIQ Aug), 49-63% of frontier-model answers persist as the original answer. In other words, the model recognises the surface pattern of the rule rather than its logical content.
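The original-answer rate can be computed as below. This is a minimal sketch that assumes each item is asked twice, once as written and once with the rule's condition inverted, and that a correct reasoner should change its answer on the inverted version. The record layout is hypothetical, not DiagnosticIQ's data format.

```python
def original_answer_rate(records):
    """Fraction of items where the answer to the inverted question
    is the same option the model picked for the original question."""
    persisted = sum(1 for r in records if r["inverted"] == r["original"])
    return persisted / len(records)

# Hypothetical model answers on four question pairs.
records = [
    {"original": "B", "inverted": "B"},  # answer did not move with the rule
    {"original": "A", "inverted": "C"},
    {"original": "D", "inverted": "D"},  # answer did not move with the rule
    {"original": "C", "inverted": "A"},
]
rate = original_answer_rate(records)
```

A high rate means the model is keying on the surface form of the rule, which is the brittleness the benchmark reports at 49-63% for frontier models.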

Read these two papers next to WildClawBench (2026-05-15, the agent benchmark that measured an 18-point spread between best and worst agent harness running the same model on the same 60 long-horizon tasks) and CurveBench (2026-05-17, the nested-Jordan-curves benchmark where Gemini 3.1 Pro reaches 71.1% Easy and 19.1% Hard, with RLVR lifting Qwen3-VL-8B from 2.8% to 33.3% on Easy). Four benchmarks in five days identify the same structural pattern: headline accuracy and deployment-relevant capability have decoupled along a structural axis. The pattern threshold of three was crossed earlier in the week; today brings the count to four. The 2026-05-17 digest predicted the field would converge on representational interventions. PAGER and DiagnosticIQ confirm the gap is post-training-learnable (PAGER closes it via state-conditioned RL, similar in spirit to CurveBench's RLVR result) but suggest the calibration is per-domain.

PAGER full summary · DiagnosticIQ full summary


FashionChameleon: KV cache rescheduling as a fourth axis

Training-Free KV Cache Rescheduling for interactive multi-condition video. Composes garment KV refresh, historical KV withdraw, reference KV disentangle.

Source: HuggingFace Daily Papers 2026-05-18 Links: Paper · Wiki Tier: 1 (KV cache, video generation)

The wiki's KV cache thread has been mapping three orthogonal compression and routing axes. Make Each Token Count (2026-05-12, the paper that scored each cached entry with a small projection and showed selective retention can surpass the full cache) introduced learned eviction. Forcing-KV (2026-05-15, the paper that found attention heads in autoregressive video diffusion cluster into static and dynamic functional roles, where static roles tolerate aggressive compression and dynamic roles do not) introduced head-role compression. Gemma 4's KV sharing (surveyed in Raschka's 2026-05-17 post, where later layers reuse earlier layers' K and V projections, halving cache size at 128K context) introduced architectural sharing.

FashionChameleon adds a fourth axis: content-aware KV cache rescheduling. When a user switches a garment mid-video, the cache must simultaneously refresh entries pertaining to the new garment, withdraw entries that encoded the old garment, and disentangle entries that came from reference inputs versus generated history. This is not eviction (which entry to drop), not compression (how many bits per entry), and not sharing (which layers reuse). It is the cache-as-state-machine view: which entries belong to which causal-conditioning thread.
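The cache-as-state-machine view can be sketched as a cache whose entries are tagged with the conditioning thread they serve. The class, its method semantics, and the string stand-ins for K/V tensors are illustrative assumptions about what refresh, withdraw, and disentangle mean; the digest does not describe FashionChameleon's implementation at this level.

```python
from dataclasses import dataclass, field

@dataclass
class KVEntry:
    key: str     # stand-in for the cached K/V tensors
    thread: str  # which conditioning thread the entry serves
    source: str  # "reference" or "generated"

@dataclass
class ThreadedKVCache:
    entries: list = field(default_factory=list)

    def add(self, key, thread, source):
        self.entries.append(KVEntry(key, thread, source))

    def withdraw(self, thread):
        """Drop generated history that encoded an outdated condition."""
        self.entries = [e for e in self.entries
                        if not (e.thread == thread and e.source == "generated")]

    def refresh(self, thread, new_keys):
        """Replace a thread's reference entries with freshly encoded ones."""
        self.entries = [e for e in self.entries
                        if not (e.thread == thread and e.source == "reference")]
        for k in new_keys:
            self.add(k, thread, "reference")

    def disentangle(self):
        """Split the cache into reference-derived and generated entries."""
        ref = [e for e in self.entries if e.source == "reference"]
        gen = [e for e in self.entries if e.source == "generated"]
        return ref, gen

cache = ThreadedKVCache()
cache.add("garment_red_ref", "garment", "reference")
cache.add("frame_1_garment", "garment", "generated")
cache.add("person_ref", "identity", "reference")
cache.add("frame_1_pose", "identity", "generated")

# The user swaps the garment mid-video.
cache.withdraw("garment")
cache.refresh("garment", ["garment_blue_ref"])
ref, gen = cache.disentangle()
```

After the swap, the identity thread is untouched while the garment thread holds only the new reference. That per-thread bookkeeping is what separates rescheduling from eviction, compression, or sharing.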

The four axes compose. The 2026-05-17 digest projected that a stack combining learned eviction, head-role compression, and architectural sharing would multiply to roughly 8x memory reduction. Adding rescheduling extends the stack to multi-conditioning scenarios. The wiki has no entry yet for KV cache rescheduling in text. FashionChameleon is the candidate template.

Why it matters: Multi-conditioning workloads (multi-document RAG, multi-tool agentic conversations, multi-reference video) are now the dominant deployment shape. A cache management discipline that explicitly tracks which conditioning thread each entry serves is what those workloads need.

Research angle: Whether the three-mechanism rescheduling (refresh / withdraw / disentangle) generalises to text multi-conditioning is the natural test. Falsifiable: implement the same three operations in a multi-document RAG harness and measure cache efficiency versus a single-bucket baseline.

Full summary


Industry Pulse


Connecting the Dots

   MoE design surface, all three directions in one week

   MoE-muP (2026-05-17, theory)            HodgeCover (2026-05-18, compression)
   ─────────────────────────────           ─────────────────────────────────
   Closed-form MSSP across                 Harmonic-kernel obstruction in
   M, Ne, K, N, L axes for                 simplicial Laplacian on expert
   scale-stable MoE pre-training           2-complex. Greedy harmonic
   (Vankadara et al., Kurate               coverage selects which experts
   cs.LG #13 ai_rating 9.0/10)             to keep at aggressive compression
                  │                                          │
                  ▼                                          ▼
   BEAM (2026-05-16, per-token activation)
   Binary expert-activation masks trained end-to-end via straight-through estimator.
   Decides which experts to activate per token. 98%+ retention at 85% FLOP reduction.
   ────────────────────────────────────────────────────────────────────────────────
   Composition not yet written: MSSP-scaled MoE pre-train + BEAM per-token activation
   + HodgeCover post-training compression = full three-axis MoE serving stack.

   RLVR sparse-reward surface, both ends in one day

   NudgeRL (today)                         CIPO (today)
   ────────────────────                    ──────────────────
   Changes WHAT gets explored.             Changes HOW failures are reused.
   Strategy-level context per              Failed trajectories converted into
   rollout, distillation back to           correction-oriented supervision.
   base policy. Matches GRPO run at        Pass@K gain > pass@1 gain.
   an 8x larger rollout budget.            Counters RLVR weak-supervision critique
                                           (2026-04-21) directly.
                  │                                          │
                  └─────────── compose: nudge exploration, recycle failures ──────┘
                                Single experiment away. Diagnostic: pass@K gain
                                under composition versus sum of individual gains.

   Deployment-calibration gap, fourth confirmation

   WildClawBench (2026-05-15)              CurveBench (2026-05-17)
   ─────────────────────────               ────────────────────────
   18-point harness spread on              Gemini 3.1 Pro 71.1% / 19.1%.
   60 long-horizon tasks                   RLVR on 8B model: 2.8% -> 33.3%
                  │                                          │
                  └──────────► PAGER (today) + DiagnosticIQ (today) ◄──────────
                  88% action-type / <6% task success on GUI geometry
                  Top-3 LLMs within 1 Macro point, 49-63% original-answer
                  rate under condition inversion on industrial rules
                  ────────────────────────────────────────────────────
                  Four benchmarks in five days. The structural decoupling
                  between headline accuracy and deployment-relevant capability
                  is past pattern threshold. RLVR closes it per-domain.

Cross-paper thread #1: the MoE design surface is now mapped end-to-end. This is the most concentrated week the wiki has tracked on MoE scaling. MoE-muP (2026-05-17, Vankadara et al. at Gatsby UCL plus Amazon plus Tübingen, Kurate cs.LG #13 ai_rating 9.0/10, the first principled scaling theory for Mixture-of-Experts deriving closed-form MSSP prescriptions for initialization, learning rate, weight decay, and routing temperature across the five MoE axes M, Ne, K, N, L using Dynamical Mean Field Theory) gives the forward direction: how to scale a new MoE. BEAM (2026-05-16, the paper that replaced fixed top-K routing with a per-token learned binary mask trained end-to-end via straight-through estimator, achieving 98%+ retention at up to 85% FLOP reduction with a custom vLLM kernel for the binary structure) gives the per-token activation direction: which experts to run for this token. HodgeCover (today, the paper that models the obstruction blocking aggressive learning-free MoE compression as the harmonic kernel of the simplicial Laplacian on a 2-complex of experts and uses Hodge decomposition to identify the harmonic-critical edges and triangles) gives the post-training compression direction: which experts to keep. Three orthogonal MoE knobs, three principled methods, all in one week. The empirical wave that needs these is visible: Gemma 4 26B-A4B, DeepSeek V4 Pro 1.6T-A49B and Flash 284B-13B, Kimi K2.6, Qwen3.6 35B-A3B, Laguna XS.2 33B-A3B, MiMo-V2.5-Pro. Six frontier-tier open MoE families (seven models, counting both DeepSeek V4 variants) are now in the wild, all designed by hand without a principled recipe. The next-generation training pipeline that combines MoE-muP scaling, BEAM activation, and HodgeCover compression is a single integration paper away.

Cross-paper thread #2: the RLVR sparse-reward weakness gets both an exploration fix and a credit-assignment fix on the same day. RLVR Weak Supervision (2026-04-21, the wiki's standing reference paper that argued RLVR mostly redistributes probability mass over already-discovered correct answers rather than expanding the model's intrinsic reasoning capacity) has been the open critique the wiki tracked for a month. CIPO (today, the paper that mines on-policy failed trajectories and converts them into correction-oriented supervision by pairing each failed prefix with a correct continuation derived from the same model's adjacent success rollouts) and NudgeRL (today, the paper that conditions each rollout on a lightweight strategy-level context drawn from a fixed pool and decomposes the reward into inter- and intra-context components, with a distillation pass that pushes useful strategies into the base policy, matching vanilla GRPO at 8x larger rollout budgets) are the two complementary answers. CIPO's pass@K-over-pass@1 gain is the cleanest empirical counter to the RLVR Weak Supervision critique to date. The compose-CIPO-with-NudgeRL experiment is one paper away.

Cross-paper thread #3: the deployment-calibration gap crosses four confirmations. WildClawBench (2026-05-15) was the first. CurveBench (2026-05-17) was the second. PAGER and DiagnosticIQ today bring the count to four benchmarks in five days reporting the same pattern: headline accuracy and deployment-relevant capability have decoupled along a structural axis. The decoupling is not random. It is consistent across GUI agentic execution (PAGER), industrial-rule reasoning (DiagnosticIQ), agent harness selection (WildClawBench), and visual structural reasoning (CurveBench). The 2026-05-17 digest predicted the field would converge on representational interventions. PAGER and DiagnosticIQ confirm the gap is post-training-learnable; PAGER closes its gap with precision-aligned RL and state-conditioned geometric feedback, similar in spirit to CurveBench's RLVR result. The remaining open question is whether a single representational intervention closes the gap across all four domains or whether each requires per-domain post-training.

Cross-paper thread #4: agentic self-evolution at the architecture level. AIRA-Compose and AIRA-Design (today) are the first wiki entries where LLM agents discover neural architectures that scale faster than hand-designed baselines at non-trivial parameter count. This composes with the multi-agent self-evolution cluster that LIFE (2026-05-17, the 200+ paper multi-agent survey organising work along Lay-Integrate-Find-Evolve stages) unified yesterday. EvolveMem (2026-05-15, retrieval-configuration self-evolution), Orchard (2026-05-15, credit-assignment SFT), SDAR (2026-05-15, sigmoid-gated on-policy self-distillation), EvoEnv (2026-05-15, environment synthesis), FrontierSmith (2026-05-16, open-ended problem generation), Sylph AI (2026-05-16 social-stream, harness construction), Solvita (today, agent-role routing graphs) all evolve different substrates. AIRA evolves the substrate of architecture itself. The cluster is now eight papers; if the trend continues, agentic self-evolution is the dominant Stage 4 form by July.

Cross-paper thread #5: KV cache rescheduling joins the cache-management taxonomy. FashionChameleon (today) extends the wiki's KV cache thread from three orthogonal compression-and-routing axes (learned eviction via Make Each Token Count 2026-05-12, head-role compression via Forcing-KV 2026-05-15, architectural sharing via Gemma 4 surveyed by Raschka 2026-05-17) to a fourth: content-aware rescheduling for multi-conditioning workloads. The four-axis stack is now mapped. The text-side rescheduling generalisation is the missing piece.


Worth Watching


Quick Hits

Flash-GRPO (arXiv 2605.15980). Single-step training framework for aligning video diffusion models via Group Relative Policy Optimization (GRPO, the lightweight on-policy RL recipe). Two contributions: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency; temporal gradient rectification neutralises the time-dependent scaling factor that produced inconsistent gradient magnitudes across timesteps. Validated at 1.3B to 14B parameters. Tier 3 video alignment; structurally interesting as the video-diffusion analog of the RLVR variance-control work the wiki has tracked through Balanced Aggregation GRPO (2026-05-09).

InsightTok (arXiv 2605.14333). Discrete visual tokeniser with localised content-aware perceptual losses for text and face fidelity in autoregressive image generation. 16k codebook, 16x downsampling. Tier 3 image generation.

DepthVLM (arXiv 2605.15876). Attaches a lightweight depth head to a VLM's LLM backbone and trains under unified vision-text supervision in a two-stage schedule, producing full-resolution depth maps alongside language outputs in a single forward pass. Tier 3 multimodal; one of the cleaner unified-vision-language extensions.

Look Before You Leap (arXiv 2605.16143). Already cross-referenced in the RLVR thread. Introduces Exploration Checkpoint Coverage as a verifiable exploration metric independent of task reward, then trains agents with interleaved task and exploration rollouts under the Explore-then-Act paradigm. Tier 2 agentic systems. → Full summary

MMSkills (arXiv 2605.13527). Multimodal skill packages (text procedure plus state cards plus multi-view keyframes) for visual agents, with a branch-loaded inference pattern that keeps heavy multimodal evidence on a side branch. Tier 2. → Full summary

DexJoCo, FFAvatar, OmniHumanoid, ReactiveGWM, WorldAct. Robotics dexterous manipulation benchmark, few-shot 3D Gaussian avatar reconstruction, cross-embodiment humanoid video generation, reactive game-NPC world model, static-to-interactive 3D world activation. Tier 4 across the board. Skip.


Sources ingested today: HF (18 papers; 10 substantive at Tier 1 or Tier 2, 3 at Tier 3, 5 at Tier 4 skipped), Gmail (3 starred: Semiconductor Newsletter Week 20, DAIR.AI weekly top papers, Gary Marcus video roundup), RSS (no new file for 2026-05-18; latest is 2026-05-17, content already integrated into yesterday's digest), Twitter morning slot (2 sparse AI-handle tweets from @MillionInt, 0 retweets, 0 articles), Kurate cs.AI plus cs.LG plus rising-authors weekly leaderboards (no rising-author additions, no HF cross-source confirmation), Reddit (8 subs scraped; r/LocalLLaMA and r/CUDA carried substantive Tier 1 signal on llama.cpp MTP follow-up and Fournex bottleneck analyser, r/reinforcementlearning had one GRPO explanation thread, others empty), parallel Daily-Digest (no file for 2026-05-18 in /Users/amitsinghbhatti/Documents/Claude/Projects/Daily-Digest/). Wiki pages updated: 10 new summary pages (HodgeCover, AIRA, CIPO, NudgeRL, Solvita, Look Before You Leap, MMSkills, PAGER, DiagnosticIQ, FashionChameleon).