cere-bro | 2026-05-18
Today's batch is the deployment-and-design Monday after a theory-heavy weekend. HodgeCover supplies the mathematically precise reason every prior learning-free MoE compressor caps out at moderate compression. AIRA shows LLM agents can autonomously discover Transformer-Mamba hybrid architectures that scale 54% faster than Llama 3.2. Two RLVR papers, CIPO and NudgeRL, attack the same sparse-reward weakness from opposite sides. The Twitter morning slot is empty for the second consecutive Monday, but Reddit and yesterday's Gmail-starred DAIR.AI weekly fill the gap with confirmation of last week's threads.
TL;DR
- HodgeCover (arXiv 2605.13997). Names a structural blind spot in every prior learning-free Mixture-of-Experts (MoE) compressor: three experts can be pairwise mergeable yet form an irreducible cycle when merged together. Models the obstruction as the harmonic kernel of the simplicial Laplacian on a 2-complex of experts. Greedy harmonic-coverage selection wins on aggressive compression frontiers where pairwise-blind methods break.
- AIRA-Compose / AIRA-Design (arXiv 2605.15871). 11 LLM agents in a 24-hour search over Attention, MLP, and Mamba primitives produce 14 novel architectures. At 1B, pre-trained on a fixed token budget, AIRAformer-D beats Llama 3.2 by 2.4% and AIRAhybrid-D beats it by 3.8%. AIRAformer-C's compute-optimal scaling slope is 54% steeper than Llama 3.2's and 71% steeper than the best Composer-found Transformer's.
- CIPO (arXiv 2605.14539) and NudgeRL (arXiv 2605.15726). Two RLVR (Reinforcement Learning with Verifiable Rewards, the post-training paradigm where the reward signal is a programmatic verifier such as unit tests or exact-match) papers on the same day attacking the same sparse-reward weakness from opposite ends. CIPO converts on-policy failures into correction-oriented supervision, lifting pass@K more than pass@1. NudgeRL conditions rollouts on lightweight strategy contexts, matching vanilla GRPO run with an 8x larger rollout budget.
- Solvita (arXiv 2605.15301). Four-agent code framework (Planner, Solver, Oracle, Hacker) with trainable graph-structured knowledge networks. The Hacker constructs adversarial test cases that update the routing weights. Nearly doubles single-pass accuracy on CodeContests / APPS / live Codeforces. First wiki entry where adversarial-test construction is the closed-loop signal source for agent self-evolution.
- DiagnosticIQ (arXiv 2605.08614). 6,690-question benchmark for the rule-to-action step in industrial maintenance. The frontier has closed: top three LLMs sit within one Macro point. Brittleness is the discriminating axis. Under condition inversion, frontier models still pick the original answer 49-63% of the time.
- PAGER (arXiv 2605.15963). Identifies the Semantic-Execution Gap in GUI agents: 88% action-type accuracy but under 6% task success on precision-sensitive geometric tasks. PAGER narrows the gap, reaching 4.1x the strongest general baseline, via dependency-structured planning plus precision-aligned RL.
The Big Picture
The wiki's running MoE thread now has both forward and inverse directions in one week. On 2026-05-17 the Kurate cs.LG #13 paper (MoE-muP from Vankadara et al. at Gatsby UCL, the first principled scaling theory for Mixture-of-Experts deriving closed-form prescriptions for initialization, learning rate, weight decay, and routing temperature across the five MoE axes: number of experts M, expert width Ne, routing sparsity K, network width N, depth L) gave the forward recipe: how to scale a new MoE without sweeping. Today, HodgeCover gives the inverse: how to compress an existing MoE without retraining. The mathematical machinery is different (Dynamical Mean Field Theory for MoE-muP, simplicial Laplacians and Hodge decomposition for HodgeCover) but the design surface is shared. Both papers operate on the five-axis MoE space the wiki's llm-routing concept page and the BEAM / DLR / CaRE / RouteProfile cluster have been mapping for the last fortnight. The composition that has not yet been written: a frontier MoE pre-trained under MoE-muP's MSSP recipe, deployed under BEAM's per-token binary mask (2026-05-16, the paper that replaced fixed top-K MoE routing with a per-token learned binary mask trained end-to-end via straight-through estimator, achieving 98%+ retention at up to 85% FLOP reduction), then post-training compressed under HodgeCover's harmonic-coverage objective. Three orthogonal MoE knobs, three principled methods, one stack.
The second thread is RLVR moving from "what does the reward signal say" to "what should the policy do with each piece of that signal." CIPO and NudgeRL on the same day approach this from opposite directions. NudgeRL changes what gets explored by conditioning each rollout on a lightweight strategy-level context. CIPO changes how the failed rollouts get reused by converting them into correction-oriented supervision. Both attack the central RLVR weakness named by the 2026-04-21 paper on RLVR weak supervision (the paper that argued RLVR mostly redistributes probability mass over already-discovered correct answers rather than expanding intrinsic reasoning capacity). CIPO's pass@K-over-pass@1 gain is the cleanest counter-evidence to that critique to date. Whether the pass@K gain replicates on independent benchmarks like AIME 2026 is the load-bearing test for the next 30-60 days.
The third thread is the deployment-calibration gap surfacing in two different domains the same day. DiagnosticIQ on the industrial-maintenance side reports the frontier-LLM cluster within one Macro point, with brittleness as the discriminating axis: 49-63% of frontier-model answers persist under condition inversion. PAGER on the GUI side reports 88% action-type accuracy versus under 6% task success on precision-sensitive geometric tasks. Both papers identify that headline benchmark accuracy and deployment-relevant capability have decoupled along a structural axis. This is the same pattern WildClawBench named on 2026-05-15 (the agent benchmark that measured an 18-point spread between best and worst agent harness running the same model on the same 60 long-horizon tasks) and CurveBench named on 2026-05-17 (the nested-Jordan-curves benchmark where Gemini 3.1 Pro reaches 71.1% Easy and 19.1% Hard, with RLVR lifting an 8B open model from 2.8% to 33.3% on Easy). Four benchmarks in a week reporting the same structural decoupling is past the pattern threshold. The 2026-05-17 digest predicted the field would converge on representational interventions; PAGER and DiagnosticIQ confirm that the gap is post-training-learnable but suggest the calibration is per-domain. Whether it transfers across domains is still open.
Deep Dives
HodgeCover: harmonic-kernel obstructions in learning-free MoE compression
Every prior learning-free MoE compressor scores experts on pairwise compatibility, so every prior method is structurally blind to triples where pairwise mergeable experts form an irreducible cycle when merged together.
Source: HuggingFace Daily Papers 2026-05-18 Links: Paper · Wiki Tier: 1 (MoE compression, inference efficiency)
Prior learning-free MoE compressors     HodgeCover (today)
───────────────────────────────────     ─────────────────────────────────────
Score experts on pairwise edges         Build 2-complex on experts:
(KL merge barrier between i and j)        vertices = experts
Greedy: keep low-barrier edges            edges    = pairwise KL barriers
                                          faces    = triplet barriers
Blind spot: triangle of pairwise-       Hodge-decompose the edge-barrier
mergeable experts can be a non-         signal into gradient (hierarchical),
mergeable triple. The obstruction       curl (cyclic), and harmonic
is invisible to any pairwise score.     (structural obstruction) components
                                        Greedy coverage of harmonic-critical
                                        edges and triangles is the selection
                                        objective. Hybrid variant adds weight
                                        pruning on survivors.
The paper makes one mathematically precise claim and one structurally consequential claim. The mathematical claim: the obstruction blocking aggressive learning-free MoE compression is exactly the harmonic kernel of the simplicial Laplacian on a 2-complex whose vertices are experts, whose edges carry KL merge barriers between expert pairs, and whose faces carry triplet merge barriers. Hodge decomposition splits the edge-barrier signal into three orthogonal components, and the harmonic component is the part no pairwise method can see. The structural claim: turning this diagnostic into a greedy coverage objective produces a compressor that ties prior methods at moderate compression and wins at aggressive compression. The 5-15% margin quoted in this digest is the wiki's prediction, not a figure from the paper; the paper reports its headline numbers on three open-weight sparse MoE backbones.
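The decomposition can be illustrated on a three-expert toy complex. The sketch below is generic Hodge decomposition in NumPy, not the paper's code: the function name, the orientation convention, and the edge values are assumptions, and how HodgeCover turns KL barriers into an oriented edge signal is not described here. It shows that a cyclic edge signal is harmonic when the triangle is left unfilled and becomes pure curl once the triangle is added as a face.

```python
import numpy as np

def hodge_decompose(n_vertices, edges, triangles, edge_signal):
    """Split an oriented edge signal into gradient, curl, and harmonic parts."""
    edge_index = {e: k for k, e in enumerate(edges)}
    # B1: vertex-edge incidence; edge (i, j) with i < j points from i to j.
    B1 = np.zeros((n_vertices, len(edges)))
    for k, (i, j) in enumerate(edges):
        B1[i, k], B1[j, k] = -1.0, 1.0
    # B2: edge-triangle incidence; boundary of (i, j, k) = (j, k) - (i, k) + (i, j).
    B2 = np.zeros((len(edges), len(triangles)))
    for t, (i, j, k) in enumerate(triangles):
        B2[edge_index[(j, k)], t] = 1.0
        B2[edge_index[(i, k)], t] = -1.0
        B2[edge_index[(i, j)], t] = 1.0
    f = np.asarray(edge_signal, dtype=float)
    # Orthogonal projections onto im(B1^T) and im(B2) via the pseudoinverse.
    grad = B1.T @ np.linalg.pinv(B1.T) @ f
    curl = B2 @ np.linalg.pinv(B2) @ f if triangles else np.zeros_like(f)
    harmonic = f - grad - curl  # what remains lies in the kernel of the Laplacian
    return grad, curl, harmonic

edges = [(0, 1), (0, 2), (1, 2)]
cycle = [1.0, -1.0, 1.0]  # hypothetical signal circulating 0 -> 1 -> 2 -> 0

hollow = hodge_decompose(3, edges, [], cycle)           # triangle not filled in
filled = hodge_decompose(3, edges, [(0, 1, 2)], cycle)  # triangle added as a face
```

In the hollow case the whole signal lands in the harmonic part, which is the component a purely edge-based score has no way to attribute to any single pair.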
The wiki's MoE compression and routing thread now has its theoretical complement to BEAM (2026-05-16, the paper that replaced fixed top-K MoE routing with a per-token learned binary mask trained end-to-end via straight-through estimator and reported 98%+ retention at up to 85% MoE FLOP reduction with a custom vLLM kernel that exploits the binary structure). BEAM optimises the active-experts-per-token count. HodgeCover optimises the resident-experts-per-layer count. They attack orthogonal axes of MoE serving cost: BEAM reduces FLOPs, HodgeCover reduces parameter memory.
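The two axes can be separated with back-of-envelope arithmetic. The sketch below is an illustration under simplifying assumptions (one MoE layer, router and attention ignored, about 2 FLOPs per parameter per token, made-up expert counts and sizes); none of the numbers come from BEAM or HodgeCover.

```python
def moe_layer_cost(resident_experts, active_experts, params_per_expert, bytes_per_param=2):
    """Return (resident_bytes, flops_per_token) for one MoE layer.

    Memory scales with the experts kept in the layer; per-token FLOPs scale
    with the experts actually run for that token.
    """
    resident_bytes = resident_experts * params_per_expert * bytes_per_param
    flops_per_token = 2 * active_experts * params_per_expert
    return resident_bytes, flops_per_token

# Hypothetical layer: 64 resident experts, 8 active per token, 10M params each.
base = moe_layer_cost(resident_experts=64, active_experts=8, params_per_expert=10_000_000)
fewer_active = moe_layer_cost(64, 4, 10_000_000)    # BEAM-style: fewer experts run per token
fewer_resident = moe_layer_cost(32, 8, 10_000_000)  # HodgeCover-style: fewer experts kept
```

Halving the active count leaves memory unchanged, and halving the resident count leaves per-token FLOPs unchanged, which is the sense in which the two methods are orthogonal.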
Together with MoE-muP (2026-05-17, Kurate cs.LG #13, the Vankadara et al. paper deriving the first principled scaling theory for Mixture-of-Experts as the Maximally Scale-Stable Parameterization across all five MoE axes, ai_rating 9.0/10), the three papers cover the forward direction (MoE-muP: how to choose M, Ne, K, N, L for a new MoE), the deployment direction (BEAM: how to activate experts per token), and the post-training compression direction (HodgeCover: which experts to keep). The empirical wave that needs these results is visible: Gemma 4 26B-A4B, DeepSeek V4 Pro 1.6T-A49B and Flash 284B-13B, Kimi K2.6, Qwen3.6 35B-A3B, Laguna XS.2 33B-A3B, MiMo-V2.5-Pro all ship MoE under different (M, Ne, K) tradeoffs without a principled recipe. Raschka's 2026-05-17 architecture survey (Gmail-starred, the post cataloguing Gemma 4's KV sharing plus per-layer embeddings, Laguna XS.2's layer-wise attention budgeting, ZAYA1-8B's compressed convolutional attention, and DeepSeek V4's mHC) is the empirical census. Today is the post-training compression piece.
Why it matters: The aggressive-compression frontier matters because that is where deployment under hardware constraints lives. At moderate compression every reasonable scoring method ties. At aggressive compression the pairwise-blind methods break and HodgeCover does not. The 5-15% expected gain compounds with BEAM's per-token compression and MoE-muP's pre-training scaling.
Research angle: Three open problems. (1) Apply HodgeCover to the open-model wave. Falsifiable: run HodgeCover on Gemma 4 26B-A4B, DeepSeek V4 Flash, Kimi K2.6, Qwen3.6 35B-A3B. If the aggressive-frontier win is over 5% on three of four, the harmonic obstruction is systemic and frontier MoEs have been leaving large memory savings on the table. (2) Higher-order obstructions. The paper analyses the 2-complex (triangles). Whether 3-complexes (tetrahedra) reveal additional obstructions at very wide MoE (M >= 64) is the natural extension. (3) Hodge-aware routing. Make BEAM's per-token activation aware of harmonic-critical triples. The composition would be the first MoE serving stack with structural awareness at both the per-token activation step and the per-layer expert-set step.
AIRA-Compose and AIRA-Design: LLM agents discover scaling-frontier architectures
11 LLM agents in a 24-hour compute budget discover Transformer-Mamba hybrids that scale 54% faster than Llama 3.2 on the compute-optimal frontier.
Source: HuggingFace Daily Papers 2026-05-18 Links: Paper · Wiki Tier: 2 (architecture search, agentic systems, hybrid SSM/Transformer)
AIRA-Compose                        AIRA-Design
────────────────────────            ──────────────────────────
11 agents                           Up to 20 agents
24-hour compute budget              Write attention mechanism code
Combinatorial design space:         Write training scripts
  Attention, MLP, Mamba             Invent operators, not just compose
Two-stage search:                   Direct mechanistic synthesis
  1. Million-param candidates
  2. Top designs extrapolated
     to 350M, 1B, 3B

Output: 14 novel architectures
  AIRAformers (Transformer-based)
  AIRAhybrids (Transformer-Mamba)
At 1B pre-trained, fixed tokens:
  AIRAformer-D: +2.4% over Llama 3.2
  AIRAhybrid-D: +3.8% over Llama 3.2
Compute-optimal scaling slope:
  AIRAformer-C: 54% / 71% faster than Llama 3.2 / best Composer Transformer
  AIRAhybrid-C: 23% / 37% faster than modified Nemotron-2 / Composer hybrid
The headline is not the 2.4-3.8% accuracy gain but the scaling-frontier slope. A 54% steeper compute-optimal slope measured around 1B compounds across the roughly 100x compute multiplier between research scale and frontier scale. If the slope generalises, a frontier-scale AIRAformer pre-train would land in the regime that today requires roughly 1.5x more compute on the Llama family.
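How a steeper slope compounds depends on what "slope" means and where the curves cross, neither of which is pinned down here. As one possible reading, the sketch below assumes both models follow a pure power law L(C) = a * C^(-alpha), share the same loss at a reference compute C0, and differ only in that the new exponent is 54% larger. The function name and these assumptions are illustrative, not taken from the paper.

```python
def equal_loss_compute_ratio(scale_multiplier, slope_gain=0.54):
    """Baseline compute divided by new-architecture compute at equal loss.

    Assumes L(C) = a * C**(-alpha) for both models, identical loss at the
    reference compute C0, and a new exponent alpha * (1 + slope_gain).
    The new model at C0 * m matches the baseline at C0 * m**(1 + slope_gain),
    so the baseline exponent alpha cancels out of the ratio.
    """
    m = scale_multiplier
    return m ** slope_gain

# Ratio at the crossing point, 10x past it, and 100x past it.
for m in (1, 10, 100):
    print(m, equal_loss_compute_ratio(m))
```

Under that reading the equal-loss compute ratio grows as m^0.54, so the saving depends strongly on how far past the crossing point one extrapolates. The 1.5x figure above is best read as a rough estimate rather than a derived quantity.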
The agentic-discovery framing is the second contribution. The wiki's prior 2026-04-28 entry on the Hope architecture (the nested-learning architecture paper that beat Transformer at small scale through a hand-designed mechanism) was the most recent novel architecture in the wiki to beat Transformer. AIRA is the first wiki entry where the novel architecture beating Transformer was designed by LLM agents, not humans. This composes with the multi-agent self-evolution thread the LIFE survey (2026-05-17, the 200+ paper multi-agent survey organising work along Lay-Integrate-Find-Evolve stages) named as Stage 4. AIRA-Design is a Stage 4 system applied to the substrate of architecture itself: the agents evolve the model that will eventually be them.
Pair this against Raschka's 2026-05-17 catalog of human-designed moves shipping in the May open-model wave (Gemma 4's KV sharing, Laguna XS.2's layer-wise attention budgeting, ZAYA1-8B's compressed convolutional attention, DeepSeek V4's mHC). Every move in his catalog is convergent: independent teams arrived at similar architectures by hand. AIRA proposes that the same convergence is reachable by automated search at non-trivial scale. If the 54% slope is real at 8B and 30B, the cost of architectural discovery drops by roughly an order of magnitude for any lab with agentic compute.
Why it matters: Architecture search has been on the research backbench for two years because hand-designed architectures kept winning. If AIRA's scaling-frontier slope replicates at 8B and 30B, the backbench moves back into the central rotation. The economic logic mirrors MoE-muP for MoEs: principled discovery is cheaper than an empirical sweep.
CIPO and NudgeRL: the two ends of RLVR sparse-reward
Two RLVR papers, same day, opposite ends. NudgeRL changes what gets explored. CIPO changes how the failures from exploration get reused.
Source: HuggingFace Daily Papers 2026-05-18 Links: CIPO Paper · CIPO Wiki · NudgeRL Paper · NudgeRL Wiki Tier: 2 (RLVR, reasoning)
Standard RLVR weakness             CIPO                        NudgeRL
──────────────────────             ────────────────            ──────────────────
Sparse binary reward.              Failed rollouts mined       Each rollout conditioned
Failed rollouts: gradient signal   and converted into          on lightweight strategy-
is uniformly negative, no          correction-oriented         level context (e.g. "try
credit assignment to which         supervision. Model          case-by-case", "find an
step caused failure.               attends failed prefix       invariant"). Reward signal
                                   and emits correction        decomposes into inter- and
                                   continuation. No oracle,    intra-context components.
                                   no critic, no PRM.          Distillation pushes useful
                                                               strategies into base policy.
Brute-force fix: bigger            Pass@K gain > pass@1        Matches vanilla GRPO at 8x
rollout budgets.                   gain. Capacity expansion,   larger rollout budget. Strategy
                                   not probability redistr.    pool is fixed lightweight.
CIPO is the cleanest counter-evidence in the wiki to the 2026-04-21 critique of RLVR (the wiki's standing reference, the paper that argued RLVR mostly redistributes probability mass over already-discovered correct answers rather than expanding the model's intrinsic reasoning capacity). CIPO's pass@K-over-pass@1 gain is the specific empirical pattern that critique predicted should be absent. Whether the pattern survives independent replication on AIME 2026 is the load-bearing test.
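The pass@K versus pass@1 distinction can be made concrete with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. The per-problem counts below are invented for illustration and are not CIPO results. They show the pattern in question: newly solving a problem even once in 16 samples moves pass@8 far more than pass@1.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass(correct_counts, k, n=16):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts of correct samples (out of 16) on four problems.
before = [8, 0, 0, 0]
after = [8, 1, 1, 0]   # two previously unsolved problems now solved once each

gain_at_1 = mean_pass(after, 1) - mean_pass(before, 1)
gain_at_8 = mean_pass(after, 8) - mean_pass(before, 8)
```

A method that only sharpens probability on already-solved problems would raise pass@1 on the first problem and leave pass@8 nearly flat, which is the opposite signature.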
NudgeRL's 8x rollout-budget equivalence is the efficiency claim. Under matched final accuracy, strategy-conditioned exploration needs roughly one-eighth of the rollout budget of brute-force rollout scaling. The strategy pool is hand-crafted lightweight scaffolds, not oracle hints. The deployed inference-time policy is the distilled base, not the strategy-conditioned model.
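One way to picture the inter- and intra-context split is the usual between-group and within-group decomposition of a centred reward. This is an assumption for illustration: NudgeRL's actual decomposition and any normalisation it applies are not specified here, and the function name and reward values are hypothetical.

```python
from statistics import mean

def decompose_advantages(rewards_by_context):
    """Split each rollout's centred reward into inter- and intra-context parts.

    rewards_by_context maps a strategy context to the rewards of its rollouts.
    Returns {context: [(inter, intra), ...]} with inter + intra = reward - global mean.
    """
    all_rewards = [r for rs in rewards_by_context.values() for r in rs]
    global_mean = mean(all_rewards)
    out = {}
    for ctx, rs in rewards_by_context.items():
        ctx_mean = mean(rs)
        inter = ctx_mean - global_mean                   # how well the strategy does overall
        out[ctx] = [(inter, r - ctx_mean) for r in rs]   # intra: rollout vs its own context
    return out

# Hypothetical binary rewards for two strategy contexts on one prompt.
parts = decompose_advantages({"case-by-case": [1.0, 0.0], "invariant": [0.0, 0.0]})
```

In this toy case the "invariant" context contributes no intra-context signal because all its rollouts failed, yet it still carries a negative inter-context term, so the strategy choice itself receives credit assignment.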
The natural composition: NudgeRL diversifies the rollouts, CIPO recycles the failures. The first paper to compose them is one experiment away. Both inherit the standard RLVR reward-hacking risk that the wiki has been tracking via the Kurate cs.LG #10 paper from this week (LLMs Gaming Verifiers, ai_rating 6.8/10, showing RLVR pipelines can be reward-hacked when the policy learns to game the verifier rather than solve the task). Whether more aggressive use of verifier signal (CIPO) and more diverse exploration (NudgeRL) amplifies or dampens reward hacking is the next falsifier.
→ CIPO full summary · NudgeRL full summary
Solvita: agentic evolution for code with an adversarial Hacker
Four-agent code framework where the Hacker constructs adversarial tests that update routing weights. Nearly doubles single-pass accuracy.
Source: HuggingFace Daily Papers 2026-05-18 Links: Paper · Wiki Tier: 2 (agentic systems, self-evolution, code)
Planner ◄──┐                  Outcome signals:
   │       │                  ─────────────────
   ▼       │                  Pass/fail verdict  ───► Planner, Solver
Solver ────┼──► Verifier      Test cert. quality ───► Oracle
   │       │                  Adversarial vuln.  ───► Hacker, Solver
   ▼       │
Oracle ────┘
   │
   ▼
Hacker (constructs adversarial tests)
   │
   └─────► RL updates on each agent's graph-structured knowledge network
The Hacker is the structural innovation. Most prior multi-agent code systems used a critic or verifier that grades a proposed solution post hoc. Solvita's Hacker actively constructs adversarial test cases that target the Solver's likely failure modes, and the resulting attack patterns are stored as RL signal on the Hacker's network. The base LLM is frozen throughout; all learning happens in the per-agent graph-structured knowledge networks.
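The difference between grading post hoc and actively hunting for failures can be shown with plain differential testing. The sketch below is only an analogy: random search stands in for the learned Hacker, a trusted reference implementation stands in for whatever certifies tests in Solvita, and every name in it is hypothetical. It does not model the graph-structured knowledge networks or the RL updates.

```python
import random

def find_counterexample(solver, reference, gen_input, trials=1000, seed=0):
    """Search for an input on which `solver` disagrees with a trusted `reference`."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        if solver(x) != reference(x):
            return x  # adversarial test case found
    return None

def buggy_max(xs):
    """Hypothetical Solver output: maximum of a list, wrong when every value is negative."""
    best = 0
    for v in xs:
        best = max(best, v)
    return best

def gen_small_list(rng):
    """Generate a short list of small integers, including negatives."""
    return [rng.randint(-5, 5) for _ in range(rng.randint(1, 4))]

bad_input = find_counterexample(buggy_max, max, gen_small_list)
```

A fixed test suite of positive-valued lists would pass `buggy_max` indefinitely; a generator that targets the failure mode exposes it, and in a system like the one described that discovery is what would feed back as training signal.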
This is the most concrete Stage 4 system in the wiki under the LIFE survey taxonomy (the 200+ paper multi-agent survey from 2026-05-17 that organised the field along Lay capability foundation, Integrate via collaboration, Find faults via attribution, Evolve via self-improvement). Solvita integrates four specialised agents (Stage 2), uses the Hacker to attribute failures (Stage 3), and evolves the routing graphs from those failures (Stage 4). The closed loop is functional and not theoretical.
Compare against Conductor (2026-05-11, Sakana AI's ICLR 2026 paper that trained a 7B RL orchestrator to invoke frontier models GPT-5, Claude Sonnet 4, Gemini 2.5 Pro at roughly 3 calls per question and beat every individual model on GPQA-D / LiveCodeBench / AIME25). Conductor and Solvita are both routing systems, both trained by RL, both target frontier-model performance with small-model orchestration. Conductor routes between heterogeneous frontier models; Solvita routes between homogeneous agent roles. Conductor's 3-calls-per-question budget is roughly equivalent to Solvita's four-agent overhead. Whether per-role specialisation (Solvita) or per-model routing (Conductor) is the better economic structure is the cross-paper question.
PAGER and DiagnosticIQ: deployment-calibration is now a structural axis, not a model axis
88% action-type accuracy versus under 6% task success in GUI agents. Top three LLMs within one Macro point on industrial-rule benchmarks but 49-63% original-answer-rate under condition inversion. Two benchmarks today, same pattern.
Source: HuggingFace Daily Papers 2026-05-18 Links: PAGER Paper · PAGER Wiki · DiagnosticIQ Paper · DiagnosticIQ Wiki Tier: 2 (deployment, responsible AI, agentic systems)
PAGER reframes the GUI-agent metric. The dominant region-tolerant paradigm scores agents on action-type accuracy: did the click land inside the right component? On precision-sensitive geometric tasks, where actions must land on specific points in continuous canvas space, this is the wrong metric. PAGE Bench (4,906 problems, 224K process-supervised pixel-level actions) reports that general multimodal models exceed 88% action-type accuracy but stay under 6% task success. The Semantic-Execution Gap is between knowing what action to take and executing it precisely enough that downstream geometry-dependent steps still work. PAGER narrows the gap, reaching 4.1x the strongest baseline, via dependency-structured planning plus precision-aligned RL with state-conditioned geometric feedback.
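The two scoring regimes can be contrasted with a small example. This is a sketch under assumptions: the dictionary layout, the 2-pixel tolerance, and the coordinates are hypothetical, and PAGE Bench's actual scoring rules are not given here.

```python
from math import hypot

def action_type_correct(pred, target):
    """Region-tolerant check: right action type and click inside the component's box."""
    x0, y0, x1, y1 = target["box"]
    px, py = pred["point"]
    return pred["type"] == target["type"] and x0 <= px <= x1 and y0 <= py <= y1

def precise_enough(pred, target, tol_px=2.0):
    """Precision-sensitive check: click within tol_px of the exact target point."""
    (px, py), (tx, ty) = pred["point"], target["point"]
    return pred["type"] == target["type"] and hypot(px - tx, py - ty) <= tol_px

# Hypothetical canvas step: the click must land on a specific point in a large canvas.
target = {"type": "click", "box": (0, 0, 400, 300), "point": (120.0, 80.0)}
pred = {"type": "click", "point": (131.0, 86.0)}

region_ok = action_type_correct(pred, target)
precise_ok = precise_enough(pred, target)
```

The same prediction counts as correct under the region-tolerant check and as a miss under the precision check, and on a multi-step construction one such miss can invalidate every later step that depends on it.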
DiagnosticIQ runs the same diagnostic at a different layer. On 6,690 industrial-maintenance multiple-choice questions, the top three frontier LLMs land within one Macro point. The capability axis has flattened. The brittleness axis has not. Every model loses 13-60% relative accuracy under distractor expansion (DiagnosticIQ Pro). Under condition inversion (DiagnosticIQ Aug), 49-63% of frontier-model answers persist as the original answer. In other words, the model recognises the surface pattern of the rule rather than its logical content.
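The persistence figure is a simple ratio. The sketch below assumes paired answer lists for the original and condition-inverted versions of each question; whether "original answer" means the pre-inversion key or the model's own pre-inversion choice is not specified here, and the function name and sample answers are hypothetical.

```python
def original_answer_rate(original_answers, inverted_answers):
    """Fraction of questions where the answer given after inversion equals the original one."""
    if len(original_answers) != len(inverted_answers) or not original_answers:
        raise ValueError("need two equal-length, non-empty answer lists")
    same = sum(a == b for a, b in zip(original_answers, inverted_answers))
    return same / len(original_answers)

# Hypothetical answers on four multiple-choice questions, before and after inversion.
rate = original_answer_rate(["B", "A", "D", "C"], ["B", "C", "D", "C"])
```

Since inverting the condition should change the correct option, a high rate indicates the answer is tracking the question's surface form.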
Read these two papers next to WildClawBench (2026-05-15, the agent benchmark that measured an 18-point spread between best and worst agent harness running the same model on the same 60 long-horizon tasks) and CurveBench (2026-05-17, the nested-Jordan-curves benchmark where Gemini 3.1 Pro reaches 71.1% Easy and 19.1% Hard, with RLVR lifting Qwen3-VL-8B from 2.8% to 33.3% on Easy). Four benchmarks in five days identify the same structural pattern: headline accuracy and deployment-relevant capability have decoupled along a structural axis. WildClawBench and CurveBench were the first two confirmations; today's pair crosses the pattern threshold of three and brings the count to four. The 2026-05-17 digest predicted the field would converge on representational interventions. PAGER and DiagnosticIQ confirm the gap is post-training-learnable (PAGER closes it via state-conditioned RL, similar in spirit to CurveBench's RLVR result) but suggest the calibration is per-domain.
→ PAGER full summary · DiagnosticIQ full summary
FashionChameleon: KV cache rescheduling as a fourth axis
Training-Free KV Cache Rescheduling for interactive multi-condition video. Composes garment KV refresh, historical KV withdraw, reference KV disentangle.
Source: HuggingFace Daily Papers 2026-05-18 Links: Paper · Wiki Tier: 1 (KV cache, video generation)
The wiki's KV cache thread has been mapping three orthogonal compression and routing axes. Make Each Token Count (2026-05-12, the paper that scored each cached entry with a small projection and showed selective retention can surpass the full cache) introduced learned eviction. Forcing-KV (2026-05-15, the paper that found attention heads in autoregressive video diffusion cluster into static and dynamic functional roles, where static roles tolerate aggressive compression and dynamic roles do not) introduced head-role compression. Gemma 4's KV sharing (surveyed in Raschka's 2026-05-17 post, where later layers reuse earlier layers' K and V projections, halving cache size at 128K context) introduced architectural sharing.
FashionChameleon adds a fourth axis: content-aware KV cache rescheduling. When a user switches a garment mid-video, the cache must simultaneously refresh entries pertaining to the new garment, withdraw entries that encoded the old garment, and disentangle entries that came from reference inputs versus generated history. This is not eviction (which entry to drop), not compression (how many bits per entry), and not sharing (which layers reuse). It is the cache-as-state-machine view: which entries belong to which causal-conditioning thread.
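The cache-as-state-machine view amounts to tagging each entry with the conditioning thread and the source it came from. The toy bookkeeping below is a sketch under assumptions: the class, the tag names, and the string payloads standing in for K/V tensors are all hypothetical, and the paper's actual operations on attention state are not reproduced.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedKVCache:
    """Toy cache whose entries carry the conditioning thread they belong to."""
    entries: list = field(default_factory=list)  # each entry: (thread, source, payload)

    def add(self, thread, source, payload):
        self.entries.append((thread, source, payload))

    def withdraw(self, thread):
        """Drop generated history that encoded an outdated condition."""
        self.entries = [e for e in self.entries
                        if not (e[0] == thread and e[1] == "history")]

    def refresh(self, thread, payloads):
        """Replace the reference entries for a condition with new ones."""
        self.entries = [e for e in self.entries
                        if not (e[0] == thread and e[1] == "reference")]
        for p in payloads:
            self.add(thread, "reference", p)

    def disentangle(self):
        """Split entries into reference-derived and generated-history groups."""
        ref = [e for e in self.entries if e[1] == "reference"]
        hist = [e for e in self.entries if e[1] == "history"]
        return ref, hist

cache = TaggedKVCache()
cache.add("garment", "reference", "red_jacket_kv")
cache.add("garment", "history", "frame_0_kv")
cache.add("pose", "history", "pose_frame_0_kv")

# The user switches garments mid-video: withdraw stale history, refresh the reference.
cache.withdraw("garment")
cache.refresh("garment", ["blue_coat_kv"])
ref, hist = cache.disentangle()
```

The pose thread is untouched by the garment switch, which is the property a single-bucket cache cannot express.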
The four axes compose. The 2026-05-17 digest projected that a stack combining learned eviction, head-role compression, and architectural sharing would multiply roughly to 8x memory reduction. Adding rescheduling extends the stack to multi-conditioning scenarios. The wiki has no entry yet for KV cache rescheduling in text. FashionChameleon is the candidate template.
Why it matters: Multi-conditioning workloads (multi-document RAG, multi-tool agentic conversations, multi-reference video) are now the dominant deployment shape. A cache management discipline that explicitly tracks which conditioning thread each entry serves is what those workloads need.
Research angle: Whether the three-mechanism rescheduling (refresh / withdraw / disentangle) generalises to text multi-conditioning is the natural test. Falsifiable: implement the same three operations in a multi-document RAG harness and measure cache efficiency versus a single-bucket baseline.
Industry Pulse
- DAIR.AI weekly papers (Gmail-starred via LinkedIn newsletter, 2026-05-17). Six papers in the weekly: Lighthouse Attention (covered in 2026-05-16 digest), Is Grep All You Need (the paper arguing grep-style text search matches or exceeds embedding retrieval inside coding agents under controlled harness conditions, isolating harness design as the dominant variable), Goodfire's geometric calculator (the interpretability finding that an LLM represents numbers as Fourier features on circles in activation space, with arithmetic as rotation, and the same circuit reused beyond arithmetic), delta-mem (the frozen-backbone associative-memory paper with an 8x8 online state that lifts the backbone by 1.10x average and 1.31x on MemoryAgentBench, covered in 2026-05-13 digest), the LIFE multi-agent survey (covered yesterday), and AutoTTS (test-time scaling reframed as controller-search over pre-collected reasoning trajectories). Two of the six are new wiki signal: Is Grep All You Need confirms WildClawBench's harness-dominance thesis from 2026-05-15 (the 18-point harness spread paper), and Goodfire's geometric-calculator finding extends the mechanistic-interpretability thread the wiki has been tracking through the May 16 "All Circuits Lead to Rome" cluster on circuit non-uniqueness.
- The Semiconductor Newsletter Week 20 (Gmail-starred 2026-05-17). AI-infrastructure thread continues. Headlines: Tata Electronics and ASML expand on the India 300mm fab. POET Technologies / Lumilens advance wafer-level photonic integration for AI optical interconnect scaling. Applied Materials and TSMC expand the EPIC Center AI process collaboration. Tower Semiconductor secures $1.3B silicon photonics contracts for 2027. NVIDIA and Ineffable Intelligence target RL infrastructure for next-generation AI. The optical-interconnect-for-AI thread crosses three independent items in one week. Tier 1 hardware context: the data-centre topology assumption for 2027 is now optical interconnect at wafer-level integration, not the copper-plus-NVLink baseline the May 2024 papers assumed.
- Gary Marcus on Marcus on AI (Gmail-starred 2026-05-17). Three interview clips on neurosymbolic AI, world models, hyperscaling skepticism, and software verification. Tier 4 commentary; flagged for the running thread on critical assessments of pure-LLM scaling and the case for verification-heavy deployment, which intersects today's DiagnosticIQ brittleness finding.
- MTP follow-up PR on llama.cpp (PR #23198, r/LocalLLaMA score 135). A follow-up to the MTP support merged on 2026-05-17, avoiding logits-copy during prompt decode. The Strix Halo prompt-processing regression measured yesterday was partially addressed upstream within 24 hours. Speculative-decoding-on-consumer-hardware continues to consolidate.
- Fournex GPU bottleneck analyser (r/CUDA discussion). Open-source tool that turns Nsight Compute output into evidence-backed optimisation recommendations for CUDA kernels: classifies bottlenecks from hardware-counter evidence, ranks by severity, generates concrete remediations with metric references. Detects uncoalesced global memory access, L1/L2 cache thrashing, tensor-core underutilisation, warp-stall patterns, register-pressure spills. The kernel-authoring-and-diagnostics layer continues to diversify (Cutile-rs landed on 2026-05-17). Production CUDA work is moving from artisanal to instrumented within a 30-day window.
Connecting the Dots
MoE design surface, all three directions in one week
MoE-muP (2026-05-17, theory)             HodgeCover (2026-05-18, compression)
─────────────────────────────            ─────────────────────────────────
Closed-form MSSP across                  Harmonic-kernel obstruction in
M, Ne, K, N, L axes for                  simplicial Laplacian on expert
scale-stable MoE pre-training            2-complex. Greedy harmonic
(Vankadara et al., Kurate                coverage selects which experts
cs.LG #13 ai_rating 9.0/10)              to keep at aggressive compression
              │                                          │
              ▼                                          ▼
                 BEAM (2026-05-16, per-token activation)
Binary expert-activation masks trained end-to-end via straight-through estimator.
Decides which experts to activate per token. 98%+ retention at 85% FLOP reduction.
────────────────────────────────────────────────────────────────────────────────
Composition not yet written: MSSP-scaled MoE pre-train + BEAM per-token activation
+ HodgeCover post-training compression = full three-axis MoE serving stack.
RLVR sparse-reward surface, both ends in one day
NudgeRL (today)                          CIPO (today)
────────────────────                     ──────────────────
Changes WHAT gets explored.              Changes HOW failures are reused.
Strategy-level context per               Failed trajectories converted into
rollout, distillation back to            correction-oriented supervision.
base policy. Matches GRPO at 8x          Pass@K gain > pass@1 gain.
smaller rollout budget.                  Counters RLVR weak-supervision critique
                                         (2026-04-21) directly.
        │                                                      │
        └───────── compose: nudge exploration, recycle failures ──────┘
             Single experiment away. Diagnostic: pass@K gain
             under composition versus sum of individual gains.
Deployment-calibration gap, fourth confirmation
WildClawBench (2026-05-15)               CurveBench (2026-05-17)
─────────────────────────                ────────────────────────
18-point harness spread on               Gemini 3.1 Pro 71.1% / 19.1%.
60 long-horizon tasks                    RLVR on 8B model: 2.8% -> 33.3%
        │                                                      │
        └────────► PAGER (today) + DiagnosticIQ (today) ◄──────┘
          88% action-type / <6% task success on GUI geometry
          Top-3 LLMs within 1 Macro point, 49-63% original-answer
          rate under condition inversion on industrial rules
          ────────────────────────────────────────────────────
          Four benchmarks in five days. The structural decoupling
          between headline accuracy and deployment-relevant capability
          is past pattern threshold. RLVR closes it per-domain.
Cross-paper thread #1: the MoE design surface is now mapped end-to-end. This is the most concentrated week the wiki has tracked on MoE scaling. MoE-muP (2026-05-17, Vankadara et al. at Gatsby UCL plus Amazon plus Tübingen, Kurate cs.LG #13 ai_rating 9.0/10, the first principled scaling theory for Mixture-of-Experts deriving closed-form MSSP prescriptions for initialization, learning rate, weight decay, and routing temperature across the five MoE axes M, Ne, K, N, L using Dynamical Mean Field Theory) gives the forward direction: how to scale a new MoE. BEAM (2026-05-16, the paper that replaced fixed top-K routing with a per-token learned binary mask trained end-to-end via straight-through estimator, achieving 98%+ retention at up to 85% FLOP reduction with a custom vLLM kernel for the binary structure) gives the per-token activation direction: which experts to run for this token. HodgeCover (today, the paper that models the obstruction blocking aggressive learning-free MoE compression as the harmonic kernel of the simplicial Laplacian on a 2-complex of experts and uses Hodge decomposition to identify the harmonic-critical edges and triangles) gives the post-training compression direction: which experts to keep. Three orthogonal MoE knobs, three principled methods, all in one week. The empirical wave that needs these is visible: Gemma 4 26B-A4B, DeepSeek V4 Pro 1.6T-A49B and Flash 284B-13B, Kimi K2.6, Qwen3.6 35B-A3B, Laguna XS.2 33B-A3B, MiMo-V2.5-Pro. Six frontier-tier open MoEs are now in the wild, all designed by hand without a principled recipe. The next-generation training pipeline that combines MoE-muP scaling, BEAM activation, and HodgeCover compression is a single integration paper away.
Cross-paper thread #2: the RLVR sparse-reward weakness gets both an exploration fix and a credit-assignment fix on the same day. RLVR Weak Supervision (2026-04-21, the wiki's standing reference paper that argued RLVR mostly redistributes probability mass over already-discovered correct answers rather than expanding the model's intrinsic reasoning capacity) has been the open critique the wiki tracked for a month. CIPO (today, the paper that mines on-policy failed trajectories and converts them into correction-oriented supervision by pairing each failed prefix with a correct continuation derived from the same model's adjacent success rollouts) and NudgeRL (today, the paper that conditions each rollout on a lightweight strategy-level context drawn from a fixed pool and decomposes the reward into inter- and intra-context components, with a distillation pass that pushes useful strategies into the base policy, matching vanilla GRPO at 8x larger rollout budgets) are the two complementary answers. CIPO's pass@K-over-pass@1 gain is the cleanest empirical counter to the RLVR Weak Supervision critique to date. The compose-CIPO-with-NudgeRL experiment is one paper away.
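The digest does not give NudgeRL's exact reward decomposition, so the following is only a sketch of one plausible reading: the standard between-group / within-group split of a centred reward. The strategy names, the function name, and the reward values are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import fmean


def decompose_rewards(rollouts):
    """Split each centred reward into inter- and intra-context parts.

    rollouts: list of (context_id, reward) pairs.
    Returns a list of (context_id, inter, intra), where
      inter = mean reward of the context - global mean reward
      intra = reward - mean reward of the context
    so inter + intra == reward - global mean for every rollout.
    """
    by_ctx = defaultdict(list)
    for ctx, reward in rollouts:
        by_ctx[ctx].append(reward)
    global_mean = fmean(reward for _, reward in rollouts)
    ctx_mean = {ctx: fmean(rs) for ctx, rs in by_ctx.items()}
    return [
        (ctx, ctx_mean[ctx] - global_mean, reward - ctx_mean[ctx])
        for ctx, reward in rollouts
    ]


# Hypothetical verifier rewards (1 = pass, 0 = fail) under two made-up strategies.
rollouts = [
    ("decompose_first", 1.0),
    ("decompose_first", 0.0),
    ("work_backwards", 0.0),
    ("work_backwards", 0.0),
]
parts = decompose_rewards(rollouts)
for ctx, inter, intra in parts:
    print(f"{ctx:16s} inter={inter:+.2f} intra={intra:+.2f}")
```

In this reading the inter-context term credits the strategy as a whole, while the intra-context term credits the individual rollout relative to its peers under the same strategy. Even when every rollout in one context fails, the inter term still carries a non-zero signal, which is one way a strategy pool could soften sparse rewards.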
Cross-paper thread #3: the deployment-calibration gap crosses four confirmations. WildClawBench (2026-05-15) was the first. CurveBench (2026-05-17) was the second. PAGER and DiagnosticIQ today bring the count to four benchmarks in five days reporting the same pattern: headline accuracy and deployment-relevant capability have decoupled along a structural axis. The decoupling is not random. It is consistent across GUI agentic execution (PAGER), industrial-rule reasoning (DiagnosticIQ), agent harness selection (WildClawBench), and visual structural reasoning (CurveBench). The 2026-05-17 digest predicted the field would converge on representational interventions. PAGER and DiagnosticIQ confirm the gap is post-training-learnable; PAGER closes its gap with precision-aligned RL and state-conditioned geometric feedback, similar in spirit to CurveBench's RLVR result. The remaining open question is whether a single representational intervention closes the gap across all four domains or whether each requires per-domain post-training.
Cross-paper thread #4: agentic self-evolution at the architecture level. AIRA-Compose and AIRA-Design (today) are the first wiki entries where LLM agents discover neural architectures that scale faster than hand-designed baselines at non-trivial parameter count. This composes with the multi-agent self-evolution cluster that LIFE (2026-05-17, the 200+ paper multi-agent survey organising work along Lay-Integrate-Find-Evolve stages) unified yesterday. EvolveMem (2026-05-15, retrieval-configuration self-evolution), Orchard (2026-05-15, credit-assignment SFT), SDAR (2026-05-15, sigmoid-gated on-policy self-distillation), EvoEnv (2026-05-15, environment synthesis), FrontierSmith (2026-05-16, open-ended problem generation), Sylph AI (2026-05-16 social-stream, harness construction), Solvita (today, agent-role routing graphs) all evolve different substrates. AIRA evolves the substrate of architecture itself. The cluster is now eight papers; if the trend continues, agentic self-evolution is the dominant Stage 4 form by July.
Cross-paper thread #5: KV cache rescheduling joins the cache-management taxonomy. FashionChameleon (today) extends the wiki's KV cache thread from three orthogonal compression-and-routing axes (learned eviction via Make Each Token Count 2026-05-12, head-role compression via Forcing-KV 2026-05-15, architectural sharing via Gemma 4 surveyed by Raschka 2026-05-17) to a fourth: content-aware rescheduling for multi-conditioning workloads. The four-axis stack is now mapped. The text-side rescheduling generalisation is the missing piece.
Worth Watching
- HodgeCover on the open-model wave. 30-60 days. Apply HodgeCover to Gemma 4 26B-A4B, DeepSeek V4 Flash 284B-13B, Kimi K2.6, Qwen3.6 35B-A3B. If the aggressive-compression frontier win is over 5% on three of four, the harmonic obstruction is systemic and frontier MoEs have been leaving large memory savings on the table. If under 5%, the obstruction is rare in well-trained MoEs and the theoretical contribution is more significant than the empirical one.
- AIRA's scaling slope at 8B and 30B. 60-90 days. The 54% steeper compute-optimal slope is measured at 1B. Whether the slope generalises to 8B and 30B is the load-bearing extrapolation question. Falsifiable: train AIRAformer-C at 8B and 30B under a fixed token budget and check the scaling fit. If the slope holds within 10% at 8B, agentic architecture search is back on the central rotation.
- CIPO pass@K gain replication on independent benchmarks. 30-60 days. The cleanest empirical test is AIME 2026 or LiveCodeBench. If the pass@K-over-pass@1 gain holds at K=32, the RLVR Weak Supervision critique (2026-04-21, the paper arguing RLVR redistributes capability rather than expanding it) is genuinely contradicted. If only pass@1 rises while pass@K stays flat, CIPO is doing the redistribution the critique predicted.
- CIPO + NudgeRL composition. 30 days. The simplest experiment: strategy-nudged rollouts (NudgeRL) plus correction-oriented supervision on the failed strategies (CIPO). Diagnostic: pass@K under composition versus sum of individual gains. If super-additive, the two methods address different bottlenecks. If sub-additive, they overlap more than the surface description suggests.
- MoE-muP MSSP back-fit against frontier MoEs. Carried over from 2026-05-17 Worth Watching. 60-90 days. Take Gemma 4 26B-A4B's published KV-sharing fraction, expert count, and expert width, and check whether MSSP predicts those choices at the published compute budget. If MSSP's predictions land within 10% of the published values for two or more of Gemma 4, Kimi K2.6, and DeepSeek V4, the recipe is real.
- Adversarial-Hacker generalisation in Solvita. 60 days. Whether the Hacker's learned attack patterns generalise to a held-out problem distribution (e.g. Codeforces Div 1) or simply memorise per-benchmark exploits is the deployment-relevant test. If they generalise, adversarial-test construction is a reusable Stage 4 signal source.
- MTP win-loss crossover on consumer hardware. Carried over from 2026-05-17. Today's PR #23198 addressing the prompt-processing regression is the first move toward formalising the 27B-wins-35B-mixes crossover. Track the next 30 days of llama.cpp benchmark threads on Strix Halo and RTX 5090.
- LLM-rated underrated from Kurate (current week). The weekly cs.AI and cs.LG leaderboards moved slightly from last week; cs.LG #13 is still the MoE-muP paper deep-dived yesterday. Other current-week picks: cs.AI #5 "AI scientists produce results without reasoning scientifically" by Ríos-García et al. (ai_rating 8.5/10, recurring) on the surface-mimicry failure mode of AI scientists; pairs cleanly against today's DiagnosticIQ Aug finding of 49-63% original-answer-rate under condition inversion. cs.AI #11 "Hodoscope: Unsupervised Monitoring for AI Misbehaviors" (ai_rating 7.2/10, recurring) on unsupervised anomaly-detection of model behaviour signatures, the upstream interpretability companion to DiagnosticIQ. cs.AI #12 "Value-Conflict Diagnostics Reveal Widespread Alignment Faking" (ai_rating 7.0/10, recurring). cs.LG #10 "LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking" (ai_rating 6.8/10) is directly relevant to today's CIPO and NudgeRL: both papers use the verifier signal more aggressively and inherit the reward-hacking risk this paper documents. cs.LG #11 "The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime" (ai_rating 7.8/10). cs.LG #12 "LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit" (ai_rating 8.0/10, recurring; still in tension with the 2026-05-16 mechanistic-interpretability cluster that argued circuits are non-unique).
- Rising authors from Kurate. No new authors crossed threshold this week. The current crossing cluster (Guy Lutsker, Andrew Zhang, Haotian Ye, Siavash Golkar, Hannah Guan, Martiño Ríos-García) is dominated by biomedical and scientific-discovery teams, which sit at Tier 3 for the wiki's research-engineering focus. No add-to-handles suggestions this week.
- Cross-source confirmation (HF + Kurate). Today's HuggingFace top and the current Kurate cs.AI / cs.LG weeklies have no direct overlap on Tier 1 or Tier 2 topics. The cross-source-confirmed Tier 1 promotion rule did not fire today.
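The CIPO replication and CIPO + NudgeRL composition items above both hinge on pass@K arithmetic, so a small sketch may help. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples with c correct, plus a hypothetical additivity check. The function names are illustrative, and every number below is made up for demonstration; none is a result from either paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled solutions, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def composition_gap(base, method_a, method_b, both):
    """Gain of the combined method minus the sum of the individual gains.

    Positive: super-additive (the methods address different bottlenecks).
    Negative: sub-additive (the methods overlap).
    """
    return (both - base) - ((method_a - base) + (method_b - base))


# Made-up problem: 64 samples, 4 correct.
p1 = pass_at_k(64, 4, 1)
p32 = pass_at_k(64, 4, 32)
print(f"pass@1={p1:.4f} pass@32={p32:.4f}")

# Made-up pass@32 scores for baseline, each method alone, and the composition.
gap = composition_gap(base=0.50, method_a=0.58, method_b=0.56, both=0.70)
print("super-additive" if gap > 0 else "sub-additive or additive")
```

The same per-problem counts give very different pass@1 and pass@32 values, which is why a method can raise pass@K much more than pass@1, and why the composition diagnostic should be run at a fixed K.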
Quick Hits
Flash-GRPO (arXiv 2605.15980). Single-step training framework for aligning video diffusion models via Group Relative Policy Optimization (GRPO, the lightweight on-policy RL recipe). Two contributions: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency; temporal gradient rectification neutralises the time-dependent scaling factor that produced inconsistent gradient magnitudes across timesteps. Validated at 1.3B to 14B parameters. Tier 3 video alignment; structurally interesting as the video-diffusion analog of the RLVR variance-control work the wiki has tracked through Balanced Aggregation GRPO (2026-05-09).
InsightTok (arXiv 2605.14333). Discrete visual tokeniser with localised content-aware perceptual losses for text and face fidelity in autoregressive image generation. 16k codebook, 16x downsampling. Tier 3 image generation.
DepthVLM (arXiv 2605.15876). Attaches a lightweight depth head to a VLM's LLM backbone and trains under unified vision-text supervision in a two-stage schedule, producing full-resolution depth maps alongside language outputs in a single forward pass. Tier 3 multimodal; one of the cleaner unified-vision-language extensions.
Look Before You Leap (arXiv 2605.16143). Already cross-referenced in the RLVR thread. Introduces Exploration Checkpoint Coverage as a verifiable exploration metric independent of task reward, then trains agents with interleaved task and exploration rollouts under the Explore-then-Act paradigm. Tier 2 agentic systems.
MMSkills (arXiv 2605.13527). Multimodal skill packages (text procedure plus state cards plus multi-view keyframes) for visual agents, with a branch-loaded inference pattern that keeps heavy multimodal evidence on a side branch. Tier 2.
DexJoCo, FFAvatar, OmniHumanoid, ReactiveGWM, WorldAct. Robotics dexterous manipulation benchmark, few-shot 3D Gaussian avatar reconstruction, cross-embodiment humanoid video generation, reactive game-NPC world model, static-to-interactive 3D world activation. Tier 4 across the board. Skip.
Sources ingested today: HF (18 papers; 7 substantive at Tier 1 or Tier 2, 3 at Tier 3, 5 at Tier 4 skipped), Gmail (3 starred: Semiconductor Newsletter Week 20, DAIR.AI weekly top papers, Gary Marcus video roundup), RSS (no new file for 2026-05-18; latest is 2026-05-17, content already integrated into yesterday's digest), Twitter morning slot (2 sparse AI-handle tweets from @MillionInt, 0 retweets, 0 articles), Kurate cs.AI plus cs.LG plus rising-authors weekly leaderboards (no rising-author additions, no HF cross-source confirmation), Reddit (8 subs scraped; r/LocalLLaMA and r/CUDA carried substantive Tier 1 signal on llama.cpp MTP follow-up and Fournex bottleneck analyser, r/reinforcementlearning had one GRPO explanation thread, others empty), parallel Daily-Digest (no file for 2026-05-18 in /Users/amitsinghbhatti/Documents/Claude/Projects/Daily-Digest/). Wiki pages updated: 10 new summary pages (HodgeCover, AIRA, CIPO, NudgeRL, Solvita, Look Before You Leap, MMSkills, PAGER, DiagnosticIQ, FashionChameleon).