cere-bro | 2026-05-15
Six papers on agent memory in one batch, two on the routing-decision design space, one on head-role-conditioned KV cache compression for video diffusion, plus the asynchronous-continuous-batching scheduling primitive. Yesterday's thread was "where does the routing decision live." Today's is "memory and environment are the next programmable substrates, and the eval ceilings on both are below 60%."
TL;DR
- Agent memory cluster (six papers). STALE caps the best frontier model at 55.2% on implicit-conflict detection. MemLens caps multi-session reasoning below 30%. EvolveMem self-evolves retrieval configuration (+25.7% on LoCoMo). Preping builds memory pre-task at 2-3x lower deployment cost. The memory layer is now a programmable substrate, and the eval ceiling is structurally low. → cluster summary
- RouteProfile (arXiv 2605.00180). Structured profiles beat flat ones, query-level signals beat domain-level signals, and generalization to new models needs structured + trainable profiles. The routing decision was the wrong unit of analysis; the profile is the right one. Tier 1.
- Dynamic Latent Routing (DLR) (arXiv 2605.14323). Joint learning of discrete latent codes, routing policies, and model parameters in one stage. +6.6 pp over SFT in low-data fine-tuning, motivated by the General Dijkstra Search theorem. The first paper in the wiki where routing is a training objective, not a deployment concern. Tier 1.
- Forcing-KV (arXiv 2605.09681). Head-role-conditioned KV cache compression for AR video diffusion. Static heads vs dynamic heads as a stable functional split. 29+ fps on H200 at 30% memory reduction, 2.82x at 1080P. Tier 1.
- HuggingFace asynchronous continuous batching (blog). Three CUDA streams, two buffer slots, no model changes. GPU utilization 76.0% → 99.4%, 22% generation speedup on 8B at batch 32. The scheduling primitive that production inference was missing.
- Orchard (arXiv 2605.15040). Open-source agentic training framework. 64.3% → 67.5% on SWE-bench Verified at 30B after SFT and SFT+RL. Open-source SOTA at this scale.
- WildClawBench (arXiv 2605.10912). Native-runtime long-horizon agent benchmark, 60 tasks, 8 min wall-clock each. Claude Opus 4.7 tops out at 62.2%. Switching harness alone shifts performance by 18 points.
- SU-01 (arXiv 2605.13301). 30B-A3B reasoning model trained with a unified three-stage recipe, 200 RL steps, gold-medal IMO 2025 + USAMO 2026 + IPhO 2024/2025. The "specializable-generalist" framing has its first frontier-tier datapoint.
- EvoEnv (arXiv 2605.14392). Self-improving RL via verifiable environment synthesis. The asymmetry condition (solve-verify gap) as the structural invariant. Gains on the already-strong Qwen3-4B-Thinking, where fixed-data RLVR reduces the score.
- Darwin Family (arXiv 2605.14386). Training-free evolutionary merging. 86.9% on GPQA Diamond, rank #6 of 1,252 models, no gradient training. Cross-architecture Transformer + Mamba breeding.
- Industry pulse. Subquadratic ships Appen-validated 56.2x speedup vs FlashAttention-2 at 1M tokens, 81.8% SWE-bench Verified (Gmail-starred). xAI ships Grok Build CLI for SuperGrok Heavy (Tweet). NVIDIA releases NVFP4 Kimi-K2.6 quantization (HF). RTX 5090 is the only consumer GPU not dropping in EU price (r/LocalLLaMA tracker).
The Big Picture
Yesterday the thread was "where does the routing decision live." Five layers (model, adapter, expert, distillation loss, decoding head) all got papers in three weeks. Today adds two more: the profile the router consults (RouteProfile) and the latent code the model learns during training (DLR). Routing is now a seven-layer design space, and one of the layers (profile design) is the thing the routing community was systematically ignoring.
The second thread is harder to miss: memory is the next layer up from KV cache, and six papers in one day pull it into the same architectural treatment. STALE caps implicit-conflict detection at 55.2%. MemLens caps multi-session multimodal at below 30%. EvolveMem self-evolves retrieval. Preping constructs memory before tasks. The 2026-05-12 Make Each Token Count paper said the KV cache should be a programmable substrate, not a buffer. The 2026-05-15 cluster says the same about agent memory, one layer up, with the same kind of evidence (eval ceilings below what scaling alone can fix).
The third thread is the eval-ceiling pattern continuing. Yesterday: AgentLens (10.7% Lucky), AssetOpsBench (-0.13 correlation), Soohak (refusal failure). Today: WildClawBench (62.2% best, 18-point harness sensitivity), STALE (55.2%), MemLens (<30%). The pattern is now four days long across distinct benchmarks. Every layer of the agentic stack that gets a process-aware, native-runtime, or staleness-aware benchmark caps below what the existing pass-rate metrics suggested. The cyber-eval doubling-rate from AISI (05-13) still hasn't been measured under any of these new lenses.
The compute-side thread also moves. Asynchronous continuous batching returns a 22% generation speedup by closing the CPU-GPU idle gap, with no model or kernel changes; Forcing-KV cuts video-diffusion cache memory by 30% with role-conditioned compression; NVIDIA's NVFP4 Kimi-K2.6 release and r/LocalLLaMA's TurboQuant practitioner study fill in the quantization axis. Three axes of the inference stack (scheduling, cache compression, quantization) updated in one week. None of the updates is a model change. All of them are deployable today.
Deep Dives
Agent memory: the new programmable substrate (cluster of 6)
Memory was the next layer up from KV cache. Today the wiki gains its first cluster of evidence that memory deserves the same architectural treatment: programmable, policy-aware, co-evolving, evaluated under realistic conditions. The eval ceilings (55.2% on STALE, <30% on multi-session multimodal) are evidence this is unsolved.
Sources: HuggingFace Daily Papers (all six) Links: STALE · Preping · EvolveMem · MemEye · MemLens · BOOKMARKS · Cluster wiki · agent-memory concept page Tier: 2 cluster (treated as Tier 1 in space allocation because of the six-paper convergence)
Evaluation                          Construction                 Adaptive infra
──────────────────                  ─────────────────            ────────────────────
STALE     55.2% on staleness        Preping (pre-task)           EvolveMem
MemEye    visual-fidelity loss      2.99x cheaper than online    auto-discovers retrieval
MemLens   <30% multi-session                                     config from failure logs
BOOKMARKS storyline (no abstract)
        │                                   │                            │
        └───────────────────────────────────┼────────────────────────────┘
                                            │
                    shared diagnosis (across all six):
                    memory systems freeze the retrieval mechanism,
                    treat content as static facts, lose visual fidelity
                    under compression. Implicit conflict is the hardest
                    failure mode and propagates downstream.
STALE is the load-bearing eval paper. The failure mode it names, Implicit Conflict, is a later observation that invalidates an earlier memory without explicit negation. Detecting it requires contextual inference and commonsense reasoning. Best model: 55.2% across 1,200 queries. Three probing dimensions: State Resolution (detect that a prior belief is outdated), Premise Resistance (reject queries that falsely presuppose stale state), Implicit Policy Adaptation (proactively apply updated state in downstream behavior). Models accept outdated assumptions embedded in user queries and fail to propagate state changes across related memories. CUPMem prototype: structured state consolidation + propagation-aware search at write time.
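To make "propagation at write time" concrete, here is a toy sketch, not CUPMem itself. It assumes the dependency between memories is already known (`depends_on` is a hypothetical field); in STALE, inferring that dependency is exactly the hard part. The class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    value: str
    stale: bool = False
    depends_on: tuple = ()  # keys whose change invalidates this fact (assumed known)


class StateMemory:
    """Toy store: a changed value marks dependent facts stale at write time."""

    def __init__(self):
        self.facts = {}

    def write(self, key, value, depends_on=()):
        old = self.facts.get(key)
        changed = old is not None and old.value != value
        self.facts[key] = Fact(value, depends_on=tuple(depends_on))
        if changed:
            self._propagate(key)

    def _propagate(self, changed_key):
        # Walk dependents transitively; the stale flag also guards against cycles.
        for key, fact in self.facts.items():
            if changed_key in fact.depends_on and not fact.stale:
                fact.stale = True
                self._propagate(key)

    def read(self, key):
        fact = self.facts.get(key)
        return None if fact is None or fact.stale else fact.value


mem = StateMemory()
mem.write("city", "Berlin")
mem.write("gym", "Mitte climbing gym", depends_on=["city"])
mem.write("city", "Lisbon")  # no explicit negation of the gym memory
print(mem.read("city"), mem.read("gym"))
```

The gym memory is never contradicted directly; it becomes unreadable only because the write to `city` propagates. A read-time-only system would still serve it.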
Preping answers the cold-start question. Most agents face an empty memory when first introduced to a new environment. Preping builds procedural memory before any target-environment tasks, using only self-generated synthetic practice. The Proposer-Solver-Validator loop with structured control state shapes future practice; not all synthetic data is equal, and proposer-side control over feasibility/redundancy/coverage is what makes it work. 2.99x lower deployment cost on AppWorld than online memory construction, 2.23x on BFCL v3, competitive with playbook-based methods. The cost comparison is the production-relevant number: pre-task memory is cheaper than learning-from-deployment.
EvolveMem is the architectural innovation. Most memory systems treat retrieval as fixed: stored content evolves while scoring functions, fusion strategies, and answer-generation policies remain frozen at deployment. EvolveMem exposes the full retrieval configuration as a structured action space optimized by an LLM-powered diagnosis module reading per-question failure logs. Closed-loop self-evolution: AutoResearch on the system's own architecture. +25.7% relative on LoCoMo over the strongest baseline, +78.0% over the minimal baseline, +18.9% on MemBench. Evolved configurations transfer across benchmarks with positive (not catastrophic) transfer. This is the agent-memory analogue of Make-Each-Token-Count's learned KV eviction: substrate-as-policy, one layer up.
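A minimal sketch of "retrieval configuration as an action space", assuming a two-field config and a tiny hand-made eval set (both hypothetical). A plain greedy search stands in for EvolveMem's LLM-powered diagnosis module; only the closed loop (failure log → proposed config → re-evaluation) is the point.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetrievalConfig:
    lexical_weight: float = 1.0
    recency_weight: float = 0.0


# (text, timestamp): later entries supersede earlier ones on the same topic.
MEMORY = [
    ("user home city Berlin", 1),
    ("user home city Lisbon", 5),
    ("user favorite drink espresso", 2),
    ("user favorite drink green tea", 6),
]
# (query, index of the memory item that should be retrieved)
EVAL = [("user home city", 1), ("user favorite drink", 3)]


def score(cfg, query, item):
    text, ts = item
    overlap = len(set(query.split()) & set(text.split()))
    return cfg.lexical_weight * overlap + cfg.recency_weight * ts


def retrieve(cfg, query):
    return max(range(len(MEMORY)), key=lambda i: score(cfg, query, MEMORY[i]))


def failures(cfg):
    """Per-question failure log for one configuration."""
    return [(q, want) for q, want in EVAL if retrieve(cfg, q) != want]


def propose(cfg, step=0.1):
    # Stand-in for the diagnosis module: perturb each field up and down.
    for name in ("lexical_weight", "recency_weight"):
        for delta in (step, -step):
            yield replace(cfg, **{name: max(0.0, getattr(cfg, name) + delta)})


def evolve(cfg, rounds=5):
    for _ in range(rounds):
        log = failures(cfg)
        if not log:
            break
        best = min(propose(cfg), key=lambda c: len(failures(c)))
        if len(failures(best)) >= len(log):
            break  # no proposal reduces the failure count
        cfg = best
    return cfg


start = RetrievalConfig()
print(len(failures(start)), len(failures(evolve(start))))
```

With a frozen lexical-only scorer, both queries return the outdated entry; the loop discovers that a small recency weight fixes them without touching stored content.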
MemEye and MemLens are the multimodal eval papers. MemEye constructs a two-dimensional framework (visual-evidence granularity × usage from single-evidence to evolutionary-synthesis) with ablation-driven validation gates (answerability, shortcut resistance, visual necessity, reasoning structure). MemLens runs 27 LVLMs and 7 memory-augmented agents across 5 memory abilities at 4 context lengths (32K-256K). Both find that visual fidelity is the choke point. Long-context LVLMs do well at short context but degrade as conversations grow. Memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%. Neither approach alone solves the task.
Why it matters: Six papers in one day make this a cluster, not a coincidence. The agent-memory layer is the substrate the wiki was about to need a concept page for (just added). The eval ceilings rule out "scaling will fix it." The architectural moves (co-evolved retrieval, pre-task construction, propagation-aware writes) are deployable today.
Research angle: Three threads worth pulling. (1) EvolveMem + STALE composition. EvolveMem auto-discovers retrieval configurations; STALE probes implicit conflicts. A retrieval policy that EvolveMem evolves specifically to detect stale state would close the loop. No paper has written this. (2) Memory-as-routing-signal. If staleness can be detected per-query (the STALE State Resolution probe), it can drive routing decisions between retrieval, refresh, or fallback paths. Untested. (3) Hybrid long-context + structured-retrieval architecture. MemLens explicitly motivates this. The architecture has not been published.
→ Cluster summary · agent-memory concept page
RouteProfile + Dynamic Latent Routing: the routing design space opens on two axes
The routing literature has been obsessed with router mechanism design. RouteProfile makes the profile (how candidates are described to the router) an independent design surface with four dimensions. DLR makes the latent code a routing target during training. Two papers in one day pulling routing in opposite directions: deployment-time profile design, training-time internal routing.
Sources: HuggingFace Daily Papers Links: RouteProfile paper · RouteProfile wiki · DLR paper · DLR wiki Tier: 1. Routing design space, profile representation, training-time routing
Deployment-time layer                        Training-time layer
──────────────────────────────               ────────────────────────
RouteProfile (today)                         DLR (today)
─ profile design space (4 dims):             ─ joint learning of:
    organizational form                          discrete latent codes
    representation type                          routing policies
    aggregation depth                            model parameters
    learning configuration                       (one stage)
─ findings:                                  ─ +6.6 pp over SFT on
    structured > flat                            low-data fine-tuning
    query-level > domain-level               ─ motivated by General
    trainable beats frozen on new models         Dijkstra Search theorem
RouteProfile treats LLM profiling as a heterogeneous-data integration problem. The 4D design space (organizational form, representation type, aggregation depth, learning configuration) is the contribution. Three findings hold across three representative routers under standard and new-LLM-generalization settings: structured profiles consistently outperform flat ones; query-level signals are more reliable than coarse domain-level signals; generalization to newly introduced models benefits most from structured profiles with trainable configurations. The new-LLM-generalization finding is the production-relevant one. Routing fleets add models monthly; structured + trainable profiles let the router cold-start new models from a small description instead of waiting for empirical traces.
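A minimal sketch of why the profile matters independently of the router, under stated assumptions: `StructuredProfile`, its fields, and the Jaccard-based scoring are illustrative, not RouteProfile's design space. The same `route` function behaves differently depending only on whether query-level signal is present in the profile.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredProfile:
    name: str
    domain_scores: dict                                  # coarse, domain-level signal
    solved_queries: list = field(default_factory=list)   # query-level signal


def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def route(query, domain, profiles, query_weight=1.0):
    """Pick the candidate whose profile best matches; ties go to the first one."""
    def match(p):
        domain_part = p.domain_scores.get(domain, 0.0)
        query_part = max((jaccard(query, q) for q in p.solved_queries), default=0.0)
        return domain_part + query_weight * query_part
    return max(profiles, key=match).name


profiles = [
    StructuredProfile("model_a", {"math": 0.8}, ["integrate a rational function"]),
    StructuredProfile("model_b", {"math": 0.8}, ["prove a triangle inequality"]),
    # A newly added model can be described before any traces exist.
    StructuredProfile("model_c", {"math": 0.7}),
]
query = "prove the triangle inequality for these sides"
print(route(query, "math", profiles, query_weight=0.0))  # domain-level only
print(route(query, "math", profiles, query_weight=1.0))  # with query-level signal
```

With domain-level signal alone the two strong candidates are indistinguishable; the query-level field is what separates them, and the dispatcher code never changed.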
The implication is uncomfortable: every routing paper in the wiki (TraceR, CARE, Sakana Conductor, Netflix State of Routing, MinT) studied the dispatcher and used whatever profile representation was convenient. The same router can be a strong or weak system depending entirely on how candidates are described. The wiki's existing routing pages should be re-read in this light.
DLR moves routing into the post-training pipeline. The General Dijkstra Search theorem (globally optimal goal-reaching policies in MDPs with time-varying rewards can be recovered through temporal composition of intermediate optimal sub-policies) motivates joint learning of discrete latent codes, routing policies, and model parameters in one training stage. In low-data fine-tuning across four datasets and six models, DLR matches or outperforms SFT with a mean +6.6 pp gain. Prior discrete-latent baselines consistently underperformed SFT; DLR is the first to flip that result. The mechanistic-analyses claim is the interesting part: targeted code ablations show that learned codes have distinct causal roles. This connects directly to WriteSAE: if the latent codes are causally distinct, they are addressable for behavioral interventions in the same way SAE features are.
Why it matters: The routing surface now has six addressable internal layers (model, adapter, expert, distillation loss, decoding head, latent code) plus an orthogonal profile-design axis. The next composition (and the one no paper has shipped yet) is the joint routing problem with profile design as a first-class input: which model under which profile representation → which adapter → which cache eviction policy → which decoding head → which latent code → which distillation loss.
Research angle: (1) Profile-router co-training. RouteProfile keeps the router fixed and studies the profile. The natural extension is joint optimization. (2) DLR composed with MinT. Train per-task adapters with DLR-learned codes; route between adapters using the codes as profile signal. Unifies training-time and deployment-time routing. (3) Code-space interpretability. If codes are causally distinct, can they be assigned semantic labels (style, domain, reasoning depth)? Bridge between routing and interpretability literatures.
→ RouteProfile summary · DLR summary
Forcing-KV: head-role-conditioned cache compression for video diffusion
The KV-cache thread has moved from "compress uniformly" (2024) to "evict learned" (Make Each Token Count, 05-12) to "compress per head role" (today). Each step is policy-aware in a different way. The cache is now a programmable substrate, not a storage layer, in both LLM and video-diffusion regimes.
Source: HuggingFace Daily Papers Links: Paper · Wiki Tier: 1. KV cache, video diffusion, GPU memory efficiency
Standard cache (uniform)              Forcing-KV (role-conditioned)
───────────────────────────           ────────────────────────────────────
60+ GB cache for 30s 1080P            static heads  ─► structured pruning
uniform attention complexity          dynamic heads ─► segment-similarity
                                                       pruning
            ↓                                      ↓
Dummy Forcing                         head-role split is stable across
prunes aggressively, gets             samples and denoising steps
flicker and broken transitions
                                      29+ fps on H200, 30% memory
                                      reduction, 2.82x at 1080P
The load-bearing empirical observation: attention heads in mainstream AR diffusion models (Self Forcing family) exhibit two functional roles that hold across samples and denoising steps. Static heads attend to chunk transitions and intra-frame fidelity; dynamic heads govern inter-frame motion and temporal consistency. The split is stable enough to make role assignment a static analysis decision rather than an online policy. Each head class gets its own compression rule: structured pruning where information is locally redundant, segment-similarity-based pruning where temporal coherence matters.
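The dispatch pattern (one compression rule per head role, roles fixed ahead of time) can be sketched in a few lines. Both pruning rules below are assumptions chosen for illustration: "keep the first entry plus a recent window" for static heads and "drop near-duplicates of the last kept entry" for dynamic heads. The paper's actual rules and thresholds are not reproduced here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def prune_static(cache, keep_first=1, keep_last=2):
    """Structured pruning (assumed form): fixed pattern, independent of content."""
    if len(cache) <= keep_first + keep_last:
        return list(cache)
    return cache[:keep_first] + cache[-keep_last:]


def prune_dynamic(cache, threshold=0.99):
    """Similarity pruning (assumed form): keep an entry only if it differs
    enough from the last kept one, so motion changes survive."""
    kept = []
    for entry in cache:
        if not kept or cosine(kept[-1], entry) < threshold:
            kept.append(entry)
    return kept


RULES = {"static": prune_static, "dynamic": prune_dynamic}


def compress(cache_by_head, roles):
    # Roles are assigned offline, so this is a plain lookup at inference time.
    return {h: RULES[roles[h]](c) for h, c in cache_by_head.items()}


cache = {
    0: [[1.0, 0.0]] * 5,
    1: [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]],
}
roles = {0: "static", 1: "dynamic"}
print({h: len(c) for h, c in compress(cache, roles).items()})
```

Because the role map is static, there is no per-step policy cost: the only online work is the pruning itself.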
The 2.82x speedup at 1080P composes with two other pieces of the stack landing this week: NVIDIA's NVFP4 Kimi-K2.6 quantization release (HF) on the per-token-bytes axis, and the asynchronous continuous batching blog on the scheduling axis. Three independent improvements in one week. None are model changes.
The deeper read connects to WriteSAE: head specialization is now observable, exploitable, and architecture-portable. WriteSAE found that recurrent-state heads can be addressed via rank-1 atomic interventions at the cache-write site. Forcing-KV finds that AR-diffusion attention heads can be addressed via role-conditioned compression. Same pattern: heads have stable functional roles, and the role is the addressable unit.
Why it matters: 30% memory reduction at 29+ fps on a single H200 is the difference between consumer-GPU feasibility and not. Combined with SANA-WM (today, 60-second 720P on RTX 5090 with NVFP4 quantization), long-form streaming video is now on track for consumer hardware.
Research angle: (1) Cross-architecture head-role transfer. Static-vs-dynamic head roles found in Self Forcing family; whether the same dichotomy holds in non-AR diffusion or pure-AR video generators is unstudied. (2) Online role identification. The paper presents role assignment as a static decision; a scheduler that re-identifies role per workload is the obvious online variant. (3) Composition with Orthrus. If two generation heads share a cache (Orthrus) and the cache compresses per head role (Forcing-KV), the cache is now both shared and selectively compressed. Components compose cleanly; the architecture has not been built.
Asynchronous continuous batching: 22% from scheduling alone
Continuous batching was a 2023 production workhorse. Three years later, the scheduling primitive gets one structural upgrade (async CPU-GPU overlap) that returns 22% with no model or kernel changes. The substrate has more headroom than the architecture papers suggest.
Source: HuggingFace blog Links: Blog post · Wiki Tier: 1. GPU optimization, continuous batching, scheduling
Synchronous batching                Async continuous batching
────────────────────                ────────────────────────────
CPU prepares batch N+1      ─►      3 CUDA streams (H2D, compute, D2H)
        ↓                             ─► non-default, return control to CPU
GPU idles                           CUDA events for ordered handoff
        ↓                             ─► no CPU block
GPU computes batch N                Dual buffer slots A and B
        ↓                             ─► CPU writes B while GPU reads A
CPU idles                           Carry-over mask transfers tokens
        ↓                             ─► batch N's output → batch N+1's input
24% of time spent serialized
utilization 76.0%                   utilization 99.4%
generation 300.6s                   generation 234.5s (22% faster)
The HuggingFace transformers continuous-batching implementation now ships an asynchronous version using three CUDA streams (H2D, compute, D2H), CUDA events for ordered handoff (h2d_stream.record(h2d_done), compute_stream.wait(h2d_done), d2h_stream.wait(compute_done)), and two parallel buffer slots so the CPU can prepare batch N+1 while the GPU reads batch N from slot A. A carry-over mask transfers freshly generated tokens from batch N's output to batch N+1's input via tensor ops, using placeholder zeros that are populated as the GPU finishes. Multiple captured graphs share one memory pool.
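The effect of the two buffer slots can be seen in a small timeline model. This is a scheduling sketch, not the transformers implementation: `cpu` and `gpu` are hypothetical per-batch costs (0.24 and 0.76 are picked to mirror the 76% synchronous utilization), and the model assumes the carry-over mask removes any CPU-side dependency on the previous batch's output.

```python
def sync_time(n, cpu, gpu):
    """CPU prep and GPU compute strictly alternate."""
    return n * (cpu + gpu)


def async_time(n, cpu, gpu, slots=2):
    """CPU prepares batch i+1 while the GPU runs batch i.

    A buffer slot is reusable only once the batch that last used it has
    finished on the GPU, which is what limits the CPU's lead to `slots` batches.
    """
    prep_done = [0.0] * n
    gpu_done = [0.0] * n
    for i in range(n):
        cpu_free = prep_done[i - 1] if i else 0.0
        slot_free = gpu_done[i - slots] if i >= slots else 0.0
        prep_done[i] = max(cpu_free, slot_free) + cpu
        gpu_free = gpu_done[i - 1] if i else 0.0
        gpu_done[i] = max(prep_done[i], gpu_free) + gpu
    return gpu_done[-1]


n, cpu, gpu = 100, 0.24, 0.76
for label, total in (("sync", sync_time(n, cpu, gpu)), ("async", async_time(n, cpu, gpu))):
    print(label, round(total, 2), "utilization", round(n * gpu / total, 4))
```

In this idealized model only the first batch's preparation is exposed, so utilization approaches 100%. The reported 99.4% is lower because the real system still synchronizes with the CPU to sample outputs, which the sketch leaves out.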
The 22% number is close to the 24% theoretical ceiling (eliminating CPU overhead entirely); the residual gap is the unavoidable CPU sync for sampling outputs. There is no obvious further compression on this dimension without GPU-side sampling.
Cross-source signal: this drops the same week as TurboQuant practitioner study (r/LocalLLaMA), NVIDIA's NVFP4 Kimi-K2.6, and Forcing-KV. Three pieces of the inference stack updating in one week: scheduling, quantization, head-role cache compression. None are model changes.
Why it matters: Async batching for RL rollouts is the immediate composition. The 16K+ generation lengths the post mentions are exactly the RL post-training regime. NeMo-RL speculative rollouts (1.77x) composed with async continuous batching (about 1.28x, from 300.6 s → 234.5 s) would be a 2x+ wall-clock improvement on the rollout phase, which is typically 60-70% of RL training cost, assuming the two gains multiply.
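The composition arithmetic can be checked from the reported wall-clock times. Note that a 22% reduction in generation time corresponds to a throughput ratio of about 1.28x. The sketch assumes the two rollout speedups are independent and multiply, which nobody has measured, and uses Amdahl's law to translate the rollout gain into end-to-end training time.

```python
sync_s, async_s = 300.6, 234.5          # reported generation times

time_saved = 1 - async_s / sync_s       # fraction of wall-clock removed
async_speedup = sync_s / async_s        # throughput ratio
spec_speedup = 1.77                     # NeMo-RL speculative rollouts, as cited

# Assumption: the two gains are independent and compose multiplicatively.
rollout_speedup = spec_speedup * async_speedup


def end_to_end(rollout_fraction, speedup):
    """Amdahl's law: only the rollout share of training is accelerated."""
    return 1.0 / ((1.0 - rollout_fraction) + rollout_fraction / speedup)


print(f"time saved        {time_saved:.1%}")
print(f"async speedup     {async_speedup:.2f}x")
print(f"rollout combined  {rollout_speedup:.2f}x")
for frac in (0.6, 0.7):
    print(f"end-to-end at {frac:.0%} rollout share: {end_to_end(frac, rollout_speedup):.2f}x")
```

The 2x+ figure applies to the rollout phase only; with rollouts at 60-70% of training cost, the end-to-end gain under these assumptions lands around 1.5-1.6x.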
Research angle: (1) Async continuous batching for RL rollouts (above). (2) Whether vLLM adopts async continuous batching is the production-side watch. (3) GPU-side sampling closes the last 2 pp.
WildClawBench + Orchard + SDAR + EvoEnv: the agentic stack converges
WildClawBench puts the agentic-eval ceiling at 62.2% under native runtime and finds an 18-point spread from harness choice alone. Orchard hits 67.5% on SWE-bench Verified at 30B with credit-assignment SFT + balanced RL. SDAR gates OPSD inside multi-turn RL (+9.4% over GRPO). EvoEnv constructs verifiable environments instead of generating data. Four pieces of the agentic post-training stack landing in one batch.
Sources: HuggingFace Daily Papers (all four) Links: WildClawBench paper · wiki · Orchard paper · wiki · SDAR paper · wiki · EvoEnv paper · wiki Tier: 2 (treated as Tier 1 in space because four-paper convergence on the agentic post-training axis)
Evaluation               Infrastructure           Training recipe          Environment
──────────────           ─────────────────        ───────────────────      ─────────────────
WildClawBench            Orchard                  SDAR                     EvoEnv
62.2% best (Opus 4.7)    Open-source K8s          Gated OPSD as            Construct verifiable
18-pt harness shift      environment service      auxiliary; RL primary    environments instead
60 native-runtime        Orchard-SWE: 67.5%       +9.4% ALFWorld           of generating data.
tasks, 20+ tool calls    on SWE-bench Verified    over GRPO                Asymmetry condition
each, hybrid grading                                                       as structural invariant
        │                        │                        │                        │
        └────────────────────────┴────────────┬───────────┴────────────────────────┘
                                              │
                              same week: a complete agentic
                              post-training stack arrives in pieces
WildClawBench changes what "frontier model" means in deployment. 60 bilingual multimodal tasks averaging 8 min wall-clock and 20+ tool calls, run inside reproducible Docker containers hosting actual CLI agent harnesses (OpenClaw, Claude Code, Codex, Hermes Agent) with real tools, not mocks. Hybrid grading: deterministic checks, environment-state auditing, LLM/VLM semantic judge. Best across 19 frontier models: Claude Opus 4.7 at 62.2% under OpenClaw. Every other model below 60%. Switching harness alone shifts a single model by up to 18 points. The harness is the load-bearing layer, not the model. If routing systems are choosing on SWE-bench Verified scores, they are choosing on a metric that does not reflect native-runtime performance.
Orchard ships the corresponding training-side recipe. Open-source Kubernetes-native environment service (Orchard Env) plus three recipes (SWE, GUI, Claw). Orchard-SWE on Qwen3-30B-A3B-Thinking: 64.3% on SWE-bench Verified after SFT, 67.5% after SFT+RL. Open-source SOTA at 30B. The load-bearing ideas: credit-assignment SFT learns from productive segments of unresolved trajectories (107K distilled from MiniMax-M2.5 and Qwen3.5-397B); Balanced Adaptive Rollout handles sparse-reward RL. Whether Orchard-SWE's 67.5% holds under WildClawBench's native-runtime grading is the natural follow-up. Given WildClawBench's 18-point harness spread, expect Orchard-SWE to be sensitive to the harness it was trained against.
SDAR addresses the multi-turn agent RL stability problem. OPSD (On-Policy Self-Distillation) helps with sparse trajectory rewards but destabilizes under compounding multi-turn instability and skill-conditioned privileged guidance. SDAR keeps RL primary and uses OPSD as a gated auxiliary: a sigmoid gate over detached token-level signals strengthens distillation on teacher-endorsed positive-gap tokens and softly attenuates negative rejections. +9.4% ALFWorld, +7.0% Search-QA, +10.2% WebShop-Acc over GRPO; avoids naive GRPO+OPSD instability. The gate mechanism is the same shape as yesterday's Extrapolation Cliff: selective use of dense teacher signal, gated by a structural quantity.
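The gating idea can be sketched without an autograd framework. This is an illustration under assumptions, not SDAR's loss: the "gap" is taken to be teacher log-probability minus student log-probability per token, and `temperature` and `aux_coef` are hypothetical knobs. In a real training loop the gate would be computed on detached values so no gradient flows through it.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_weights(teacher_logp, student_logp, temperature=1.0):
    """Per-token gate: near 1 where the teacher endorses the token more than
    the student (positive gap), near 0 where the teacher rejects it."""
    return [sigmoid((t - s) / temperature) for t, s in zip(teacher_logp, student_logp)]


def total_loss(rl_loss, distill_losses, weights, aux_coef=0.1):
    """RL stays primary; the gated distillation term is a small auxiliary."""
    aux = sum(w * d for w, d in zip(weights, distill_losses)) / len(distill_losses)
    return rl_loss + aux_coef * aux


teacher = [-0.1, -3.0]   # teacher likes token 0, dislikes token 1
student = [-2.0, -0.2]
weights = gate_weights(teacher, student)
print([round(w, 3) for w in weights])
print(round(total_loss(1.0, [1.9, 2.8], weights), 4))
```

The soft gate is what distinguishes this from naive GRPO+OPSD: negative-gap tokens are attenuated rather than pushing the policy with full weight.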
EvoEnv is the most ambitious framing of the four. Self-improving RL via verifiable environment synthesis. The model constructs environments (Python programs that sample instances, compute references, score responses), not data. Sustains improvement only if environments exhibit solve-verify asymmetry: model can write an oracle once that it cannot reliably execute in natural language on fresh instances. Two sources: algorithmically hard but trivial as code (DP, graph traversal); intrinsically hard to solve but easy to verify (planted subset-sum, CSP). On already-strong Qwen3-4B-Thinking: fixed-data RLVR reduces the average score; EvoEnv improves it from 72.4 to 74.8 (+3.3% relative). The strong-regime gain is the headline. Most self-improvement methods help weak models or not at all when the baseline is strong.
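A verifiable environment in this sense is just a program with three jobs: sample an instance, know a reference, score a response. Below is a toy planted subset-sum environment as an illustration. The class name, parameters, and scoring convention are assumptions, not EvoEnv's generated code.

```python
import random


class PlantedSubsetSum:
    """Toy verifiable environment: hard to solve, cheap to verify."""

    def __init__(self, n=12, max_value=10**6, seed=0):
        self.n = n
        self.max_value = max_value
        self.rng = random.Random(seed)

    def sample(self):
        """Return (instance, reference). The reference is known by construction
        because the solution is planted before the target is computed."""
        values = [self.rng.randint(1, self.max_value) for _ in range(self.n)]
        k = self.rng.randint(2, self.n // 2)
        planted = sorted(self.rng.sample(range(self.n), k))
        instance = {"values": values, "target": sum(values[i] for i in planted)}
        return instance, planted

    @staticmethod
    def score(instance, response):
        """Accept any list of distinct valid indices whose values hit the target."""
        values = instance["values"]
        valid = len(set(response)) == len(response) and all(
            isinstance(i, int) and 0 <= i < len(values) for i in response
        )
        if not valid:
            return 0.0
        return 1.0 if sum(values[i] for i in response) == instance["target"] else 0.0


env = PlantedSubsetSum(seed=1)
instance, reference = env.sample()
print(env.score(instance, reference), env.score(instance, []))
```

Verification is a single sum, while finding a subset without the planted answer is a search problem. That gap is the asymmetry the paper treats as the structural invariant.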
Read together, the four papers form a complete agentic post-training stack. WildClawBench measures honestly; Orchard provides training infrastructure; SDAR provides a stable multi-turn RL recipe; EvoEnv provides the substrate (verifiable environments) for sustained self-improvement. None of the four proposes the joint composition: train an agent with Orchard's credit-assignment SFT and SDAR's gated OPSD on EvoEnv-constructed environments, then evaluate it under WildClawBench's native runtime.
Why it matters: The open-source agent stack jumps in capability today (Orchard 67.5%), in evaluation honesty (WildClawBench 62.2% ceiling), in training stability (SDAR's gated OPSD), and in long-term self-improvement substrate (EvoEnv's asymmetry condition). The agentic stack just got four times more legible.
Research angle: (1) The four-paper composition above. (2) AgentLens applied to WildClawBench trajectories. Yesterday's 10.7% Lucky-Pass rate on SWE-bench Verified should be re-measured on WildClawBench's native-runtime trajectories. (3) EvoEnv + Darwin Family. Today's Darwin Family merges checkpoints without training; EvoEnv constructs verifiable environments. The composition is RSI-as-engineering: evolve environments, merge checkpoints, score on WildClawBench. Closest thing the wiki has to a falsifiable RSI research program.
→ WildClawBench wiki · Orchard wiki · SDAR wiki · EvoEnv wiki
SU-01 + Darwin Family: compact reasoning models at frontier-tier scores
SU-01 specializes a 30B-A3B backbone to gold-medal IMO/USAMO/IPhO with 200 RL steps and a unified three-stage recipe. Darwin Family hits 86.9% GPQA Diamond (rank #6 of 1,252 models) with no gradient training. Two papers in one day reframing "scale" as recipe-and-search problems.
Sources: HuggingFace Daily Papers Links: SU-01 paper · wiki · Darwin paper · wiki Tier: 2. Reasoning models, post-training recipes, training-free composition
SU-01 (Shanghai AI Lab) achieves gold-medal level on IMO 2025 (35 points), USAMO 2026 (35 points, 10 above gold line), and IPhO 2024/2025 with a 30B-A3B model and only 200 RL steps. Three-stage recipe: (1) Rigorous SFT on 340K sub-8K-token trajectories using a reverse-perplexity curriculum to instill proof-search and self-checking; (2) Two-Stage RL (Coarse with verifiable rewards, then Refined with generative rewards + self-refinement + experience replay); (3) test-time scaling via self-verification and refinement (lifts IMO-ProofBench from 57.6% direct to 70.2%). The "specializable-generalist" framing has its first frontier-tier datapoint: with the right recipe, a broadly capable compact backbone can be specialized to expert-level proof reasoning while retaining cross-domain transfer (the IPhO 2024/2025 gold is on a different domain). 200 RL steps is one to two orders of magnitude fewer than typical RLVR pipelines for reasoning models, which is the most interesting line in the paper.
Darwin Family takes the orthogonal route: no training at all. Training-free evolutionary merging of large LMs via gradient-free weight-space recombination, with three ideas: 14-dimensional adaptive merge genome (component- and block-level recombination), MRI-Trust Fusion (learnable trust parameter balancing interpretability signals against evolutionary search), Architecture Mapper (cross-architecture breeding between Transformer and Mamba families). Flagship Darwin-27B-Opus: 86.9% on GPQA Diamond, rank #6 of 1,252 evaluated models, beating its fully-trained foundation without any gradient-based training. Recursive multi-generation evolution supported. The cross-architecture Transformer+Mamba result is the structural novelty: prior cross-arch composition needed from-scratch training; Darwin makes it a post-hoc operation on existing checkpoints.
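Gradient-free merging reduces to searching over a small genome that controls how parent weights are recombined. The sketch below uses one interpolation ratio per block and a simple mutate-and-keep-best loop. It is a deliberately reduced stand-in for the paper's 14-dimensional genome, and the `fitness` function is a toy proxy for a benchmark score.

```python
import random


def merge(parent_a, parent_b, genome):
    """Per-block linear interpolation; genome[block] is the share taken from parent_a."""
    return {
        block: [genome[block] * x + (1.0 - genome[block]) * y
                for x, y in zip(parent_a[block], parent_b[block])]
        for block in parent_a
    }


def evolve(parent_a, parent_b, fitness, generations=30, children=8, sigma=0.2, seed=0):
    """Mutate the best genome, keep a child only if it scores strictly higher."""
    rng = random.Random(seed)
    best = {block: 0.5 for block in parent_a}
    best_fit = fitness(merge(parent_a, parent_b, best))
    for _ in range(generations):
        for _ in range(children):
            cand = {b: min(1.0, max(0.0, r + rng.gauss(0.0, sigma))) for b, r in best.items()}
            fit = fitness(merge(parent_a, parent_b, cand))
            if fit > best_fit:
                best, best_fit = cand, fit
    return best, best_fit


# Toy parents: the ideal child takes "attn" from A and "mlp" from B.
parent_a = {"attn": [1.0, 0.0], "mlp": [0.0, 0.0]}
parent_b = {"attn": [0.0, 1.0], "mlp": [1.0, 1.0]}
target = {"attn": [1.0, 0.0], "mlp": [1.0, 1.0]}


def fitness(weights):
    # Stand-in for an evaluation run: negative squared distance to the target.
    return -sum((w - t) ** 2 for b in weights for w, t in zip(weights[b], target[b]))


genome, score = evolve(parent_a, parent_b, fitness)
print({b: round(r, 2) for b, r in genome.items()}, round(score, 4))
```

No gradients are computed anywhere; the only cost is one fitness evaluation per candidate, which is why the real bottleneck for this family of methods is evaluation throughput rather than training compute.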
The two papers are complementary, not competing. SU-01 says training is cheap if the recipe is right; Darwin says training is unnecessary if the existing model ecosystem is rich enough. Both undermine the "scale is destiny" framing from a different direction.
Why it matters: The economics of compact reasoning models change. Today's wiki gains its first 200-step RL gold-medal recipe and its first training-free GPQA #6 score. If reproducible, both shift the cost floor of frontier reasoning by 1-2 orders of magnitude.
Research angle: (1) SU-01's reverse-perplexity curriculum applied to coding / agentic post-training. Untested. (2) Darwin recursive merging stability. Whether merged models drift (Cliff-style format collapse) under many generations is open. (3) SU-01 + Darwin composition. Train a single 30B model with SU-01's recipe; merge it with Darwin against other strong reasoning checkpoints. Untested.
→ SU-01 summary · Darwin summary
Industry Pulse
- Subquadratic ships Appen-validated benchmark results (Gmail-starred). Appen independently validated SubQ as state-of-the-art on long-context retrieval and ultra-long context: 56.2x wall-clock speedup vs FlashAttention-2 at 1M tokens, 62.8x FLOP reduction vs dense attention at 1M, 95.6% RULER at 128K, 86.2% on MRCR 8-needle at 512K-1M, 81.8% SWE-bench Verified with extended thinking. LayerLens partnership for continuous benchmarking across ~100 evals. The 1M-token speedup is the most aggressive subquadratic-attention production claim in the wiki; pairs directly with Lighthouse Attention (retweet, 05-12) on the subquadratic-train, vanilla-deploy thread, but Subquadratic claims production-scale numbers.
- xAI ships Grok Build CLI for SuperGrok Heavy (@xai, x.ai/cli). An agentic CLI for coding, building apps, automating workflows. Early beta. Three xAI employees (@JasonBud, @milichab, others) post videos showing plan mode, subagent dispatch, /imagine and /imagine-video commands. The CLI agent category is now a multi-lab race: Anthropic's Claude Code, OpenAI's Codex desktop, and now Grok Build. WildClawBench (today) tests four CLI harnesses, all proprietary except OpenClaw. The Grok Build entry adds a fifth.
- NVIDIA releases NVFP4 quantization for Kimi-K2.6 and Kimi-K2.5 (HF Kimi-K2.6-NVFP4, r/LocalLLaMA). Quantization-quality table: NVFP4 hits 90.4 GPQA Diamond vs 90.9 INT4 baseline, 54.4 SciCode (better than INT4's 52.6), 76.5 MMMU Pro. Same week as TurboQuant practitioner study (vLLM blog via r/LocalLLaMA). NVFP4 is now the production-tier 4-bit format on Blackwell; the Kimi-K2 family is the first frontier MoE to get an NVFP4 release.
- RTX 5000 PRO 48GB is the new local-inference sweet spot (r/LocalLLaMA). Practitioner report: ~$4,300, runs Qwen3.6-27B-FP8 with full-precision cache via vLLM. The post explicitly notes that the build process required Claude Code to navigate Linux + vLLM setup. Local-inference deployment is now Claude-Code-assisted by default.
- EU GPU price tracker: RTX 5090 is the only tier going up (r/LocalLLaMA). 50-day, 15-store, 126K-reading study. RTX 5090 rose 3.0% (€3,392 → €3,487); every other tier fell (-0.4% to -9.1%). Algorithmic micro-pricing observed: 45 distinct prices on a single GPU over 15 days, all within a €0.99 range. AI/workstation demand absorbs 5090 supply fast enough to prevent normalization. NVIDIA is also reportedly preparing a 5090 price hike amid GDDR7 costs (TechPowerUp).
- inclusionAI ships Ring-2.6-1T (HF via r/LocalLLaMA). 1T-parameter model release. Adds to the trillion-scale open-weights tier (Kimi-K2 family, Qwen3.5-397B, MiniMax-M2.5). No deep dive yet.
- VS Code "Agents window" requires GitHub Copilot subscription to use local models (VS Code docs, r/LocalLLaMA discussion). Even when running locally, the feature is gated behind Copilot. Friction point for local-only workflows. Pairs with the harness-as-load-bearing thread from WildClawBench: the agent interface is increasingly the value-capture point.
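For readers who have not met the format behind the NVFP4 item above, here is a minimal sketch of 4-bit block-scaled quantization. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and 16-element blocks. The block scale is kept as a plain Python float rather than the low-precision scales real kernels store, and every function name is hypothetical. It illustrates the idea and is not NVIDIA's implementation.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed micro-block size for this sketch


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed grid values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from grid codes and the block scale."""
    return [c * scale for c in codes]


def quantize_tensor(values):
    """Quantize a flat list block by block and return the dequantized values."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

Because each block carries its own scale, one outlier only coarsens the 15 values that share its block, which is one plausible reason a 4-bit float format can stay close to an INT4 baseline on quality tables like the one quoted above.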
Connecting the Dots
Today's research papers                       Today's industry / social
───────────────────────                       ─────────────────────────
routing design space:                         Subquadratic 1M-token (Gmail)
  RouteProfile (deployment-time)                56.2x vs FlashAttention-2
  DLR (training-time)        ──────────────►    81.8% SWE-bench Verified
                                                    ▲
KV cache as substrate:                              │
  Forcing-KV (head-role)     ◄──────────────  NVFP4 Kimi-K2.6 (NVIDIA)
  async continuous batching                   TurboQuant practitioner study
    (HF blog)                                   (r/LocalLLaMA + vLLM)
agent memory cluster:                         Grok Build CLI (xAI)
  STALE (eval, 55.2%)                               ▲
  Preping (construction)     ──────────────►  harness-as-value-capture
  EvolveMem (self-evolving infra)               confirmed: VS Code Copilot
  MemEye, MemLens (multimodal eval)             gate on local models
  BOOKMARKS (storyline)
agentic post-training stack:                  RTX 5000 PRO 48GB
  WildClawBench (62.2% ceiling)                 deployed via Claude Code
  Orchard (67.5% SWE-bench)  ──────────────►  "local inference is now
  SDAR (gated OPSD)                             Claude-Code-assisted by default"
  EvoEnv (verifiable environments)
compact reasoning at scale:
  SU-01 (200 RL steps, gold IMO)
  Darwin (training-free, GPQA #6)
Cross-paper thread #1: memory is the next layer up from KV cache, and the eval ceiling is structural. Six papers in one day make agent memory a cluster, not a coincidence. STALE caps the best frontier model at 55.2% on implicit-conflict detection. MemLens caps multi-session multimodal reasoning below 30% across 27 LVLMs and 7 memory-augmented agents. The pattern is the same as the cache thread two months ago: the substrate is more programmable than current systems treat it as being. EvolveMem self-evolves retrieval configuration via AutoResearch (+25.7% on LoCoMo); Preping constructs memory pre-task (2-3x cheaper than online); δ-mem (05-13) is the lightweight associative-memory baseline. The wiki adds an agent-memory concept page today; one was overdue.
Cross-paper thread #2: the routing design space opens on two new axes. RouteProfile (deployment-time profile design as the missing variable) and DLR (training-time joint latent-code-and-routing-policy learning) bracket the routing decision from both ends. Together with yesterday's MinT and the prior routing papers, the routing surface has six addressable internal layers plus an orthogonal profile axis. The next composition (the joint routing problem with profile design as input) has not been written.
Cross-paper thread #3: the inference stack updates in pieces, no model changes. Async continuous batching (HuggingFace, 22% from scheduling), Forcing-KV (head-role compression for video diffusion), NVFP4 Kimi-K2.6 (NVIDIA quantization), TurboQuant practitioner study (vLLM + r/LocalLLaMA). Four independent improvements in one week. The pattern from yesterday's Energy-to-Token position paper is now playing out: when the binding constraint is energy, throughput-per-watt improvements compound across the scheduling, compression, and quantization axes. None require model changes.
Cross-paper thread #4: the agentic stack converges on an honest evaluation + complete training recipe. WildClawBench caps frontier models at 62.2% under native runtime with an 18-point harness sensitivity. Orchard provides the training infrastructure that hits 67.5% on SWE-bench Verified at 30B (open-source). SDAR provides a stable multi-turn RL recipe (gated OPSD). EvoEnv provides the long-term substrate (verifiable environment synthesis with solve-verify asymmetry as invariant). Four pieces of the agentic post-training stack in one day. The joint composition has not been written. The natural follow-up: AgentLens process labels applied to WildClawBench trajectories to measure how much of the 62.2% ceiling is Lucky-Pass.
Cross-paper thread #5: compact reasoning models reach frontier-tier scores via recipe or search, not scale. SU-01 hits gold-medal IMO/USAMO/IPhO with 200 RL steps on a 30B-A3B model. Darwin Family hits GPQA Diamond rank #6 of 1,252 models with no training at all. Two papers in one day undermine the "scale is destiny" framing from opposite directions: SU-01 says training is cheap with the right recipe; Darwin says training is unnecessary with a rich enough model ecosystem. Pair with Make Each Token Count and LongAct on the "selective is better than dense" thread: recipe quality dominates volume.
Cross-paper thread #6: the eval-ceiling pattern continues a fourth day. Yesterday: AgentLens (10.7% Lucky), AssetOpsBench (-0.13 correlation), Soohak (refusal). Today: WildClawBench (62.2% best, 18-pt harness), STALE (55.2%), MemLens (<30%). Four days in a row, every honest evaluation lens lowers the previously-reported ceiling. The cyber-eval doubling-rate from AISI (05-13) has still not been measured under any of these lenses.
Reddit + Gmail signal: the Subquadratic Appen validation (Gmail-starred) pairs with practitioner reports of TurboQuant + NVFP4 + Forcing-KV-style cache compression as a deployable production-inference stack. The xAI Grok Build CLI release (Twitter morning slot) adds a fifth frontier-lab CLI agent to the harness-as-load-bearing thread that WildClawBench formalizes. The VS Code "Agents window" requiring Copilot to run local models is the corporate confirmation that the harness, not the model, is where value capture lives. Three independent signals in one batch on the same diagnosis.
Media-Live morning slot synthesis (2026-05-15): see morning synthesis. Today's morning slot has no @bayesiansapien retweets; the AI handle feed is dominated by xAI's Grok Build CLI launch (three xAI staff plus the @xai mainline). NVIDIA pushes a CMU commencement video. Lex Fridman travel announcements (off-topic). Tesla Cybertruck marketing. Skip rate is high; Grok Build CLI is the single signal worth click-through.
Worth Watching
- AgentLens applied to WildClawBench trajectories, 60 days. Yesterday's 10.7% Lucky-Pass rate was measured on SWE-bench Verified. WildClawBench has process traces under native runtime with 20+ tool calls each. Re-measuring the Lucky-Pass fraction is one paper away. Falsifiable: a paper that reports Lucky-Pass rate on WildClawBench above 5%. If true, the 62.2% ceiling is even softer than the harness-sensitivity data suggests.
- EvolveMem + STALE composition, 60 days. EvolveMem auto-discovers retrieval configurations; STALE probes implicit conflicts. A retrieval policy that EvolveMem evolves specifically to detect stale state using STALE-style probes as training signal is the natural composition. Falsifiable: a paper that reports +X% on STALE State Resolution via EvolveMem-discovered retrieval policy.
- Online λ-star scheduler + SDAR gate, 60 days. The Extrapolation Cliff (05-14) gives a closed-form for OPD; SDAR uses a learned gate for multi-turn OPSD. Deriving a closed-form gate from the Cliff's three observables (p, b, c) extended to multi-turn is a one-paper rewrite. Falsifiable: a paper that ships this and shows fewer instability incidents than SDAR's learned gate.
- EvoEnv + Darwin Family composition, 120 days. EvoEnv constructs verifiable environments with solve-verify asymmetry; Darwin merges checkpoints without training. The composition is the closest thing to RSI-as-engineering the wiki has tracked. Falsifiable: a paper that runs evolve-and-merge cycles on a single backbone family with measured stability and reports gains above either component alone.
- Async continuous batching for RL rollouts, 90 days. NeMo-RL speculative rollouts (1.77x) composed with async continuous batching (1.22x) would give a 2x+ wall-clock improvement (1.77 × 1.22 ≈ 2.16x if the gains multiply) on the rollout phase, which is typically 60-70% of RL training cost. Falsifiable: an RL post-training paper that reports both numbers multiplicatively.
- Cross-architecture head-role transfer, 90 days. Forcing-KV finds static-vs-dynamic head roles stable across samples and timesteps in Self Forcing AR diffusion. Whether the same dichotomy holds in non-AR diffusion or pure-AR video generators is unstudied. Falsifiable: a paper that finds the same head-role split in a different architecture family.
- WildClawBench re-evaluation of Orchard-SWE, 60 days. Orchard reports 67.5% on SWE-bench Verified. WildClawBench would put that number under native-runtime grading. Given WildClawBench's 18-point harness spread, expect a significant drop. Falsifiable: a number, either by Orchard's authors or by an independent run.
- LLM-rated underrated from Kurate (this week's cs.AI and cs.LG leaderboards): cs.AI #5 "AI scientists produce results without reasoning scientifically" (ai_rating 8.5), cs.AI #13 "Emotion Concepts and their Function in a Large Language Model" (ai_rating 8.2, with William Saunders and Tom Henighan), cs.LG #4 "A Theory of Generalization in Deep Learning" (ai_rating 6.5, but a high score), cs.LG #11 "LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit" (ai_rating 8.0; actionable interpretability). The sycophancy-lying shared-circuit paper is the most actionable for responsible-ai: one intervention may address two failure modes. None appeared in HF today; all are top-15 on Kurate.
- Rising authors from Kurate: no authors crossed threshold this week. No new handles to add to connectors/twitter/config.json:ai_handles.
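The rollout-speedup item in the list above only pays off end to end in proportion to the rollout share of training. A small worked example follows, assuming the two speedups multiply cleanly and applying Amdahl's law to the quoted 60-70% rollout share; the function names are made up for this sketch.

```python
def combined_speedup(*factors):
    """Multiply independent speedup factors (assumes they compose cleanly)."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


def end_to_end_speedup(phase_fraction, phase_speedup):
    """Amdahl's law: only `phase_fraction` of the run gets faster."""
    return 1.0 / ((1.0 - phase_fraction) + phase_fraction / phase_speedup)


rollout = combined_speedup(1.77, 1.22)  # speculative rollouts x async batching
for fraction in (0.60, 0.70):           # rollout share of RL training cost
    overall = end_to_end_speedup(fraction, rollout)
    print(f"rollout share {fraction:.0%}: rollout {rollout:.2f}x, end-to-end {overall:.2f}x")
```

Under these assumptions the rollout phase speeds up by about 2.16x, but the whole training run speeds up by only roughly 1.5x to 1.6x, so a paper reporting both numbers should say which one it means.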
Quick Hits
DiffusionOPD (arXiv 2605.15055). Multi-task on-policy distillation for diffusion T2I models. Trains task-specific teachers, distills into a unified student along the student's own rollout trajectories. Theoretical contribution: lifts OPD from discrete tokens to continuous-state Markov processes; derives a closed-form per-step KL objective unifying SDE and ODE refinement via mean-matching. Shows the analytic gradient has lower variance than PPO-style policy gradients. Surpasses multi-reward RL and cascade RL baselines. The diffusion-side analogue of yesterday's Extrapolation Cliff (closed-form for OPD in LLMs). Tier 2 with Tier-1 implications for cross-modal distillation.
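To see why a per-step KL between student and teacher can collapse to mean-matching, consider the simplest case. This case is assumed here for illustration and is not taken from the paper: both per-step transitions are Gaussians with the same isotropic variance. The symbols \mu_\theta (student mean), \mu_T (teacher mean), and \sigma_t are notation introduced for this sketch.

```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu_\theta, \sigma_t^2 I)\,\middle\|\,\mathcal{N}(\mu_T, \sigma_t^2 I)\right)
  = \frac{\lVert \mu_\theta - \mu_T \rVert_2^2}{2\sigma_t^2},
\qquad
\nabla_{\mu_\theta}\,\mathrm{KL} = \frac{\mu_\theta - \mu_T}{\sigma_t^2}
```

Under that assumption the objective is a squared-error regression of the student mean onto the teacher mean, and its gradient contains no sampled reward or advantage estimate, which is at least consistent with the lower-variance claim relative to PPO-style policy gradients.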
Causal Forcing++ (arXiv 2605.15141). Frame-wise autoregression with 1-2 sampling steps for real-time interactive video. Causal consistency distillation (causal CD) gives single-online-teacher-ODE-step supervision, avoiding stored full-PF-ODE trajectories. In the frame-wise 2-step setting it surpasses 4-step chunk-wise Causal Forcing: +0.1 VBench Total, +0.3 VBench Quality, +0.335 VisionReward, 50% first-frame latency reduction, ~4x Stage-2 training cost reduction. Tier 3.
RAVEN (arXiv 2605.15190). Real-time autoregressive video extrapolation with consistency-model GRPO. Training-time test framework repacks each self-rollout into interleaved clean historical endpoints and noisy denoising states. CM-GRPO reformulates a consistency sampling step as a conditional Gaussian transition and applies online RL directly. Tier 3 video gen, but the consistency-model GRPO formulation generalizes to other flow-model RL settings.
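A rough sketch of the two ingredients this summary names: a Gaussian transition log-probability and a GRPO-style group-relative advantage. It is not the paper's CM-GRPO objective; clipping, KL terms, and the consistency-model parameterization are all omitted, the population standard deviation is an assumption, and the function names and rewards are hypothetical.

```python
import math
import statistics


def gaussian_log_prob(x, mean, sigma):
    """Log-density of x under an isotropic Gaussian N(mean, sigma^2 I)."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * sq / sigma**2 - d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled from the same context.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A policy-gradient surrogate weights each step's log-probability by its advantage.
step_log_prob = gaussian_log_prob([0.1, -0.2], [0.0, 0.0], sigma=1.0)
surrogate_term = advantages[0] * step_log_prob
```

Treating the sampling step as a Gaussian transition is what makes a tractable log-probability available, and that is the quantity a policy-gradient method needs.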
SANA-WM (arXiv 2605.15178). 2.6B-parameter open-source world model trained for one-minute 720P video generation. Hybrid linear attention (frame-wise Gated DeltaNet + softmax), dual-branch camera control, two-stage generation pipeline. Trained in 15 days on 64 H100s on ~213K public video clips; distilled variant runs on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720P clip in 34 seconds. 36x higher throughput than open baselines at comparable quality. Tier 3 vision, but the hybrid linear attention + consumer-GPU deployment closes the loop on this week's local-inference-via-Mamba-hybrid thread.
ATLAS (arXiv 2605.15198). Visual reasoning via functional tokens: a single discrete "word" serves as both agentic operation and latent visual reasoning unit. Generated via next-token prediction; no architectural modifications, no visual supervision. LA-GRPO (Latent-Anchored GRPO) stabilizes RL with a statically weighted auxiliary objective. Avoids the context-switching latency of agentic visual reasoning and the task-generalization weakness of latent reasoning. Tier 3.
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning (CLVR) (arXiv 2605.14876). Closed-loop visual reasoning: VLM logical planning + pixel-level diffusion + automated step-level visual verification + Proxy Prompt RL for long-context optimization stability + Delta-Space Weight Merge (DSWM) for 4-NFE inference. Outperforms open-source baselines, approaches proprietary. Tier 3 multimodal generation.
Warp-as-History (arXiv 2605.15182). Camera-controlled video generation from one training video. Turns camera-induced warps into pseudo-history with target-frame positional alignment. Zero-shot follows camera trajectories with frozen base; lightweight LoRA on one annotated video generalizes. No test-time optimization or architectural modifications. Tier 3.
RewardHarness (arXiv 2605.08703). Self-evolving agentic reward framework: instead of training reward models on hundreds of thousands of comparisons, evolve a tool/skill library from as few as 100 preference demonstrations. Orchestrator + frozen Sub-Agent + tool library refinement. Using 0.05% of EditReward data, achieves 47.4% on image-editing evaluation benchmarks, surpassing GPT-4o. Tier 3, useful for the agent-reward-modeling axis.
Beyond Individual Intelligence (LIFE survey) (arXiv 2605.14892). Multi-agent collaboration survey organized by the LIFE progression: Lay capability → Integrate via collaboration → Find faults via attribution → Evolve via self-improvement. Identifies cross-stage open challenges. Tier 2 reference work; complement to yesterday's Bystander Effect / Sovereignty Gap finding (multi-agent agreement under pressure suppresses correct answers).
MemEye and MemLens are detailed in the agent memory cluster summary. Both multimodal memory benchmarks find caption-only shortcuts and visual-fidelity loss to be core failure modes.
BOOKMARKS (arXiv 2605.14169). Active storyline memory for role-playing. Abstract not available at farm time; title is enough to place in the role-play storyline memory category. Tier 3 unless follow-up evidence elevates.
FutureSim (arXiv 2605.15188) and Nexus (arXiv 2605.14389). Abstracts not available at farm time. FutureSim is "replaying world events to evaluate adaptive agents" by title. Nexus is "agentic framework for time series forecasting." Skip until abstracts surface.
Simon Willison's "Not so locked in anymore" (blog). Mitchell Hashimoto quote on Bun migrating from Zig to Rust as evidence that programming languages are decreasingly a source of lock-in. Simon adds a conference anecdote about a coding-agent-driven React Native rewrite of legacy iPhone+Android apps, chosen because "if it turned out to be the wrong decision, they could just port back to native in the future." Coding agents reduce the cost of language and framework migration; lock-in is the new soft constraint.
Simon Willison's datasette-ip-rate-limit 0.1a0 (blog, release). Codex GPT-5.5 xhigh built the plugin to block IPs hammering datasette.io. Useful primitive for the agent-served-API thread.
Ken Huang Chapter 1: Hermes Agent Cost & Token-Usage Accounting (Substack). Detailed walkthrough of cost tracking in Claude Code (per-model tier table, session accumulator, OpenTelemetry counters, $5 threshold dialog, resume-aware session-ID gating). Hermes Agent normalizes 200+ models from 6 providers into CanonicalUsage + CostResult via Decimal pricing math. Production-agent plumbing; the kind of pattern that should be standardized in the harness layer the WildClawBench finding now makes load-bearing.
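A minimal sketch of what Decimal-based cost accounting can look like. The type names CanonicalUsage and CostResult come from the walkthrough; every field, the price table, the model name, and the SessionAccumulator shape are hypothetical and chosen only for illustration. It is not Hermes Agent's or Claude Code's actual code.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class CanonicalUsage:
    # Field names are hypothetical; the source only names the type.
    model: str
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class CostResult:
    model: str
    usd: Decimal


# Hypothetical per-million-token prices in USD, built from strings so that
# no binary floating-point rounding enters the arithmetic.
PRICES = {
    "example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}
MILLION = Decimal(1_000_000)


def cost_of(usage: CanonicalUsage) -> CostResult:
    """Price one normalized usage record."""
    price = PRICES[usage.model]
    usd = (
        Decimal(usage.input_tokens) * price["input"]
        + Decimal(usage.output_tokens) * price["output"]
    ) / MILLION
    return CostResult(model=usage.model, usd=usd)


class SessionAccumulator:
    """Running total with a one-shot threshold check (hypothetical shape)."""

    def __init__(self, threshold_usd: Decimal = Decimal("5")):
        self.total = Decimal("0")
        self.threshold = threshold_usd
        self.warned = False

    def add(self, result: CostResult) -> bool:
        """Add a cost; return True the first time the threshold is crossed."""
        self.total += result.usd
        if not self.warned and self.total >= self.threshold:
            self.warned = True
            return True
        return False
```

Keeping prices and totals in Decimal means thousands of tiny per-request costs sum exactly, which matters once a threshold dialog or a billing report depends on the total.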
Pragmatic Engineer Pulse on FDE (newsletter). Forward-deployed engineering hiring at Google, OpenAI, Anthropic. Mirrors the deployment-services-as-a-category thread (OpenAI Deployment Company 05-11, Google customer-adoption engineers 05-13).
Granite Embedding Multilingual R2 (HF blog). IBM Apache-2.0 multilingual embeddings, 32K context, "best sub-100M retrieval quality." No deep dive; relevant for RAG / agent-memory retrieval stack.
Sources ingested today: HF (26 papers), RSS (8 new posts for 2026-05-14: TLDR AI, Ken Huang agentic-ai, Pragmatic Engineer, 2 HuggingFace blog, 3 Simon Willison), Gmail (1 starred: Subquadratic Appen benchmark), Twitter morning slot (10 tweets / 0 retweets / 8 articles) + 05-14 evening (3 tweets) + 05-14 afternoon (1 tweet), Kurate cs.AI + cs.LG weekly leaderboards (no rising authors crossed threshold, no HF cross-source-confirmed papers), Reddit (8 subs scraped; 2 yielded posts after filters: LocalLLaMA 12, MLScaling 4; MachineLearning + CUDA + LLMDevs + ControlProblem + HPC + reinforcementlearning all empty), parallel Daily-Digest (none for 2026-05-15 in /Users/amitsinghbhatti/Documents/Claude/Projects/Daily-Digest/, latest is 2026-04-23) | Wiki pages updated: 14 (4 Tier 1 summaries: RouteProfile, DLR, Forcing-KV, async continuous batching; 4 Tier 2 agentic: WildClawBench, Orchard, SDAR, EvoEnv; 1 agent-memory cluster summary; 2 LLM summaries: SU-01, Darwin Family; 3 concept page updates: kv-cache.md, llm-routing.md, new agent-memory.md)