May 14, 2026 · daily digest

cere-bro | 2026-05-14

Three papers today land near the same axis at three different layers of the stack: MinT routes adapters at the catalog layer, Orthrus routes generation at the cache layer, the Extrapolation Cliff routes distillation at the loss layer. Pair that with the SemiAnalysis Cerebras piece and the Energy-to-Token position paper and the question stops being "how fast can the GPU go" and starts being "what is the binding constraint."


TL;DR


The Big Picture

The thread tying today's research is "where does the routing decision live." MinT puts it at the adapter catalog: million-scale LoRA revisions over one resident base, with the catalog as a first-class infrastructure surface. Orthrus puts it at the KV cache: two generation heads sharing one cache, with the cache as the coordination object. The Extrapolation Cliff puts it at the distillation loss: a closed-form threshold for where the student stops being safely correctable. Three papers in one day each push the same architectural move (sparse, locatable substructure as the right unit of work) down to a different layer of the inference and training stack. The wiki has been tracking this move since TIP (04-16) and LongAct (04-18). It is now eight papers strong across distillation, RL, merging, KV cache, parallel decoding, and adapter-serving.

The Energy-to-Token position paper makes a structurally different claim that lands on the same day, and it changes the value of those eight papers. If inference is energy-bounded rather than FLOPs-bounded, all the gains from selective eviction, parallel decoding, and catalog-level adapter routing convert directly into energy savings rather than throughput numbers. The SemiAnalysis Cerebras piece dropped 18 hours before this position paper and argues exactly that thesis empirically for the wafer-scale-engine class of hardware. The arXiv side now has a formal version of the same argument. Together they raise the question of what an "InferenceMax with watts" benchmark would look like. The team that ships that benchmark sets the new evaluation standard.

The third thread is benchmark integrity. Soohak's refusal subset on 05-12 said frontier models confidently answer ill-posed math problems. AgentLens today shows 10.7% of passing SWE-bench trajectories are Lucky. AssetOpsBench reports public-to-hidden score correlation of −0.13. Three papers in three days saying the same thing from different angles: aggregate leaderboard metrics over-aggregate, and pass-rate is misleading for between-model comparison. The cyber-eval doubling-rate trend from AISI (2026-05-13) gets harder to interpret in this light. How much of the doubling is Lucky Passes?


Deep Dives


MinT: million-scale LoRA serving over one 1T-class base

The catalog axis is now a routing surface. Holding 10^6 adapter revisions over a resident base is a managed-infrastructure problem, not a training trick.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 1. Inference efficiency, RL post-training infra, multi-policy serving

   Standard adapter pipeline           MinT pipeline
   ─────────────────────────           ─────────────────────────────
   train adapter ─► merge ─►           train adapter ─┐
   materialize full checkpoint         (under 1% of base size)
   ─► serve ─► rollback hard           │
                                       ▼
                                       managed lifecycle
                                       (rollout / update / export /
                                        eval / serve / rollback)
                                       │
                                       ▼
                                       one resident base (1T-class)
                                       10^6-scale adapter catalog
                                       packed MoE LoRA tensors
                                       cold-load as scheduled service

MinT treats LoRA adapters as the unit of deployment. Three scaling axes named in the paper: Scale Up to frontier dense and MoE (including MLA and DSA attention paths, validated beyond 1T total parameters); Scale Down by moving only the exported adapter (under 1% of base size in rank-1, giving an 18.3x measured step-time reduction on 4B dense and 2.85x on 30B MoE; concurrent multi-policy GRPO shortens wall time 1.77x at matched peak memory); Scale Out by separating durable policy addressability from CPU/GPU working sets (10^6 addressable catalog entries, 8.5-8.7x faster live engine loading with packed MoE LoRA tensors).
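
The Scale Out axis (durable addressability kept separate from the resident working set) can be pictured with a small sketch. This is a minimal illustration, not MinT's implementation: the `AdapterCatalog` class, the `load_fn` callback, and the LRU eviction rule are all assumptions introduced here, and the paper's actual scheduling of cold loads may differ.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class AdapterCatalog:
    """Durable catalog of adapter IDs with a small LRU working set.

    Hypothetical sketch: `load_fn` stands in for whatever cold-load path
    a real serving system uses. It is not an API from the MinT paper.
    """

    def __init__(self, load_fn: Callable[[str], bytes], max_resident: int) -> None:
        if max_resident < 1:
            raise ValueError("max_resident must be at least 1")
        self._load_fn = load_fn
        self._max_resident = max_resident
        # Durable side: every registered adapter stays addressable.
        self._locations: Dict[str, str] = {}
        # Working set: only a few adapters are resident at a time.
        self._resident: "OrderedDict[str, bytes]" = OrderedDict()
        self.cold_loads = 0

    def register(self, adapter_id: str, location: str) -> None:
        """Make an adapter addressable without loading its weights."""
        self._locations[adapter_id] = location

    def get(self, adapter_id: str) -> bytes:
        """Return adapter weights, cold-loading and evicting LRU if needed."""
        if adapter_id not in self._locations:
            raise KeyError(f"unknown adapter: {adapter_id}")
        if adapter_id in self._resident:
            self._resident.move_to_end(adapter_id)  # mark most recently used
            return self._resident[adapter_id]
        weights = self._load_fn(self._locations[adapter_id])
        self.cold_loads += 1
        self._resident[adapter_id] = weights
        if len(self._resident) > self._max_resident:
            self._resident.popitem(last=False)  # evict least recently used
        return weights

    def resident_ids(self) -> List[str]:
        """Adapter IDs currently in the working set, oldest first."""
        return list(self._resident)
```

The point of the sketch is only the separation of concerns: registering an adapter costs one dictionary entry, while residency is bounded by `max_resident` regardless of catalog size.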

The reason this matters is not the individual numbers, it is what becomes possible at million-scale. The wiki has been tracking routing systems since TraceR, CARE, and Sakana Conductor. Those papers route between models. MinT makes the routing decision within a single base, across an adapter catalog. That converts "which model" into "which adapter," which is a different and arguably cheaper routing surface. The CARE paper put bi-level routing at the expert layer; MinT puts it at the adapter layer. Together they bracket "where does the routing decision live" between the expert and adapter axes.

The wall-time number is the production-relevant one. 1.77x speedup on concurrent multi-policy GRPO at matched peak memory means an RL post-training pipeline that trains many policies in parallel under the same memory budget. This composes directly with NeMo-RL speculative-decoding rollouts (2026-04-30, 1.77x generation): the same compute serves multiple policies, each of which is itself faster. Two multiplicative gains on the rollout-cost axis.

Why it matters: LoRA was a fine-tuning trick. MinT makes it the catalog axis of a serving fleet. The unit economics of running specialized policies just changed.

Research angle: Three open problems. (1) Catalog routing as a learned policy: 10^6 adapters is too many for hand-tuned dispatch rules, so the routing decision becomes an RL problem in itself. (2) Adapter quality estimation under continual update: the pipeline produces many revisions per policy, and the paper offers no quality-selection mechanism. (3) Composition with Make Each Token Count's learned eviction: is the eviction policy per-adapter or shared across the catalog? Untested.

Full summary


Orthrus: dual-view AR+diffusion on a shared KV cache

The cache becomes the coordination object between two generation heads. Exact consensus means the output is bit-identical to autoregressive. 7.8x speedup. O(1) cache overhead.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 1. KV cache, parallel decoding, lossless inference acceleration

                              frozen LLM (base weights)
                                       │
                          ┌────────────┴─────────────┐
                          │      shared KV cache     │
                          └────────────┬─────────────┘
                                       │
                  ┌────────────────────┼────────────────────┐
                  │                    │                    │
           AR head (pre-fill)   diffusion head        consensus check
           builds the cache     parallel draft        (exact match)
                                                            │
                                                            ▼
                                                output identical to AR
                                                (7.8x speed, O(1) memory)

Orthrus puts two generation heads on the same frozen LLM and a single shared KV cache. The autoregressive head executes pre-fill and populates the cache; the diffusion head reads from that cache to draft tokens in parallel; an exact-consensus mechanism makes the output bit-identical to pure AR. The diffusion head is trainable but the LLM is frozen, so this is an inference-acceleration retrofit and not a re-pretraining recipe.

The structural novelty is the exact-consensus piece. Speculative decoding for RL rollouts uses a separate draft model and accepts the standard draft-verify tradeoff. Orthrus does verification implicitly through the shared cache and the consensus check, so the two heads never disagree on what got generated. That converts a probabilistic speedup ("most drafts accepted") into a deterministic one ("output is identical"). Lossless inference acceleration without a separate draft model is the new design point.
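
The "output is identical" invariant can be sketched with a generic draft-then-verify loop under greedy decoding. This is not Orthrus's consensus mechanism, which the digest does not spell out. The function names and the toy callables are assumptions, and the sketch checks drafted tokens one at a time, where a real system would verify a whole block in one batched pass to get any speedup.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces: a greedy autoregressive step and a parallel drafter.
NextToken = Callable[[Sequence[int]], int]
DraftBlock = Callable[[Sequence[int], int], List[int]]


def generate_with_consensus(
    prompt: Sequence[int],
    ar_next: NextToken,
    draft_block: DraftBlock,
    block_size: int,
    max_new_tokens: int,
) -> List[int]:
    """Emit tokens that always equal the greedy AR output.

    A drafted token is kept only if it matches what the AR step would
    have produced. On the first mismatch the AR token is used and the
    rest of the draft is discarded.
    """
    out = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(out) < target_len:
        k = min(block_size, target_len - len(out))
        draft = draft_block(out, k)[:k]
        if not draft:
            # No draft available: fall back to one plain AR step.
            out.append(ar_next(out))
            continue
        for drafted in draft:
            expected = ar_next(out)
            out.append(expected)  # the emitted token is always the AR token
            if drafted != expected:
                break  # discard the remainder of this draft
    return out[len(prompt):]
```

Because every emitted token comes from `ar_next`, draft quality affects only how much work is wasted, never what is generated.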

The composition with the recent wiki additions is the more interesting part. Two papers this week argue for "add a structure during one phase, deploy without it": Lighthouse Attention (@NousResearch retweet 05-12, paper 2605.06554) trains long-context with a removable subquadratic wrapper, and Token Superposition Training pre-trains with bag-of-tokens prediction in the first third while keeping the deployed model identical to standard NTP. Orthrus is the third paper in the same week landing on a related frame: the deployed model can run with structure (a parallel head) that the base never had, provided the structure does not change what gets generated. Three papers in one week converging on the asymmetric-training-or-inference axis.

The cache implication is the load-bearing one. After Make Each Token Count (eviction is policy-aware) and Orthrus (drafting is policy-identical), the KV cache is being treated as a programmable substrate, not a static buffer. The next paper in this thread is the composition: can eviction gates run while two views draft from the same cache? Untested.

Why it matters: The first lossless parallel-decoding architecture in the wiki where the output guarantee is exact rather than probabilistic and the added cache overhead is O(1). The cache stays the focal object.

Research angle: Three open questions. (1) Does the consensus rate hold at long context? Diffusion drafting historically degrades with sequence length; the abstract does not break down speedup by length. (2) Reasoning-model workloads (long CoT) are exactly where the AR bottleneck bites hardest. If reasoning models inherit the full 7.8x, the result is a serious production number; if not, the headline overstates. (3) Composition with selective eviction is the cleanest near-term experiment.

Full summary


The Extrapolation Cliff: a closed-form clip-safety threshold for on-policy distillation

Above λ-star, on-policy distillation stops being format-preserving and starts being format-collapsing. The threshold has a closed form in three measurable quantities. Pre-registered tests on Amazon Fashion hit their locked prediction windows.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 1. On-policy distillation, RL post-training, theoretical guarantees

  λ < 1            λ in [1, λ-star)            λ > λ-star
  ─────────        ──────────────────          ────────────
  conservative     student exceeds teacher     format collapse
  distillation     while staying inside        the extrapolated
                   the output contract         fixed point exits
                                                clip-safe region

  λ-star(p, b, c) = closed-form in three measurable quantities:
    p = teacher modal probability on the dominant equivalence class
    b = warm-start mass (how much SFT pre-OPD anchored the student)
    c = importance-sampling clip strength

The Extrapolation Cliff is the first paper in the wiki to give a closed form for where on-policy distillation breaks. The wiki has been tracking the OPD thread since TIP (04-16, only 10% of distillation tokens carry signal) and LongAct (04-18, sparse RL updates dominate dense). Those papers gave empirical evidence that distillation has structure. The Cliff turns that structure into a derivation: above λ-star, the extrapolated fixed point exits the clip-safe region, and format collapses.

The three pre-registered tests on Amazon Fashion are the methodological move. The paper locks predictions before running: a fine-grid cliff interval, a budget-extension test, a small-clip cross-prediction. All three fall within their locked prediction windows; the small-clip value matches the closed-form prediction below grid resolution. This is the first OPD paper in the wiki that pre-registers its theoretical claims.

Operating just below λ-star, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The interesting line in the result: NDCG@1 on parsed outputs is flat across λ, but parse validity sharply changes at the predicted boundary. The accuracy axis is not where the cliff lives; the format-adherence axis is. The student is competent throughout the range; it just stops producing valid output above λ-star.

The composition with the other RL-bound papers from this month is the load-bearing read. G-Zero on 05-12 gave the first formal best-iterate suboptimality bound in verifier-free self-play RL. The Cliff is the second formal bound in the same family, now for OPD with structured outputs. Two papers in three days putting bounds on previously empirical RL post-training pipelines. The theoretical era of RL-for-LLMs has begun.

Why it matters: RL post-training was a recipe-by-vibe field. The Cliff gives a closed-form prediction that holds in pre-registered tests. The economics get cleaner too: 1.7B at parity with 8B-SFT means a roughly 4.7x parameter reduction at no in-domain cost when you operate just below the cliff.

Research angle: (1) Generalize beyond Bernoulli and K-ary listwise: free-form structured outputs like XML and tool-call traces have different equivalence-class structure. (2) Online λ-star estimation: p, b, c are all observable per step, so a scheduler that estimates λ-star online and clips λ to (1−ε)λ-star is the cleanest extension to ship. (3) Does the cliff exist in RLVR/GRPO with structured outputs (agentic tool-calling)? If yes, this paper's framing extends well beyond distillation.
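
The scheduler in item (2) can be sketched as follows. The closed form for λ-star is not reproduced in this digest, so the sketch takes it as a caller-supplied function; `lambda_star_fn`, `safe_lambda`, `classify_regime`, and the default `eps` are hypothetical names and values chosen for illustration. Treating λ equal to λ-star as unsafe is also an assumption, since the regime diagram above leaves that boundary unspecified.

```python
from typing import Callable

# Placeholder type for the paper's closed form lambda_star(p, b, c).
# The actual expression is NOT given here and must be supplied by the caller.
LambdaStarFn = Callable[[float, float, float], float]


def classify_regime(lam: float, lam_star: float) -> str:
    """Map an extrapolation coefficient onto the three regimes in the diagram."""
    if lam < 1.0:
        return "conservative"
    if lam < lam_star:
        return "extrapolating"
    return "collapse-risk"  # assumption: the boundary itself counts as unsafe


def safe_lambda(
    requested: float,
    p: float,
    b: float,
    c: float,
    lambda_star_fn: LambdaStarFn,
    eps: float = 0.05,
) -> float:
    """Clip the requested lambda to (1 - eps) * lambda_star.

    p, b, c are the per-step observables named in the text: teacher modal
    probability, warm-start mass, and importance-sampling clip strength.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    lam_star = lambda_star_fn(p, b, c)
    return min(requested, (1.0 - eps) * lam_star)
```

Called once per training step with fresh estimates of p, b, and c, this keeps the run inside the extrapolating regime with a safety margin of `eps`.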

Full summary


Position: LLM inference is energy-to-token production

All current inference benchmarks (accuracy, latency, throughput, utilization) miss the binding constraint at deployment scale. The output is a quality-conditioned token bounded by both compute-per-token and energy-per-token ceilings. SemiAnalysis just shipped the empirical version; this is the formal version.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 1. Hardware-bounded inference, deployment economics, datacenter ceilings

  Standard inference benchmark              Energy-to-Token framing
  ────────────────────────────────          ───────────────────────────────────
  token rate ≤ compute_ceiling              token rate ≤ min(compute_ceiling,
                                                              energy_ceiling)

                                            energy_ceiling = delivered_power
                                                              / PUE / utilization

  binding constraint = peak FLOPS           binding constraint moves toward
                                            grid-delivered watts as buildouts
                                            hit capacity limits

The position paper argues that LLM inference is being mis-evaluated. Accuracy, latency, throughput, hardware utilization are all software/model metrics. At deployment scale, the actual product is a quality-conditioned token produced under joint constraints from effective compute, delivered datacenter power, cooling capacity, PUE, and utilization. The paper formalizes this with a Token Production Function: token rate bounded by both compute-per-token and energy-per-token ceilings. Listed API prices vary by over an order of magnitude across providers, but the paper is careful: price is directional motivation, not causal evidence of marginal cost. The core question is when the binding constraint moves from theoretical peak compute toward delivered power, cooling, and operational efficiency.
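
A rough numerical reading of the two-ceiling bound is sketched below. The diagram above gives the energy ceiling only in shorthand, so this sketch makes its own assumptions: it adds an explicit joules-per-token term to turn watts into a token rate, and it treats utilization as a multiplicative fraction of IT power. Neither choice is taken from the paper, and the `Site` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical deployment parameters (all names are assumptions)."""
    peak_flops: float         # FLOP/s available to the model
    flops_per_token: float    # FLOPs needed per generated token
    delivered_power_w: float  # watts delivered to the facility
    pue: float                # power usage effectiveness, >= 1
    utilization: float        # fraction of IT power doing token work, 0..1
    joules_per_token: float   # energy per generated token


def compute_ceiling(site: Site) -> float:
    """Tokens/s allowed by compute alone."""
    return site.peak_flops / site.flops_per_token


def energy_ceiling(site: Site) -> float:
    """Tokens/s allowed by delivered power alone (assumed form)."""
    it_power_w = site.delivered_power_w / site.pue  # power that reaches IT load
    return it_power_w * site.utilization / site.joules_per_token


def token_rate_bound(site: Site) -> float:
    """Token rate is bounded by the smaller of the two ceilings."""
    return min(compute_ceiling(site), energy_ceiling(site))


def binding_constraint(site: Site) -> str:
    """Name whichever ceiling is currently the tighter one."""
    return "energy" if energy_ceiling(site) < compute_ceiling(site) else "compute"
```

With these definitions, raising `peak_flops` does nothing for a site whose binding constraint is already "energy", which is the shift the position paper is pointing at.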

The SemiAnalysis Cerebras piece (newsletter, 05-13, Gmail-starred) is the empirical version of the same argument. Cerebras's wafer-scale engine wins on the interactivity dimension that HBM-based GPUs cannot match, precisely because the SRAM-per-FLOP ratio shifts the energy ceiling. The SemiAnalysis quote that captures it: "past a certain threshold of intelligence, developers prefer faster tokens to smarter tokens." Translated into the position paper's frame: above some compute floor, the binding constraint moves toward energy-per-token, where SRAM machines (Cerebras, Groq) shift the frontier. Opus 4.6 Fast charging 6x the price for 2.5x interactivity (now degraded to 1.75x) is the same revealed preference.

The wiki has been tracking the capacity-binding-constraint thread for two months: ByteDance $30B PRC-chip commitment (05-08), Anthropic-Colossus deal (05-08), Broadcom-OpenAI-Microsoft (05-10), NVIDIA $40B in AI partners (05-11). Every one of those implicitly makes the same thesis the position paper makes explicit: the cost floor on inference is no longer set by FLOPs but by delivered watts. Today's paper is the first arXiv-side framing in the wiki that formalizes this.

The routing implication is the load-bearing one. If inference is energy-bound, multi-model routing systems should optimize on energy-per-token-at-quality, not latency-at-quality. None of the routing papers in the wiki (Netflix State of Routing, Sakana Conductor, CARE) currently use this objective. The cleanest near-term routing research direction is now visible.
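
The change of objective can be sketched as a one-line difference in a router. This is an illustration only: the `Candidate` fields and the `route` function are hypothetical and do not describe any of the routing systems named above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    """A routable model with hypothetical, pre-measured characteristics."""
    name: str
    quality: float          # task quality score, higher is better
    joules_per_token: float
    latency_ms: float


def route(candidates: Iterable[Candidate], min_quality: float,
          objective: str = "energy") -> Candidate:
    """Pick the cheapest candidate that clears the quality floor.

    objective="energy" minimizes joules per token at quality;
    objective="latency" minimizes latency at quality.
    """
    if objective not in ("energy", "latency"):
        raise ValueError(f"unknown objective: {objective}")
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        raise ValueError("no candidate meets the quality floor")
    if objective == "energy":
        return min(eligible, key=lambda c: c.joules_per_token)
    return min(eligible, key=lambda c: c.latency_ms)
```

The two objectives can disagree on the same candidate set, which is why re-deriving existing routers under an energy objective is not a no-op.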

Why it matters: The benchmark we use for production inference is wrong. Throughput at fixed utilization is the historical answer; energy-per-token at fixed quality is the answer the deployment economics actually call for.

Research angle: (1) A measured benchmark, InferenceMax with watts. Whoever ships that sets the new evaluation standard. (2) PUE-conditioned pricing: do API prices start to correlate with regional grid carbon intensity in 2027? Falsifiable. (3) Routing-as-energy-allocation: re-derive existing routing systems under an energy objective; this is a one-paper rewrite of the routing literature.

Full summary


DAgger for LLM agents + AgentLens + MAP: the agentic-stack triangle

DAgger fixes covariate shift with dense teacher supervision on on-policy states. AgentLens shows that 10.7% of pass-rate wins are Lucky. MAP says the failure mode is delayed environmental perception and proposes a Map-then-Act paradigm. Three papers in one day pointing at the same diagnosis from three angles.

Sources: HuggingFace Daily Papers (all three)
Links: DAgger paper · DAgger wiki · AgentLens paper · AgentLens wiki · MAP paper · MAP wiki
Tier: 2. Long-horizon agent training, evaluation, planning paradigms

   AgentLens (measure)        DAgger (training-side fix)      MAP (architecture-side fix)
   ────────────────────       ──────────────────────────      ────────────────────────────
   10.7% of passing           interpolate student and         build env prior first
   trajectories on            teacher trajectories at         then act with the prior
   SWE-bench Verified         turn level; supervise           grounded
   are Lucky Passes           with teacher labels on
   (regression cycles,        on-policy states                Global Exploration
   blind retries, missing                                     ─► Task-Specific Mapping
   verification, temporal     +3.9 SWE-bench at 4B            ─► Knowledge-Augmented
   disorder)                  4B beats published 8B            Execution

         │                            │                              │
         └────────────────────────────┼──────────────────────────────┘
                                      │
                            shared diagnosis:
                  covariate-shifted trajectories produce
                  chaotic state distributions and Lucky Passes

DAgger is the 2011 Ross-Gordon-Bagnell algorithm re-applied to multi-turn LM agents. Collect trajectories by interpolating student and teacher policies at the turn level, then train the student with teacher labels on those trajectories. The student therefore sees realistic deployment states (not idealized teacher trajectories) and gets dense teacher feedback (not sparse outcome rewards). The gain is +3.9 over the strongest post-training baseline at 4B and +3.6 at 8B; the 4B model reaches 27.3% on SWE-bench Verified and beats several published 8B SWE-agent systems. This is the same prescription that TIP and the Extrapolation Cliff give for distillation, one level up: train on the data the deployed model will actually see.
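
The data-collection step of classic DAgger can be sketched in a few lines. This is the textbook mixing scheme applied to a toy environment, not the paper's multi-turn LM pipeline. The function name, the `step` transition, and the policy callables are assumptions for illustration.

```python
import random
from typing import Callable, List, Tuple, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def collect_dagger_trajectory(
    initial_state: State,
    step: Callable[[State, Action], State],
    student: Callable[[State], Action],
    teacher: Callable[[State], Action],
    beta: float,
    horizon: int,
    rng: random.Random,
) -> List[Tuple[State, Action]]:
    """Roll out a mixed policy and label every visited state with the teacher.

    At each turn the teacher acts with probability `beta`, otherwise the
    student acts. The training label is always the teacher's action, so
    the student gets dense supervision on states it actually reaches.
    """
    dataset: List[Tuple[State, Action]] = []
    state = initial_state
    for _ in range(horizon):
        teacher_action = teacher(state)
        dataset.append((state, teacher_action))
        acting = teacher_action if rng.random() < beta else student(state)
        state = step(state, acting)
    return dataset
```

With `beta=0` the visited states are entirely the student's, which is the covariate-shift case the method is meant to cover. With `beta=1` the rollout collapses to plain behavior cloning on teacher trajectories.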

AgentLens makes the diagnosis crisp. Of 2,614 OpenHands trajectories on SWE-bench Verified, 10.7% of the passing ones are Lucky: regression cycles, blind retries, missing verification, temporally disordered work. The framework merges per-task passing trajectories into a Prefix Tree Acceptor reference and uses a context-sensitive intent-stage labeler (Exploration / Implementation / Verification / Orchestration) that relies on trajectory history rather than tool identity. Some models drop 5 ranking positions when scored by quality instead of pass rate. AgentLens-Bench: 1,815 trajectories from 47 tasks across 8 model backends.
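
A Prefix Tree Acceptor is a trie that accepts exactly the sequences it was built from. The sketch below builds one over stage-label sequences. How AgentLens actually scores a trajectory against its reference is not described in this digest, so `matched_prefix_len` is a hypothetical stand-in for that step, and the single-letter stage labels are an assumption.

```python
from typing import Dict, Iterable, Optional, Sequence


class PrefixTreeAcceptor:
    """Trie over stage labels, accepting exactly the sequences added to it."""

    def __init__(self) -> None:
        self._children: Dict[str, "PrefixTreeAcceptor"] = {}
        self._accepting = False

    def add(self, stages: Sequence[str]) -> None:
        """Merge one passing trajectory's stage sequence into the tree."""
        node = self
        for stage in stages:
            node = node._children.setdefault(stage, PrefixTreeAcceptor())
        node._accepting = True

    def accepts(self, stages: Sequence[str]) -> bool:
        """True only if this exact sequence was added."""
        node: Optional[PrefixTreeAcceptor] = self
        for stage in stages:
            node = node._children.get(stage)
            if node is None:
                return False
        return node._accepting

    def matched_prefix_len(self, stages: Sequence[str]) -> int:
        """Number of leading stages that follow some reference path."""
        node: Optional[PrefixTreeAcceptor] = self
        matched = 0
        for stage in stages:
            node = node._children.get(stage)
            if node is None:
                break
            matched += 1
        return matched


def build_reference(trajectories: Iterable[Sequence[str]]) -> PrefixTreeAcceptor:
    """Build one reference acceptor from all passing trajectories of a task."""
    pta = PrefixTreeAcceptor()
    for stages in trajectories:
        pta.add(stages)
    return pta
```

Using E, I, V for Exploration, Implementation, Verification, a trajectory that goes E, I, I, V against a reference built from E, I, V diverges after two stages, which is the kind of process-level signal a pass/fail bit hides.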

MAP attacks the same problem at the architecture layer. The diagnosis is Delayed Environmental Perception: agents acquire knowledge reactively during execution and fall into trial-and-error loops because they didn't build the environmental prior. The fix is a three-stage paradigm: Global Exploration (env-general priors) → Task-Specific Mapping (cognitive map conditional on the task) → Knowledge-Augmented Execution. Frontier models surpass near-zero ARC-AGI-3 baselines in 22 of 25 game environments under MAP. The MAP-2K dataset of map-then-act trajectories beats training on expert execution traces, which is the paper's most interesting claim: understanding environments is more fundamental than imitating them.

Read together, the three papers form a complete agentic-stack triangle. AgentLens measures the symptom; DAgger fixes the trajectory distribution; MAP fixes the agent's relationship to the environment. The natural composition is: use AgentLens to filter Lucky-vs-Solid trajectories, use MAP-2K-style exploration data on top of student trajectories, then run DAgger to interpolate student-and-teacher policies under MAP. None of these papers proposes that composition. Three single-paper improvements; the joint paper hasn't been written.

Why it matters: SWE-bench-style pass-rate has been the agentic benchmark gold standard for a year. Today, three papers (one measurement, two prescriptions) make it clear that pass-rate alone over-aggregates and that the agentic stack has multiple separable failure modes. Anyone shipping an agent benchmark in 2026 Q3 will need to address process quality.

Research angle: Three threads to pull. (1) DAgger + MAP-2K composition is the cleanest near-term experiment. (2) Per-stage post-training targets from AgentLens labels: reward Verification more, penalize skipped Verification, and train RLVR with stage-conditional reward shaping. (3) Generalize AgentLens beyond OpenHands trajectories: if the 10.7% Lucky rate holds on Claude-Code, Aider, Cursor SWE workflows, every SWE-bench leaderboard needs re-ranking.

DAgger summary · AgentLens summary · MAP summary


WriteSAE: sparse autoencoders for the matrix-recurrent cache write

Residual SAEs can't reach state-space and hybrid models because those models write to a d_k × d_v cache via rank-1 outer products that no vector atom can replace. WriteSAE factors atoms into the native write shape, gives a closed form for per-token logit shift (R² = 0.98), and ships the first behavioral install at the cache-write site.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 2. Interpretability, recurrent / state-space models, mechanistic intervention

The interpretability thread has been moving from "find features" to "find features that can be installed" for two months. WriteSAE is the first paper that does this on the model class that transformer-tooling SAEs structurally cannot reach. Gated DeltaNet, Mamba-2, and RWKV-7 write to a d_k × d_v cache through rank-1 updates k_t v_t^T. A standard vector atom cannot replace a rank-1 outer product, so SAEs over the residual stream are blind to the substrate where these models actually store information.

WriteSAE factors each decoder atom into the native rank-1 outer-product shape, exposes a closed form for per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. The validation numbers are tight: 92.4% of 4,851 firings at Qwen3.5-0.8B L9 H4 beat matched-norm ablation, the closed form predicts measured effects at R² = 0.98, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. The behavioral result is the headline: sustained three-position installs at 3x lift bring mid-rank target-in-continuation from 33.3% to 100% under greedy decoding. First behavioral install at the matrix-recurrent write site in the wiki.
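
The "swap one cache slot at matched Frobenius norm" idea can be sketched on a toy state matrix. This is a simplified illustration in pure Python: it models the cache as a plain sum of rank-1 writes and ignores the gating and decay terms that Gated DeltaNet, Mamba-2, and RWKV-7 apply. The function names are assumptions, not WriteSAE's code.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def outer(k: Sequence[float], v: Sequence[float]) -> Matrix:
    """Rank-1 write k v^T with shape d_k x d_v."""
    return [[ki * vj for vj in v] for ki in k]


def frobenius(m: Matrix) -> float:
    """Frobenius norm. For a rank-1 matrix k v^T this equals |k| * |v|."""
    return math.sqrt(sum(x * x for row in m for x in row))


def add_scaled(state: Matrix, alpha: float, update: Matrix) -> Matrix:
    """Return state + alpha * update without mutating the inputs."""
    return [[s + alpha * u for s, u in zip(srow, urow)]
            for srow, urow in zip(state, update)]


def swap_write(state: Matrix, k: Sequence[float], v: Sequence[float],
               atom_k: Sequence[float], atom_v: Sequence[float]) -> Matrix:
    """Replace the write k v^T in `state` with a rank-1 atom of equal norm."""
    original = outer(k, v)
    atom = outer(atom_k, atom_v)
    atom_norm = frobenius(atom)
    if atom_norm == 0.0:
        raise ValueError("atom must be non-zero")
    scale = frobenius(original) / atom_norm    # match the Frobenius norm
    without_write = add_scaled(state, -1.0, original)
    return add_scaled(without_write, scale, atom)
```

Because both the original write and the atom are outer products, the substitution stays inside the native write shape, which is the property a vector atom over the residual stream cannot offer.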

The timing is what makes this load-bearing. Practitioner reports on r/LocalLLaMA confirm hybrid Mamba-DeltaNet architectures are now the default for long-context small models (Qwen 3.6 35B-A3B, MTP on Unsloth + Qwen3.6). The wiki's tracked architecture papers (MDN momentum DeltaNet, Nemotron-3 Super hybrid MoE) sit in exactly this class. Until WriteSAE, those models were opaque to mechanistic interpretability and to the kind of single-feature install that Kazemi's refusal-neurons paper (@hamid_kazemi22 retweet 05-12) demonstrated on dense transformers.

Why it matters: The interpretability tooling now reaches the architecture class that production inference is moving toward. Without this paper, mechanistic safety and interpretability would have been a transformers-only field while deployment shifted to hybrid models.

Research angle: (1) Cross-architecture atom transfer: do WriteSAE atoms trained on Mamba-2 transfer to Gated DeltaNet or RWKV-7? A positive result would be evidence of universal features in the state-space class. (2) Single-atom refusal install on hybrid models: the structural analogue of Kazemi's MLP-neuron result. (3) WriteSAE as a real-time monitor: the feature dictionary at the cache-write site is reusable for Hodoscope-style unsupervised monitoring.

Full summary


Industry Pulse


Connecting the Dots

                Today's research papers                Today's industry / social
                ───────────────────────                ─────────────────────────
 layer-by-layer routing:                               SemiAnalysis Cerebras (Gmail)
   MinT (catalog)  ───────────────────────────────────►  empirical: SRAM wins on
   Orthrus (KV cache)                                    interactivity-per-watt
   Extrapolation Cliff (distillation loss)               ▲
                                                         │
 Energy-to-Token (position)  ◄────────────────────────────┘
                                                         │
 agentic stack triangle:                                Pangram (slop FPR)
   DAgger (training)    ─┐                              ▲
   AgentLens (process) ──┼──► over-aggregation of      │
   MAP (architecture)   ─┘    leaderboard metrics  ────┘
                              now confirmed by Soohak,
                              AgentLens, AssetOps in 3 days

 WriteSAE  ─────────────────► refusal-neurons on hybrid
                              (Kazemi tweet 05-12)
                              interpretability stack reaches
                              state-space models

Cross-paper thread #1: "where does the routing decision live." MinT routes at the adapter catalog. Orthrus routes between two generation heads on a shared cache. The Extrapolation Cliff routes the on-policy distillation loss at a derived threshold. CARE (05-11) routed at the MoE expert layer; Sakana Conductor (05-11) routed between frontier models. Five layers of the stack now have explicit routing papers in three weeks. The next composition (and the one no paper has shipped yet) is the joint routing problem: which model → which adapter → which cache eviction policy → which decoding head → which distillation loss. The unified routing-decision-graph is the next architectural object.

Cross-paper thread #2: empirical and formal versions of the same energy thesis. The SemiAnalysis Cerebras piece dropped 18 hours before the Energy-to-Token position paper. Both argue the same thing: token rate is bounded by energy, and the interactivity-per-watt frontier matters more than the FLOPS-per-second frontier. The Anthropic Opus 4.6 Fast tier, the Cerebras IPO valuation, the Anthropic-Colossus capacity deal, the ByteDance $30B PRC-chip commitment, the Broadcom-OpenAI-Microsoft deal: every one of these is the same revealed-preference signal. The wiki should expect a benchmark in the next 90 days: InferenceMax-with-watts. If it lands, the routing literature gets rewritten on energy objectives.

Cross-paper thread #3: benchmark over-aggregation. Three papers in three days say pass-rate alone over-aggregates. Soohak (05-12): research-math frontier models confidently answer ill-posed problems. AgentLens (today): 10.7% of passing SWE-bench Verified trajectories are Lucky. AssetOpsBench (today): public-to-hidden score correlation of −0.13 across 234 submissions. Pangram (Algorithmic Bridge today) adds the slop-detection angle: 21% of ICLR 2026 reviews are AI-generated, so the conference-review benchmark itself has integrity problems. The cyber-eval doubling-rate from AISI (05-13) is the next benchmark to put under the Lucky-Pass lens. How much of the doubling is process-quality?

Cross-paper thread #4: agentic-stack triangle. AgentLens + DAgger + MAP form a complete diagnosis-and-prescription triangle for long-horizon agent quality. The Twitter signal from 05-13 amplifies this: @dair_ai's Bystander Effect retweet (arXiv 2605.10698) found that multi-agent systems suppress correct answers under social pressure; agents "compute the correct derivation internally but suffer Alignment Hallucinations" by appeasing a swarm. That is the same family of process-failure that AgentLens names from the trajectory side and that DAgger fixes from the supervision side. Four papers / signals in one week on the agentic-process-quality axis.

Cross-paper thread #5: interpretability follows the architecture. WriteSAE makes mechanistic interpretability reachable on the matrix-recurrent class that production is moving toward. Pair with the refusal-neurons retweet (single MLP neuron bypasses safety alignment across 7 dense transformer models). The refusal-neuron result on transformers is the dense-model version of what WriteSAE makes possible on hybrid models. The safety-tooling stack now has hooks into both architecture families.

Reddit practitioner-side signal: the Dynamic persistent tile scheduling with Cluster Launch Control on Blackwell post is a direct hit on the GPU-kernels concept page: Blackwell's CLC primitive is the hardware-side complement to the routing-decision-graph at the kernel level. r/LocalLLaMA's MoE-offload + KV-cache-quantization tutorial (24 tok/s on a GTX 1080 8GB) and MI50s Qwen 3.6 27B benchmark (52.8 tps generation, 1569 tps prefill) confirm hybrid + offload + quantization is the dominant consumer-GPU pattern. MMProLong's training recipe is what gets you a long-context model worth deploying in this stack. The llama.cpp Docker MTP images post closes the loop by making the deployment pattern reproducible.


Worth Watching


Quick Hits

Many-Shot CoT-ICL (arXiv 2605.13511). Many-shot in-context learning with chain-of-thought demonstrations does not behave like many-shot ICL on non-reasoning tasks. Three findings: scaling helps reasoning models, fails on non-reasoning; semantic similarity for retrieval fails on reasoning because procedural compatibility is orthogonal; order matters and CDS (Curvilinear Demonstration Selection) gives up to 5.42 pp on geometry with 64 demos. The framing line: long context is a structured curriculum, not a retrieval buffer. Cross-references Make Each Token Count (long context rewards selection, not aggregation) at the in-context-learning layer. → summary

Context Training with Active Information Seeking (arXiv 2605.13050). Equips context optimizers with Wikipedia search and browser tools. Naive integration degrades performance; the recovery is a search-based training procedure that maintains and prunes multiple candidate contexts. Gains on Flores+, HealthBench, LiveCodeBench, HLE. The optimized contexts transfer across models, which suggests they capture something about the problem rather than the model. → summary

MMProLong long-context VLM (arXiv 2605.13831). 7B Qwen2.5-VL extended 32K → 128K with only 5B tokens; generalizes to 256K and 512K beyond training. Long-document VQA > OCR; balanced length distribution beats target-length; retrieval is the bottleneck. The training-side complement to Make-Each-Token-Count's inference-side claim. → summary

Qwen-Image-VAE-2.0 (arXiv 2605.13565). High-compression VAE with Global Skip Connections, expanded latent channels, asymmetric attention-free encoder-decoder. New OmniDoc-TokenBench for text-rich documents. SOTA on reconstruction + superior diffusability. Tier 3, vision generation, but the asymmetric encoder-decoder design hints at the same "asymmetric training, identical inference" pattern as Orthrus and Lighthouse Attention.

Edit-Compass & EditReward-Compass (arXiv 2605.13062). 2,388 instances + 2,251 preference pairs for image editing and reward modeling. Substantial gap between proprietary and open-source; weakness clusters in world-knowledge understanding, visual reasoning, multi-image editing. Native MLLMs beat existing open-source reward models on the preference task. Tier 3, multimodal reward modeling.

AnyFlow + Asymmetric Flow Models (AnyFlow, AsymFlow). Two flow / diffusion papers from today. AnyFlow keeps test-time scaling behavior under consistency distillation via flow-map transition learning. AsymFlow restricts noise prediction to a low-rank subspace and gives the first route for fine-tuning pretrained latent flow models into pixel space; the pixel AsymFlow fine-tuned from FLUX.2 klein 9B beats its latent base on HPSv3, DPG-Bench, GenEval. Tier 3 vision generation, noted for the asymmetric-prediction design that echoes the day's other architectural papers.

DAWN World-Action Interactive Models (arXiv 2605.11550). Latent-space World-Action Interactive Model for autonomous driving. World predictor and action denoiser refine each other in a short explicit latent rollout. Tier 4 (autonomous driving) but the latent-rollout-as-coordination idea is the same architectural primitive as Orthrus's shared cache.

RoboEvolve (arXiv 2605.13775). VLM planner + VGM simulator in a co-evolutionary loop on unlabeled seed images. 50x data reduction vs supervised baselines. Tier 4 robotics, but the dual-model evolutionary frame is interesting.

PresentAgent-2 (arXiv 2605.11363). Generalist multimodal presentation agent across Single Presentation, Discussion, Interaction modes. Slides + audio + dynamic media. Adjacent to the agentic stack. Tier 3.

Visual Aesthetic Benchmark (arXiv 2605.12684). MLLMs at 26.5% on best/worst aesthetic selection vs 68.9% human-expert. Score-derived rankings align poorly with direct ranking. The comparative-vs-scalar split is the same point Auto-Rubric and DeltaRubric made for multimodal reward modeling on 05-12. Tier 3.

ShapeCodeBench (arXiv 2605.11680). Perception-to-program reconstruction benchmark. Best multimodal exact-match: 0.027. Best heuristic exact-match: 0.087. Far from saturated. Tier 3 with Tier-2 implications for compositional visual reasoning.

AssetOpsBench retrospective (arXiv 2605.08518). Public-to-hidden score correlation of −0.13 on 234 submissions to the CODS 2025 challenge. Third paper in the benchmark-skepticism thread; surfaces composite-metric and ranking-stability gaps. Pairs with AgentLens and Soohak.
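
For a sense of what a public-to-hidden correlation measures, here is a minimal Python sketch. It assumes a Pearson correlation (the summary above does not say which coefficient the paper uses), and the scores are made-up values for five hypothetical submissions, not AssetOpsBench data. The helper name pearson is likewise illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Made-up scores for five hypothetical submissions (assumption, not real data).
public_scores = [0.81, 0.74, 0.69, 0.66, 0.58]
hidden_scores = [0.52, 0.61, 0.49, 0.63, 0.55]

r = pearson(public_scores, hidden_scores)
print(f"public-to-hidden correlation: {r:+.2f}")
```

A value near +1 would mean the public leaderboard ranks submissions the way the hidden set does; a value near zero or below, as reported here, means public standing says little about hidden-set performance.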

Frequency Bias and OOD generalization in Neural Operators (arXiv 2605.12997). FNO degrades sharply on unseen high-frequency inputs, DeepONet degrades more gracefully. Tier 4 (PDE surrogates) but a clean architectural-bias result.

SemiAnalysis Cerebras (Gmail-starred + newsletter). Four-article-length writeup ahead of the IPO. Deep dive on the wafer-scale engine, the CS-3 system, BOM economics, the OpenAI 750MW deal, and the hybrid-bonding optical-transceiver roadmap. Read alongside today's Energy-to-Token position paper.

AI Weekly: $725B on AI slop (Gmail-starred AI Weekly issue). Hyperscaler capex framed as "$725B bet on what no one wanted." Pairs with Pangram's 21%-of-ICLR-reviews finding and the broader AI-content-and-trust thread.

AI Breakfast Googlebook coverage (Gmail-starred). Android-ChromeOS merger, Magic Pointer cursor, Gemini Intelligence proactive layer, Gemini Omni video editing model. The AI-first laptop category is now real with Acer, ASUS, Dell, HP, Lenovo as launch partners. The Reuters report of Google-SpaceX talks for Project Suncatcher orbital data centers is the long-tail signal.

Anders Hejlsberg on TypeScript, C#, Turbo Pascal (Pragmatic Engineer podcast). Worth listening to for the "AI is limited for writing compilers, for now" observation and "training-data volume is what makes AI great at TypeScript and Python." Adjacent to the AI-assisted-coding deployment thread.

Simon Willison's CSP Allow-list Experiment + Datasette blog launch (CSP, Datasette blog). The CSP experiment is the operational primitive for sandboxed-iframe AI agents that need network egress. The Datasette blog launch is a small one: Simon used Codex desktop and the Markdown session-transcript export feature. Both are practitioner-side.

Ken Huang's "Agentic AI Harness Pattern" + 10 new patterns (Substack). Cost & token accounting, cancellation, slash commands, working-directory resolution, trajectory compression, terminal UI, migrations, plugin discovery, specialized subagents, credential lifecycle. This is production-agent plumbing, a complement to the architectural patterns from the Claude Code / Hermes leaks. Useful reference catalog if you ship agents.


Sources ingested today: HF (22 papers), RSS (26 new posts for 2026-05-13), Gmail (3 starred: AI Weekly, SemiAnalysis Cerebras, AI Breakfast Googlebook), Twitter morning slot (22 tweets / 16 retweets / 13 articles) + 05-13 evening (3 tweets / 0 retweets) + 05-13 afternoon (1 tweet / 0 retweets), Kurate cs.AI + cs.LG weekly leaderboards (no rising authors crossed threshold), Reddit (8 subs, 12 posts after filters) | Wiki pages updated: 12 (4 Tier 1 summaries: MinT, Orthrus, Extrapolation Cliff, Energy-to-Token; 4 Tier 2 agentic: DAgger, AgentLens, MAP, Context Training; 1 Tier 2 LLM: Many-Shot CoT-ICL; 1 Tier 2 interpretability: WriteSAE; 1 Tier 2 multimodal: MMProLong; 3 concept-page updates: kv-cache.md, rl-for-llms.md, agent-benchmarks.md, responsible-ai.md)