cere-bro | 2026-05-19
The MoE design surface gets its fourth principled axis, the KV-cache stack adds the chunked-prefill kernel it had been missing, and a Newton-style solver makes the layer-by-layer assumption itself negotiable.
TL;DR
Today is a stack-completion day on two fronts and a structural surprise on a third. ZEDA converts a frozen mixture-of-experts model (a network whose layers route each token through a small subset of expert sub-networks) into a dynamic one via self-distillation, eliminating over 50% of expert compute at marginal accuracy loss and finishing a four-axis MoE map that BEAM, HodgeCover, and MoE-muP filled in over the past 72 hours. CompactAttention treats the 2D block-sparse attention mask as a selection signal rather than an execution plan, delivers a 2.72x attention speedup at 128K context under chunked prefill (the default long-context serving pattern), and is the fourth principled axis on the KV cache map (the KV cache being the store of attention keys and values from earlier tokens, kept so they are not recomputed). SNLP solves the layer stack as a nonlinear residual equation with Newton-style corrections and shows that the training objective that makes layer-parallel inference work also improves baseline perplexity by 4.7-23.4%. EndPrompt extends LLaMA from 8K to 64K context using short sequences plus a brief terminal prompt with positional indices near the target length, beating full-length fine-tuning at a fraction of the compute. LongLive-2.0 is the first end-to-end NVFP4 (NVIDIA's 4-bit floating-point format native to Blackwell) training-and-inference stack for video at frontier scale. On the industry side, Cursor ships Composer 2.5 on Kimi K2.5 with SpaceXAI follow-on training at 10x compute, NVIDIA hand-delivers the first Vera CPUs to four major AI labs, and Anthropic ships Fast Mode plus prompt-cache diagnostics on Opus 4.7.
The Big Picture
The two design surfaces the wiki has been mapping for a month, MoE architecture and KV cache management, both finished as four-axis stacks today. MoE now spans pre-training scaling (MoE-muP from 2026-05-17), per-token activation (BEAM from 2026-05-16), post-training resident-count compression (HodgeCover from 2026-05-18), and post-training static-to-dynamic conversion (ZEDA today). KV cache now spans learned eviction (Make Each Token Count from 2026-05-12), head-role compression (Forcing-KV from 2026-05-15), content-aware rescheduling (FashionChameleon from 2026-05-18), and chunked-prefill block-table construction (CompactAttention today), with NVFP4 quantization layering on the precision axis via LongLive-2.0. In both cases the integration paper that composes all four is one experiment away, and the frontier open-model wave that needs it has already shipped without a principled recipe.
The structural surprise sits on a third axis. SNLP opens layer-parallel inference, a dimension the wiki had no principled entry for, by solving the layer stack as a fixed-point equation with cheap architecture-induced surrogate Jacobians. The deeper finding is that the training-time regularization that makes the solver converge also lifts baseline perplexity by 4.7-23.4%. That mirrors what Make Each Token Count showed at the KV-eviction layer on 2026-05-12, where selective cache retention surpassed the full cache rather than just approximating it. Two independent pieces of evidence now suggest that inference acceleration and pre-training quality are not always at odds, which contradicts the field's default mental model.
A fifth deployment-calibration benchmark also lands today. AgentKernelArena measures coding agents on GPU kernel optimization, and finds the same pattern WildClawBench, CurveBench, PAGER, and DiagnosticIQ found in the past five days: agents look strong on configurations they saw and fragile on configurations they did not. Five benchmarks in five days across five distinct task surfaces is no longer coincidence. The wiki's running prediction, that the field would converge on representational interventions but still need per-domain calibration, is being confirmed at higher resolution every day.
Deep Dives
ZEDA: post-trained MoE can skip half its experts via self-distillation
Frontier open mixture-of-experts models ship as static top-K routers. ZEDA converts them to dynamic without retraining, eliminating over half of expert compute at marginal accuracy loss.
Source: HuggingFace Daily Papers 2026-05-19 Links: Paper · Wiki summary
What is it about? ZEDA is a post-training conversion method for mixture-of-experts language models. It takes a finished model (Qwen3-30B-A3B, GLM-4.7-Flash) and modifies how each layer routes tokens through its experts, without changing the trained weights.
What problem does it solve? Frontier open MoE releases use static top-K routing: every token goes through exactly K experts at every layer. That is wasteful because some tokens need less computation than others. Existing dynamic-routing methods require pre-training intervention, which operators of frozen open models cannot apply.
What's the core novelty? Parameter-free zero-output experts are injected into each layer. The router gets K+1 slots per token, and selecting the zero expert means skipping computation at this layer for this token. Two-stage self-distillation from the original frozen MoE as teacher, plus a group-level balancing loss, prevents router collapse onto the zero option.
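The zero-expert idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the router scores E real experts plus one extra zero-expert column and picks the top K of those E+1 candidates, and the names `route_with_zero_expert` and `moe_layer` are hypothetical. Self-distillation and the balancing loss are not modeled.

```python
import numpy as np

def route_with_zero_expert(logits, k):
    """Pick the top-k candidates per token over E real experts plus one zero expert.

    logits: (tokens, E + 1). The last column is the (assumed) zero-expert score.
    Returns chosen ids, softmax weights over the chosen slots, and a skip mask.
    """
    zero_id = logits.shape[1] - 1
    chosen = np.argsort(-logits, axis=1)[:, :k]
    picked = np.take_along_axis(logits, chosen, axis=1)
    weights = np.exp(picked - picked.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    skipped = chosen == zero_id  # these slots cost no expert compute
    return chosen, weights, skipped

def moe_layer(x, expert_mats, logits, k):
    """Toy MoE layer: each expert is a single matrix; zero-expert slots add nothing."""
    chosen, weights, skipped = route_with_zero_expert(logits, k)
    out = np.zeros_like(x)
    expert_calls = 0
    for t in range(x.shape[0]):
        for slot in range(k):
            if skipped[t, slot]:
                continue  # parameter-free zero output: skip the matmul entirely
            out[t] += weights[t, slot] * (x[t] @ expert_mats[chosen[t, slot]])
            expert_calls += 1
    return out, expert_calls

x = np.ones((2, 4))
experts = [np.eye(4) * (i + 1) for i in range(3)]
# Token 0 ranks the zero expert first; token 1 ranks it last.
logits = np.array([[0.0, 1.0, 2.0, 5.0], [3.0, 2.0, 1.0, -5.0]])
out, calls = moe_layer(x, experts, logits, k=2)
print(calls, "expert calls instead of", x.shape[0] * 2)
```

In this toy setup the zero expert absorbs routing weight as well as compute, which is one plausible design choice; the summary does not say how ZEDA normalizes weights.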
Key takeaways
- Over 50% of expert FLOPs eliminated at marginal accuracy loss across 11 math, code, and instruction benchmarks.
- Beats the strongest dynamic-MoE baseline by 6.1 points on Qwen3-30B-A3B and 4.0 points on GLM-4.7-Flash.
- End-to-end inference speedup of roughly 1.20x; gap between FLOP reduction and wall-clock is the standard MoE routing-and-dispatch overhead.
- Self-distillation pattern needs no external teacher and no task data, making the conversion routine.
Gaps in the study Only two MoE models tested. Whether the conversion holds on Gemma 4 26B-A4B, DeepSeek V4 Flash, Kimi K2.6, or MiMo-V2.5-Pro is unknown. The 1.20x wall-clock at 50%+ FLOP reduction also suggests the routing overhead absorbs most of the theoretical gain on current vLLM kernels.
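A back-of-envelope Amdahl-style check makes the FLOP-versus-wall-clock gap concrete. The sketch assumes baseline latency splits cleanly into a part that scales with expert FLOPs and a part that does not; that split is an assumption, and only the 50% and 1.20x figures come from the summary above.

```python
def implied_scalable_fraction(speedup, flop_reduction):
    """Solve speedup = 1 / (1 - f * flop_reduction) for f, the share of
    baseline latency that actually shrinks when expert FLOPs shrink."""
    return (1.0 - 1.0 / speedup) / flop_reduction

def speedup_for_fraction(fraction, flop_reduction):
    """Forward direction of the same Amdahl-style relation."""
    return 1.0 / (1.0 - fraction * flop_reduction)

f = implied_scalable_fraction(speedup=1.20, flop_reduction=0.50)
print(f"implied FLOP-proportional share of latency: {f:.3f}")
print(f"ceiling if all latency scaled with FLOPs: {speedup_for_fraction(1.0, 0.50):.2f}x")
```

Under this simple model, roughly a third of baseline latency behaves as FLOP-proportional expert compute, which is consistent with the routing-and-dispatch explanation but does not prove it.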
Industrial implication Frontier open MoEs do not get retrained from scratch in response to research. ZEDA is the first wiki entry showing that the static-to-dynamic conversion is cheap enough (two-stage self-distillation, no external teacher) to apply as a routine post-release step. Expect serving stacks to ship this within a quarter once a custom kernel closes the routing-overhead gap.
CompactAttention: chunked-prefill KV speedup via Block-Union selection
The 2D block-sparse attention mask is no longer the thing the kernel runs. It becomes the input to a union construction that builds a minimal block table under paged execution. 2.72x at 128K context.
Source: HuggingFace Daily Papers 2026-05-19 Links: Paper · Wiki summary
What is it about? CompactAttention is a sparse-attention kernel design for chunked prefill (the long-context serving pattern that splits a prompt into chunks because the full prompt would exceed memory). It changes how a sparse mask gets turned into actual GPU work.
What problem does it solve? Existing sparse-attention kernels treat the block-sparse mask as the kernel's execution plan and iterate over it directly. Under chunked prefill, the chunk size caps the query length, which makes these kernels inefficient at small Q. The most recent prior method, QUOKA, sidesteps this by subsampling queries and selecting at the token level, but misses query-specific KV entries and requires an explicit cache-copy step.
What's the core novelty? The structural move is mask-as-selection-signal, not mask-as-execution-plan. Two unions construct the block table: a Q-block union collects every KV block any Q-block in the chunk selected, and an intra-group union collects every KV block any head in the same Grouped-Query Attention group selected. The result is the minimal block table per GQA group, accessed in place under paged execution, with no KV compaction.
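The two unions are easy to express on a boolean mask. The sketch below assumes the mask for one chunk has shape (query heads, Q-blocks, KV-blocks) and that consecutive query heads form a GQA group; the function name `block_union_table` is hypothetical, and the real kernel's paged block-table layout is not modeled.

```python
import numpy as np

def block_union_table(mask, heads_per_group):
    """Build one KV block list per GQA group from a block-sparse mask.

    mask: bool array of shape (num_q_heads, num_q_blocks, num_kv_blocks).
    Returns a list with one sorted array of selected KV block ids per group.
    """
    num_heads, _, num_kv_blocks = mask.shape
    if num_heads % heads_per_group != 0:
        raise ValueError("heads must divide evenly into GQA groups")
    # Q-block union: a KV block is kept if any Q-block in the chunk selected it.
    per_head = mask.any(axis=1)
    # Intra-group union: kept if any head in the same GQA group selected it.
    grouped = per_head.reshape(-1, heads_per_group, num_kv_blocks)
    per_group = grouped.any(axis=1)
    return [np.flatnonzero(row) for row in per_group]

mask = np.zeros((4, 2, 6), dtype=bool)
mask[0, 0, 1] = True   # head 0, Q-block 0 picks KV block 1
mask[0, 1, 3] = True   # head 0, Q-block 1 picks KV block 3
mask[1, 0, 3] = True   # head 1 (same group) also picks KV block 3
mask[2, 1, 5] = True   # head 2 picks KV block 5
mask[3, 0, 0] = True   # head 3 (same group as head 2) picks KV block 0
table = block_union_table(mask, heads_per_group=2)
print([blocks.tolist() for blocks in table])
```

The mask here only decides which KV blocks appear in the table; the attention kernel would then run densely over those blocks in place.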
Key takeaways
- 2.72x attention speedup at 128K context on LLaMA-3.1-8B-Instruct.
- RULER accuracy stays close to dense baseline.
- Composes naturally with any existing mask-generation method.
- Works at the GQA-group granularity that real serving stacks already use.
Gaps in the study Tested up to 128K only. Whether the speedup extends or saturates at 256K and 512K is the load-bearing extrapolation. No composition tests with learned eviction (Make Each Token Count) or head-role compression (Forcing-KV).
Industrial implication Chunked prefill is the default long-context serving pattern. Any operator running at 64K context and above will adopt the union construction within months once a reference kernel ships. The mask-as-selection-signal insight likely generalizes to every other sparse-attention method already in production.
EndPrompt: long-context extension from short training sequences
Exposing the model to long-range positional distances does not require physically long sequences. A short context plus a brief terminal prompt with positional indices near the target length beats full-length fine-tuning at a fraction of the compute.
Source: HuggingFace Daily Papers 2026-05-19 Links: Paper · Wiki summary
What is it about? EndPrompt is a recipe for extending a model's context window (8K to 64K in the experiments) without ever training on sequences at the target length.
What problem does it solve? Full-length fine-tuning is the standard recipe for context extension and incurs quadratic memory plus compute. Chunk-based simulation approaches split contiguous context into pseudo-long segments, which sacrifices semantic continuity. Operators want long context without the bill.
What's the core novelty? Two-segment construction. Segment one is the original short context. Segment two is a brief terminal prompt, but its positional indices are placed near the target length (e.g. near position 64K). Local distances live inside each segment, but the long-range relative distance lives across the boundary, inside a physically short sequence. The theoretical analysis grounds this in Rotary Position Embedding smoothness (RoPE, the positional encoding that rotates query and key vectors by token-position-dependent angles) and the Bernstein inequality.
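The two-segment construction amounts to a choice of position ids. A minimal sketch follows, assuming the terminal prompt is placed so its last token sits at index `target_len - 1`; the exact placement in the paper may differ, and `endprompt_position_ids` is a hypothetical name.

```python
import numpy as np

def endprompt_position_ids(context_len, prompt_len, target_len):
    """Position ids for [short context | terminal prompt].

    The context keeps positions 0..context_len-1. The terminal prompt is
    assigned positions ending at target_len - 1 (an assumed placement).
    """
    if context_len + prompt_len > target_len:
        raise ValueError("segments do not fit inside the target length")
    context_pos = np.arange(context_len)
    prompt_pos = np.arange(target_len - prompt_len, target_len)
    return np.concatenate([context_pos, prompt_pos])

ids = endprompt_position_ids(context_len=8, prompt_len=3, target_len=64)
print(ids.tolist())
print("physical length:", len(ids), "max relative distance:", int(ids[-1] - ids[0]))
```

The physical sequence stays 11 tokens long while the largest relative distance a RoPE-based attention layer sees is 63, which is the effect the summary describes at 8K-to-64K scale.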
Key takeaways
- 76.03 RULER average on 8K-to-64K extension, beating LongLoRA at 72.95, LCEG at 72.24, and full-length fine-tuning at 69.23.
- Highest LongBench average among the tested methods.
- Compute discount is substantial, since training stays at short-sequence cost.
- The smoothness-and-Bernstein argument suggests the discount comes from a structural property, not a hyperparameter tuning win.
Gaps in the study Tested only at 64K extension. Whether the result holds at 128K and 256K is the open question. The RoPE-specific analysis does not transfer obviously to linear-attention substrates like Mamba2 or DeltaNet.
Industrial implication If this replicates at 128K and 256K, every operator extending an open-model release will adopt it. The full-length-fine-tuning bill has been the bottleneck. EndPrompt cuts it to short-sequence-fine-tuning cost at better RULER.
SNLP: layer-parallel inference via structured Newton corrections
The layer-by-layer assumption inside a single forward pass is negotiable. Newton-style corrections with architecture-induced surrogate Jacobians give a 2.3x wall-clock speedup, and the training-time regularizer that makes it work also lifts baseline perplexity by up to 23.4%.
Source: HuggingFace Daily Papers 2026-05-19 Links: Paper · Wiki summary
What is it about? SNLP attacks the sequential dependency between Transformer layers within one forward pass. Tensor and pipeline parallelism reduce per-layer latency or help across requests, but the within-pass layer-by-layer order has not been parallelized at scale before.
What problem does it solve? Exact Newton iteration over the layer stack needs Jacobian-vector products at every layer, which costs as much as the original sequential forward. Naive fixed-point iteration on a trained Transformer is unstable. Layer-parallel inference within a forward pass had no principled entry until today.
What's the core novelty? Treat the hidden-state trace across L layers as the fixed point of a nonlinear residual equation. Replace exact Jacobians with cheap architecture-induced surrogates. For residual Transformers (the dominant family), the surrogate is the identity and the Newton correction reduces to a prefix-sum-style update across layers (Identity Newton, IDN). For multi-head Compressed attention architectures, the surrogate is the residual mixing matrix the architecture already computes (HC Newton, HCN). SNLP-aware regularization at training time makes one or a few iterations approximate the full sequential forward.
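One way to read the Identity Newton update is as a cumulative sum over per-layer residuals. The sketch below is a reconstruction from the description above, not the paper's code: it assumes a residual stack h_{l+1} = h_l + f_l(h_l), defines the residual F_l = h_l - h_{l-1} - f_{l-1}(h_{l-1}), and replaces each layer Jacobian with the identity so the Newton correction becomes a prefix sum of F. The toy tanh layers and all function names are hypothetical.

```python
import numpy as np

def make_layers(num_layers, dim, seed=0, scale=0.1):
    """Toy residual branches f_l(h) = tanh(h @ W_l)."""
    rng = np.random.default_rng(seed)
    weights = [scale * rng.standard_normal((dim, dim)) for _ in range(num_layers)]
    return [lambda h, W=W: np.tanh(h @ W) for W in weights]

def sequential_forward(layers, h0):
    """Reference: the usual layer-by-layer pass, returning the full trace."""
    trace = [h0]
    for f in layers:
        trace.append(trace[-1] + f(trace[-1]))
    return np.stack(trace)

def identity_newton_step(layers, trace):
    """One correction with identity surrogate Jacobians.

    Every branch is evaluated on the current guess (independent across
    layers, so parallelizable in principle), then a prefix sum of the
    residuals corrects the whole trace at once.
    """
    branch = np.stack([f(h) for f, h in zip(layers, trace[:-1])])
    residual = trace[1:] - trace[:-1] - branch
    new_trace = trace.copy()
    new_trace[1:] -= np.cumsum(residual, axis=0)
    return new_trace

num_layers, dim = 6, 8
layers = make_layers(num_layers, dim)
h0 = np.linspace(-1.0, 1.0, dim)
exact = sequential_forward(layers, h0)
trace = np.tile(h0, (num_layers + 1, 1))  # initial guess: every layer equals h0
for step in range(1, num_layers + 1):
    trace = identity_newton_step(layers, trace)
    print(step, float(np.abs(trace - exact).max()))
```

In this toy setting the trace matches the sequential pass after at most L steps, in line with the takeaway that exact convergence recovers sequential execution. The practical claim is that a regularized model gets close enough in one or a few steps.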
Key takeaways
- 2.3x wall-clock speedup on a 0.5B Nanochat-scale model with PPL improved by 6.1%.
- SNLP-aware regularization improves baseline sequential PPL by 4.7% to 23.4% as a side effect.
- HCN aligns with the multi-head Compressed family that DeepSeek V4 ships, surveyed in Raschka's 2026-05-17 architecture catalog.
- Exact convergence of Newton recovers sequential execution, so the method does not give a monotonic test-time scaling axis.
Gaps in the study Off-the-shelf pretrained models are less amenable; the regularizer must be in the training recipe. Tested only at 0.5B scale. Whether HCN beats IDN at frontier MoE scale is unknown.
Industrial implication A frontier lab that bakes SNLP regularization into the next pre-training run could get both better baseline PPL and faster wall-clock inference (2.3x at the 0.5B scale tested). The win is contingent on training-time adoption, which is the deployment caveat. If any major lab adopts it, the layer-parallel axis is suddenly live.
LongLive-2.0: first end-to-end NVFP4 video training and inference
NVFP4 across both training and inference for long video generation. The first wiki paper to operationalize 4-bit floating-point end-to-end on a frontier-grade generative stack.
Source: HuggingFace Daily Papers 2026-05-19 Links: Paper · Wiki summary
What is it about? LongLive-2.0 is a long-video diffusion training and inference system running entirely in NVFP4 (NVIDIA's 4-bit floating-point format native to Blackwell hardware), with both weights and activations at 4 bits and a 4-bit KV cache.
What problem does it solve? Blackwell adoption is the dominant hardware story of 2026, but no paper had operationalized NVFP4 end-to-end on a frontier-grade generative stack. Video especially has long-sequence demands that make memory-bound bottlenecks worse than text.
What's the core novelty? Balanced sequence-parallel training pairs clean-history and noisy-target temporal chunks on each rank with sequence-parallel-aware chunked VAE encoding. Inference uses W4A4 NVFP4 on Blackwell with NVFP4-quantized KV cache and asynchronous streaming VAE decoding. The teacher-forcing layout is co-designed with the sequence-parallel execution layout, so the natural teacher-forcing mask becomes SP-aware.
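For readers new to 4-bit floating point, a simulated block quantizer shows what W4A4 does to a tensor. This is a simplified sketch, assuming E2M1 element values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and one scale per 16-element block. It keeps the scale in full precision, which real NVFP4 hardware does not, and it is unrelated to the LongLive-2.0 code. The function name is hypothetical.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_nvfp4_like(x, block_size=16):
    """Quantize-dequantize x in blocks, each with its own scale.

    Assumption: the per-block scale maps the block's largest magnitude to 6,
    the top of the E2M1 grid, and is stored here as an ordinary float.
    """
    x = np.asarray(x, dtype=np.float64)
    if x.size % block_size != 0:
        raise ValueError("size must be a multiple of block_size")
    blocks = x.reshape(-1, block_size)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)
    scaled = blocks / scale
    # Round each magnitude to the nearest grid point.
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    dequantized = np.sign(scaled) * E2M1_GRID[nearest] * scale
    return dequantized.reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
q = fake_quantize_nvfp4_like(x)
print("max abs error:", float(np.abs(x - q).max()))
```

Because the widest gap on the grid is between 4 and 6, the worst-case error per element in this sketch is one sixth of the block's largest magnitude.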
Key takeaways
- 2.15x training speedup, 1.84x inference speedup, 45.7 FPS at 5B parameters.
- KV cache also quantized to NVFP4, lowering inter-GPU communication during sequence-parallel execution.
- Self-Forcing bypass: prior Self-Forcing methods required ODE initialization plus distribution-matching distillation. LongLive-2.0 directly tunes a diffusion model into a long multi-shot interactive AR diffusion model.
- Real-time generation available as a standalone LoRA at 4-to-2 denoising steps.
Gaps in the study Only demonstrated on video. Whether the NVFP4 stack generalizes to text LLMs at frontier scale is the open question. Non-Blackwell fallback path adds engineering complexity.
Industrial implication Text LLMs targeting Blackwell B200 and B300 will need similar end-to-end NVFP4 stacks to extract the new hardware's throughput. LongLive-2.0 is the first system-paper template for what that stack looks like in practice. Expect a text-LLM analog within 90 days.
PUMA: semantic-preserving early exit for reasoning models
Answer-level early-exit signals reflect answer readiness, not reasoning convergence, so they trigger before the model finishes self-correcting. PUMA reads reasoning-level semantic redundancy instead. 26.2% token reduction at preserved accuracy.
Source: HuggingFace Daily Papers 2026-05-19 Links: Paper · Wiki summary
What is it about? PUMA is a plug-and-play inference-time stopping policy for Large Reasoning Models (LRMs), models that produce long chain-of-thought traces before final answers.
What problem does it solve? LRMs overthink. They keep reasoning after a solution has stabilized, wasting tokens and increasing latency. Existing early-exit methods use answer-level signals (confidence, trial-answer consistency) that trigger before the model has finished exploring, which degrades accuracy and leaves the retained reasoning prefix semantically incomplete.
What's the core novelty? PUMA reads reasoning-level semantic redundancy. A lightweight Redundancy Detector identifies candidate exits when successive reasoning steps no longer add novel progress and instead revisit established conclusions. Answer-level verification gates the actual exit. Stopping is conditioned on both reasoning saturation and answer verification, not either alone.
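The two-gate structure can be illustrated with stand-ins. In the sketch below, word-overlap (Jaccard) similarity substitutes for PUMA's learned Redundancy Detector, and agreement between the last few trial answers substitutes for its answer-level verification. Both substitutions, the thresholds, and the function names are assumptions made only for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity; a crude stand-in for a semantic redundancy score."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0

def should_exit(steps, trial_answers, redundancy_threshold=0.8, stable_answers=2):
    """Exit only if the newest step is redundant AND recent trial answers agree."""
    if len(steps) < 2 or len(trial_answers) < stable_answers:
        return False
    newest = steps[-1]
    # Gate 1: the newest step mostly restates some earlier step.
    redundant = max(jaccard(newest, prev) for prev in steps[:-1]) >= redundancy_threshold
    # Gate 2: the last few trial answers exist and are identical.
    recent = trial_answers[-stable_answers:]
    verified = recent[0] is not None and all(ans == recent[0] for ans in recent)
    return redundant and verified

steps = ["x equals 4 so the area is 16", "so the area is 16 since x equals 4"]
print(should_exit(steps, ["16", "16"]))   # redundant step, stable answer
print(should_exit(steps, ["12", "16"]))   # redundant step, answer still moving
```

Requiring both gates is what separates this from answer-only early exit: a stable trial answer alone does not trigger a stop while new reasoning is still being produced.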
Key takeaways
- 26.2% average token reduction across five LRMs and five reasoning benchmarks at preserved accuracy.
- Coherence of the retained reasoning prefix is preserved (a property answer-level methods lose).
- Generalizes to code generation and zero-shot vision-language reasoning.
- The stopping policy can be internalized into the model rather than applied as an external detector.
Gaps in the study No compositional test with train-time reasoning interventions (CIPO from 2026-05-18, the on-policy-failure-recycling method; NudgeRL from 2026-05-18, the strategy-context-conditioned rollout method). Whether reasoning-level redundancy as an RLVR reward generalizes is untested.
Industrial implication Test-time compute is the third leg of frontier model economics, and LRMs spend most of it on long CoT. A 26.2% reduction at preserved accuracy is large enough to dominate ad-hoc tricks like temperature reduction or hard token caps. Reasoning-level signals are also interpretable, which lets operators audit which steps got flagged redundant.
AgentKernelArena: agentic GPU-kernel benchmark with generalization protocol
The first GPU-kernel-optimization benchmark designed for full agent workflows. Cursor Agent, Claude Code, and Codex Agent hit mean speedups of 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton on seen shapes. On unseen shapes, PyTorch-to-HIP correctness drops substantially, because agents hardcode shape-specific assumptions.
Source: HuggingFace Daily Papers 2026-05-19 Links: Paper · Wiki summary
What is it about? AgentKernelArena is a benchmark for coding agents that optimize GPU kernels. It measures full agent workflows (reading code, invoking compilers and profilers, iterating) rather than single-LLM-call generation.
What problem does it solve? The wiki tracked KernelBench-X on 2026-05-09 as the single-LLM-call benchmark covering 16 frontier models on 250 PyTorch operations. Agent workflows are different in kind and had no measurement infrastructure until now.
What's the core novelty? 196 tasks across three modes: HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. Evaluation runs in isolated workspaces with gated compilation, correctness, and performance checks. An unseen-configuration generalization protocol tests whether optimizations transfer to input shapes the agent never observed during its run.
Key takeaways
- Cursor Agent, Claude Code, and Codex Agent reach 6.89x mean speedup on PyTorch-to-HIP, 6.69x on HIP-to-HIP, 2.13x on Triton-to-Triton.
- Correctness rates near perfect on seen configurations.
- On unseen shapes, HIP-to-HIP and Triton-to-Triton transfer well; PyTorch-to-HIP shows substantial correctness drops.
- The failure mode is shape-specific hardcoding: agents that generate kernels from scratch bake in input-shape assumptions.
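The shape-hardcoding failure is easy to reproduce in miniature. The sketch below uses NumPy in place of a real GPU kernel and a hypothetical row-softmax task; it is not taken from the benchmark. One "optimized" version bakes in the single row length it saw, and a small harness checks both versions on seen and unseen shapes.

```python
import numpy as np

def reference_softmax(x):
    """Ground truth: softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_hardcoded(x):
    """Bakes in the one shape seen during optimization: rows of length 128."""
    flat = x.reshape(-1, 128)
    e = np.exp(flat - flat.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)).reshape(x.shape)

def softmax_parameterized(x):
    """Reads the row length from the input instead of assuming it."""
    width = x.shape[-1]
    flat = x.reshape(-1, width)
    e = np.exp(flat - flat.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)).reshape(x.shape)

def passes(kernel, shape, seed=0):
    """Correctness gate: crashes and wrong numbers both count as failures."""
    x = np.random.default_rng(seed).standard_normal(shape)
    try:
        return bool(np.allclose(kernel(x), reference_softmax(x)))
    except ValueError:
        return False

for shape in [(4, 128), (2, 256), (3, 100)]:
    print(shape, passes(softmax_hardcoded, shape), passes(softmax_parameterized, shape))
```

The (2, 256) case is the dangerous one: the hardcoded version runs without error and returns wrong values, which only a correctness check on unseen shapes catches.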
Gaps in the study Only AMD (HIP) and Triton targets. No CUDA-native Blackwell kernel tests. No composition with profiling tools that would give agents better signal during iteration.
Industrial implication Human kernel-authoring time on B200 and B300 is the primary bottleneck for porting research to production. A 6x agent speedup on seen shapes is large enough to justify production teams setting up AgentKernelArena harnesses. The shape-hardcoding failure is the deployment caveat. Frontier serving stacks have heterogeneous shapes, so a kernel that runs fast at one shape but breaks at adjacent ones is not deployable. The harness-level fix (parameterized generation, shape-aware prompting) is where the next 60 days of agent-kernel research will focus.
Connecting the Dots
The MoE design surface finishes as a four-axis stack in 72 hours. MoE-muP from 2026-05-17 (the Vankadara et al. paper that derived a closed-form Maximally Scale-Stable Parameterization across five MoE axes using Dynamical Mean Field Theory) gave the forward pre-training direction. BEAM from 2026-05-16 (binary expert-activation masking trained end-to-end with straight-through estimators, hitting 98%+ retention at 85% FLOP reduction with a custom vLLM kernel) gave the per-token activation direction. HodgeCover from 2026-05-18 (the simplicial-Laplacian harmonic-kernel paper showing that learning-free MoE compression hits an obstruction picked out by Hodge decomposition) gave the post-training resident-count direction. ZEDA today gives the post-training static-to-dynamic direction. Four orthogonal MoE knobs in three days while the frontier open MoE wave (Gemma 4, DeepSeek V4 Pro and Flash, Kimi K2.6, Qwen3.6, MiMo-V2.5-Pro) has already shipped without any of them. The integration paper that composes all four is one experiment away.
The KV-cache design surface finishes as a parallel four-axis stack, with its four entries landing between 2026-05-12 and today. Make Each Token Count from 2026-05-12 (the first wiki paper to formally claim that selective cache retention surpasses the full cache because long-context attention dilution makes the full cache suboptimal) covered learned eviction. Forcing-KV from 2026-05-15 (the autoregressive video diffusion paper that found attention heads cluster into static and dynamic functional roles tolerating different compression levels) covered head-role compression. FashionChameleon from 2026-05-18 (the training-free KV rescheduling for multi-conditioning that composes garment-KV refresh, historical-KV withdraw, and reference-KV disentangle) covered content-aware rescheduling. CompactAttention today covers chunked-prefill block-table construction. LongLive-2.0 today adds the NVFP4 precision axis. The text-side and video-side trajectories ran independently for a month and now converge on the same multi-axis map.
A deployment-calibration gap is now confirmed across five distinct domains in five days. WildClawBench on 2026-05-15 found 18-point harness spread on 60 long-horizon tasks. CurveBench on 2026-05-17 found that the top model on visual structural reasoning, Gemini 3.1 Pro, reaches 71.1% on Easy but only 19.1% on Hard, with RLVR lifting Qwen3-VL-8B from 2.8% to 33.3%. PAGER on 2026-05-18 found GUI agents at 88% action-type accuracy and under 6% task success on precision-sensitive geometric tasks. DiagnosticIQ on 2026-05-18 found top-3 LLMs within 1 Macro point on industrial-rule reasoning but 49-63% original-answer rate persisting under condition inversion. AgentKernelArena today shows agents reaching 6.89x mean speedup on PyTorch-to-HIP at seen shapes and substantial correctness drops at unseen ones. Five domains, same structural decoupling between in-distribution accuracy and out-of-distribution capability. The pattern threshold of three was crossed last week; the count is now five, and the failure mode (configuration-specific hardcoding that looks like capability gains) generalizes across long-horizon tasks, visual reasoning, GUI execution, industrial rules, and now GPU kernels.
Reasoning efficiency now has three orthogonal interventions on the same trajectory, all introduced inside 48 hours. NudgeRL from 2026-05-18 (strategy-context-conditioned rollouts that match vanilla GRPO at 8x larger rollout budgets via inter-and-intra-context reward decomposition) changes what gets generated. CIPO from 2026-05-18 (correction-oriented supervision that pairs failed prefixes with correct continuations from the same model's adjacent success rollouts, with the cleanest pass@K > pass@1 evidence to date) recycles the failures. PUMA today stops generation when reasoning converges, at 26.2% token reduction with preserved accuracy across five LRMs and five benchmarks. The composition is one experiment away, and the diagnostic is pass@K under the composed system versus the sum of individual contributions. Super-additive means three different bottlenecks; sub-additive means overlap.
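The super-additive versus sub-additive diagnostic is plain arithmetic on gains over a shared baseline. The sketch below shows the comparison; the scores in the example are made-up placeholders, not results from any of the three papers, and `classify_composition` is a hypothetical helper.

```python
def classify_composition(baseline, individual_scores, composed_score, tol=1e-9):
    """Compare the composed gain with the sum of individual gains over baseline."""
    expected_gain = sum(score - baseline for score in individual_scores)
    actual_gain = composed_score - baseline
    if actual_gain > expected_gain + tol:
        return "super-additive"
    if actual_gain < expected_gain - tol:
        return "sub-additive"
    return "additive"

# Illustrative placeholder pass@K values only.
baseline = 0.50
individual = [0.53, 0.54, 0.52]
print(classify_composition(baseline, individual, composed_score=0.62))
print(classify_composition(baseline, individual, composed_score=0.55))
```

A real comparison would also need confidence intervals on each pass@K estimate, since small gains can flip category under sampling noise.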
Layer-parallel inference joins the parallel-decoding family with SNLP today. Token-parallel methods (speculative decoding, Orthrus from 2026-05-14 with its dual-view diffusion on a shared cache producing bit-identical AR output at up to 7.8x, SDAR from 2026-05-15 with its sigmoid-gated self-distillation) have been the wiki's content for parallel decoding. SNLP opens layer-parallel within a single forward pass via Newton-style corrections with architecture-induced surrogate Jacobians. The HC Newton variant lines up with the multi-head Compressed family Raschka surveyed on 2026-05-17 (DeepSeek V4 mHC was specifically called out as one of the structurally novel May open-model architectures). The SNLP-aware regularization improving baseline PPL by 4.7-23.4% is the surprise, and it mirrors what Make Each Token Count showed at the KV-eviction layer: an inference-acceleration training objective that is also a quality-improving training objective. Two independent data points now suggest the inference-speed-vs-quality tradeoff is not as rigid as defaults assume.
Cursor's Chinese-open-weight base is now a stable pattern, not transitional. Composer 2 was built on Moonshot's Kimi K2. Composer 2.5 today is built on Kimi K2.5. The SpaceXAI follow-on at 10x compute on Colossus 2 is training a much larger model from scratch, but the current production frontier still sits on a Chinese open-weight base. The wiki tracked the same pattern on 2026-04-30 with Composer 2. For a frontier US coding agent shipping to enterprise customers under a heavy 89% Anthropic-plus-OpenAI revenue concentration, depending on a Chinese open-weight base for the production frontier is the most significant supply-chain claim the wiki tracks. Whether the from-scratch model with SpaceXAI is the path off Kimi or whether Composer 3 sits on Kimi K3 is the load-bearing 30-60 day signal.
Industry Pulse
- Cursor ships Composer 2.5 on Kimi K2.5 with SpaceXAI follow-on at 10x compute (Cursor blog · The Decoder). Second consecutive Cursor frontier coding model on a Chinese open-weight base, matching Opus 4.7 and GPT-5.5 at a fraction of the cost.
- NVIDIA Vera CPU hand-delivered to Anthropic, OpenAI, SpaceXAI, Oracle Cloud (NVIDIA blog). VP Ian Buck personally delivered the first Vera chips, NVIDIA's first custom CPU purpose-built for agentic AI workloads.
- Anthropic ships Claude Code Fast Mode default on Opus 4.7 plus prompt-cache diagnostics (fast-mode docs · cache diagnostics). 2.5x speed at higher per-token rate, plus the most-requested cache-miss diagnostic feature for high-cache-rate workloads.
- Anthropic to brief global financial regulators on cyber flaws found by Claude Mythos Preview (The Decoder). First wiki entry of an AI lab acting as security-research stakeholder at the national-regulator level.
- AI startup revenue hits $80B with Anthropic plus OpenAI capturing 89% (The Decoder via The Information). Foundation-model concentration widens further; vertical AI startups continue to struggle on moats.
- Pope Leo XIV to present first AI encyclical on 2026-05-25 with Anthropic co-founder Chris Olah as guest speaker (The Decoder). First AI encyclical in Catholic history, with an interpretability-research link via Olah's invitation.
- Elon Musk appeals $134B OpenAI loss to 9th Circuit (The Decoder). Musk frames the time-bar finding as a "calendar technicality" because the for-profit conversion ran in three stages.
- MAGA-aligned coalition asks Trump for mandatory frontier-AI safety testing executive order (The Decoder). Second concrete policy push on this axis since 2026-05-13.
Worth Watching
- ZEDA applied to the full open-MoE wave. 30-60 days. Falsifiable: apply to Gemma 4 26B-A4B, DeepSeek V4 Flash, Kimi K2.6, MiMo-V2.5-Pro. If FLOP-reduction-at-marginal-loss holds on three of four, post-training static-to-dynamic conversion is robust. If it collapses on one, the failure identifies a specific expert-specialization fragility.
- Full four-axis MoE integration paper. 30-60 days. Composition is MoE-muP MSSP-scaled pre-train + BEAM per-token activation + HodgeCover resident compression + ZEDA dynamic conversion. End-to-end FLOPs at preserved accuracy versus the maximum of individual contributions is the load-bearing test.
- CompactAttention plus head-role compression plus learned eviction on one forward pass. 30 days. Diagnostic: do the speedups stack, or do all three share a memory-bandwidth bottleneck.
- EndPrompt at 128K and 256K. 30-60 days. Paper reports only 8K-to-64K. Whether the compute discount holds or grows at longer targets decides whether every open-model release adopts it.
- SNLP-aware regularization in a frontier pre-training run. 60-90 days. The 4.7-23.4% PPL improvement plus 2.3x wall-clock at Nanochat scale is large enough that a major lab adopting it in the next foundation-model run is plausible. Watch the next frontier release for any layer-parallel-inference claim.
- PUMA composed with CIPO and NudgeRL. 30 days. Single experiment. Pass@K under composition versus sum of individuals: super-additive means three bottlenecks, sub-additive means overlap.
- NVFP4 stack adopted for text LLMs on Blackwell. 60-90 days. LongLive-2.0 is the video template. Whether a text LLM ships with the same NVFP4 W4A4 plus NVFP4 KV cache plus asynchronous decoding stack is the wider-deployment signal.
- AgentKernelArena shape-hardcoding fix. 60 days. The PyTorch-to-HIP unseen-shape correctness drop is the failure mode. Whether parameterized generation or shape-aware prompting closes the gap is the harness-level test.
Also today
- Lance (arXiv 2605.18678): dual-stream MoE for unified multimodal modeling, beating prior unified open-source models on image and video generation.
- KVPO (arXiv 2605.14278): ODE-native GRPO for AR video alignment that routes exploration through the KV cache, third independent KV-as-alignment-substrate use this month.
- AtlasVA (arXiv 2605.17933): three-layer visual skill memory for teacher-free VLM agents with self-evolving danger and affinity atlases.
- SkillsVote (arXiv 2605.18401): lifecycle governance of agent skills lifting GPT-5.2 by 7.9pp on Terminal-Bench 2.0 and 2.6pp on SWE-Bench Pro.
- Code as Agent Harness (arXiv 2605.18747): survey framing code as operational substrate for agent reasoning, acting, and verification.
- NGM (arXiv 2605.16893): training-free zero-parameter memory module lifting Qwen3 averages by 0.5-1.2 points across 8 benchmarks via causal N-gram encoding plus cosine-gated injection. → summary
- MixSD (arXiv 2605.16865): mixed contextual self-distillation for knowledge injection, retaining up to 100% of held-out base capability where standard SFT retains as little as 1%. → summary
- Maximum activations in open LLMs (arXiv 2605.15572): activation maxima span 4 orders of magnitude across 27 checkpoints, with MoE peaks 14-23x lower than dense counterparts. → summary
- DiHAL (arXiv 2605.14368): first principled selection rule for where diffusion should enter a pretrained transformer via geometry-based proxies on hidden states. → summary
- Monitoring the Internal Monologue (arXiv 2605.18549): probe-trajectory features across CoT reach 95% AUROC for LRM safety monitoring where single-token pooling collapses. → summary
- AI for Auto-Research roadmap (arXiv 2605.18661): four-phase survey finding generated ideas degrade after implementation and research code lags pattern-matching benchmarks. → summary
Sources ingested today: HF (34 papers; 9 substantive), Gmail (2 starred), RSS (19 entries), Twitter morning slot (35 tweets, 7 articles), Kurate cs.AI + cs.LG + rising authors, Reddit (8 subs). Wiki pages updated: 13.