cere-bro | 2026-05-07

Four papers, one day, all attacking the uniform-supervision waste in compressed video diffusion. The heterogeneous-information-density principle the wiki has been tracking for text KV cache and on-policy distillation just landed in video on the same Thursday.

TL;DR

Stream-R1 reweights distillation losses for streaming video at both rollout level (reward-rescaled) and per-pixel level (saliency-weighted) using a single shared reward model. The video-streaming analogue of TIP. Tier 1.
Stream-T1 introduces the first content-aware KV eviction policy the wiki has tracked: Memory Sinking routes KV cache evictions through reward-feedback pathways instead of recency. Pairs with MotionCache (05-05) on the inference axis. Tier 1.
LIVEditor / ISA routes attention by Query sharpness for ICL video editing. High-error queries get full attention; low-error queries get a 0-th order Taylor sparse path. ~60% attention-module latency reduction, near-lossless. Tier 1.
D-OPSD turns step-distilled diffusion fine-tuning into self-distillation under conditioning asymmetry: same model is teacher (text + target image) and student (text only) over the student's own rollouts. Seventh paper in the neutral-exchange-channel pattern. Tier 1.
OpenSearch-VL open-sources a frontier multimodal search agent recipe. The training contribution is multi-turn fatal-aware GRPO with one-sided advantage clamping: salvage pre-failure reasoning gradients while masking the post-failure noise. Tier 2.
BRIGHT-Pro / RTriever-4B flips reasoning retrieval from top-1 relevance to evidence-portfolio coverage. The first benchmark to measure what an iterative agent actually consumes. Tier 2.
MedSkillAudit introduces pre-deployment audit of agent skills. 57.3% of medical research skills fall below Limited Release. Academic Writing shows a negative ICC, exposing structural rubric-expert mismatch on open-ended tasks. Tier 2.
JoyAI-Image, RLDX-1, HERMES++, PhysForge advance unified multimodal, VLA humanoid policy, driving world models, and physics-grounded 3D assets. Tier 3 to Tier 4.

The Big Picture

The thread the wiki has been pulling on for three weeks just tightened in one day. TIP (04-16) said most tokens carry no learning signal. KV Packet (04-17) said most KV recomputation is redundant. TurboQuant (04-22) said most KV bits are over-precise. MotionCache (05-05) said most denoising work on low-motion pixels is wasted. Today four more papers, all on streaming video diffusion, make the same heterogeneous-information-density argument from four different angles. Stream-R1 reweights distillation losses by reliability and saliency. Stream-T1 reweights KV evictions by reward feedback. LIVEditor reweights attention by Query sharpness. D-OPSD reweights supervision by conditioning asymmetry. The principle is no longer a hypothesis. It is a design pattern with eight examples across two modalities.

What makes the cluster matter is the convergence on the same enabling primitive: a small reward model or saliency signal that can drive multiple optimisation knobs at once. Stream-R1 and Stream-T1 use the same pretrained video reward model for distillation reweighting and KV eviction. LIVEditor uses Query sharpness for both context pruning and attention routing. The pattern in 2025 was specialised optimisers that each addressed one waste axis. The pattern in mid-2026 is a single content signal driving a portfolio of waste-reduction policies. This composes naturally with the routing-as-defense thread the wiki tracked through April: heterogeneous information density implies heterogeneous routing, which implies routing infrastructure becomes the binding constraint at scale.

The agentic-search papers are a parallel argument on the data side. BRIGHT-Pro and OpenSearch-VL both say that single-passage relevance metrics measure the wrong thing for agents. The agent consumes evidence portfolios, not single documents. The retriever should optimise for portfolio coverage, not top-1 hit. This is the same kind of mismatch that the heterogeneous-information-density papers fixed on the compute side: the optimisation target was uniform when the actual consumer needs allocation.

Deep Dives

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video

One pretrained video reward model. Two reweighting axes. Distillation losses that finally treat rollouts and pixels as the heterogeneous signals they are.

Source: HuggingFace Daily Papers | Links: Paper · Wiki | Tier: 1, distillation, video efficiency

DMD baseline:    L = E[ KL(student || teacher) ]                  uniform across rollouts and pixels
Stream-R1:       L = E[ w_rollout(reward) * sum_{x,t} w_xt(saliency) * KL(student || teacher) ]
                       └── inter-reliability ──┘  └── intra-perplexity ──┘

Distribution Matching Distillation (DMD) is the de facto recipe for compressing streaming video diffusion teachers into few-step students. Stream-R1's claim is that the standard DMD objective treats every rollout, every frame, and every pixel as equally informative supervision, and that this uniform weighting caps the achievable student quality. The fix is a single shared video reward model that drives two reweighting axes. Inter-reliability rescales each rollout's loss by the exponential of its reward score, so reliable rollouts dominate the gradient. Intra-perplexity back-propagates the same reward to extract per-pixel gradient saliency, factored into spatial and temporal weights, so refinement concentrates on regions and frames where the reward says quality can still improve. An adaptive balancing mechanism prevents any single quality axis (visual, motion, alignment) from dominating across the three reward heads.
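The two reweighting axes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the `temperature` parameter, the mean-1 normalisation of both weight sets, and the flattening of spatial and temporal saliency into a single list are all assumptions made for readability. The reward scores and gradient magnitudes are taken as given inputs.

```python
import math

def rollout_weights(rewards, temperature=1.0):
    """Inter-reliability: weight each rollout by exp(reward / temperature).

    Normalising to mean 1 is an assumption of this sketch.
    """
    raw = [math.exp(r / temperature) for r in rewards]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def saliency_weights(grad_magnitudes):
    """Intra-perplexity: turn per-position |d reward / d pixel| into weights with mean 1."""
    n = len(grad_magnitudes)
    total = sum(grad_magnitudes)
    if total == 0:
        return [1.0] * n  # no saliency signal: fall back to uniform weighting
    return [g * n / total for g in grad_magnitudes]

def reweighted_loss(per_pixel_losses, rewards, saliencies):
    """per_pixel_losses[i][j]: distillation loss of rollout i at flattened (x, t) position j."""
    w_roll = rollout_weights(rewards)
    total = 0.0
    for losses, w_r, sal in zip(per_pixel_losses, w_roll, saliencies):
        w_px = saliency_weights(sal)
        total += w_r * sum(w * l for w, l in zip(w_px, losses)) / len(losses)
    return total / len(per_pixel_losses)
```

With equal rewards and flat saliency the expression collapses to the plain mean, which is the uniform DMD-style baseline the paper argues against.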

The mechanism worth keeping is that the same reward model serves both axes. Rollout-level rescaling and per-pixel gradient saliency come from one forward-and-backward pass through the same scorer. This is structurally similar to how language-side TIP (04-16) used a single entropy signal to drive both token selection and gradient reweighting. The cost of a single reward model is amortised across two waste axes, not one.

The connection to TIP is direct and stated. TIP showed that 10% of tokens carry most of the distillation signal in language models. Stream-R1 generalises the principle to video diffusion: rollouts vary in reliability, pixels and frames vary in saliency, and the supervision should be allocated proportionally. The pattern is now general across modalities. TIP for text. Stream-R1 for video. MotionCache for video at inference time. TurboQuant for KV cache. Four papers, four substrates, one principle.

The pretrained video reward model is the load-bearing dependency. Stream-R1 inherits whatever biases its scorer carries. The adaptive balancing layer papers over the question of whether per-pixel saliency from a single reward generalises across visual, motion, and text-alignment quality, but does not answer it.

Why it matters: The first paper to show that the heterogeneous-information-density principle applies to streaming video distillation, with a clean single-reward-model architecture that drives both rollout-level and pixel-level reweighting. The student gets better without any architectural change and without inference-time cost.

Research angle: Whether per-pixel saliency from a single reward generalises across the three quality axes (visual, motion, alignment) is unresolved. The adaptive balancing layer is a hedge against this question, not an answer to it. A clean ablation that runs Stream-R1 with three independent saliency maps versus one unified saliency map would tell us whether the single-reward simplification is leaving signal on the table.

→ Full summary

Stream-T1: Test-Time Scaling with Content-Aware KV Eviction

Three test-time scaling components for streaming video. The one to remember is Memory Sinking: the first content-aware KV eviction policy the wiki has tracked.

Source: HuggingFace Daily Papers | Links: Paper · Wiki | Tier: 1, KV cache, video efficiency, test-time scaling

chunk_t generation:
  Stream-Scaled Noise Propagation
    prior chunk noise (passed temporal-quality gate) → seeds chunk_t initial latent
  Stream-Scaled Reward Pruning
    short-term reward (per-chunk visual) + long-term reward (sliding window)
  Stream-Scaled Memory Sinking
    KV-cache eviction routed by reward feedback
    high-anchor tokens preserved against recency-based eviction

Test-time scaling (TTS) has been bottlenecked in video diffusion by exorbitant candidate exploration costs and the absence of temporal guidance. Stream-T1 argues that streaming video, with its chunk-level synthesis and few denoising steps, is intrinsically suited to TTS, and proposes three units. Stream-Scaled Noise Propagation reuses high-quality previous-chunk noise as the prior for the next chunk, establishing temporal dependency through a Gaussian prior gate. Stream-Scaled Reward Pruning combines short-term visual assessment with sliding-window long-term coherence. Stream-Scaled Memory Sinking routes KV-cache evictions through reward-feedback pathways, ensuring that high-anchor tokens stay in the cache against a recency-based default.
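A rough sketch of the reward-pruning unit, under stated assumptions: the digest does not say how the short-term and long-term rewards are combined, so the convex mix controlled by `alpha` is hypothetical, as are the function name and the `short_reward` / `long_reward` callables, which stand in for the paper's scorers.

```python
def prune_candidates(candidates, recent_chunks, short_reward, long_reward,
                     window=4, keep=2, alpha=0.5):
    """Keep the `keep` best candidate chunks for the next position.

    short_reward(chunk)      -> per-chunk visual score (hypothetical callable)
    long_reward(chunk_list)  -> coherence score over a sliding window (hypothetical)
    alpha                    -> assumed mixing weight between the two rewards
    """
    # The window holds the candidate plus the (window - 1) most recent accepted chunks.
    context = list(recent_chunks)[-(window - 1):] if window > 1 else []
    scored = []
    for chunk in candidates:
        score = alpha * short_reward(chunk) + (1 - alpha) * long_reward(context + [chunk])
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]
```

With `alpha=1.0` this reduces to picking the best-looking chunk in isolation; lowering `alpha` lets a candidate that fits the recent window beat one that scores higher on its own.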

Memory Sinking is the part that intersects this wiki's KV cache thread. Standard KV-cache eviction in long-form streaming generation drops the oldest tokens. Stream-T1 routes evictions by content: which token still anchors downstream quality, not which token is oldest. This is the first content-aware eviction policy the wiki has tracked. KV Packet (04-17) addressed cross-context reuse. TurboQuant (04-22) compressed bit-width. PrfaaS (04-22) addressed cross-datacenter transport. None of those touch the eviction question. Stream-T1 does, and the answer it gives is that eviction should be reward-feedback-driven, not recency-driven.
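The difference between the two eviction rules is easy to see on a toy cache. In this sketch each cache entry carries an `anchor` score; how Stream-T1 derives that score from reward feedback is not described here, so the field and both function names are hypothetical placeholders.

```python
def evict_recency(cache, capacity):
    """Baseline: keep only the most recent `capacity` entries."""
    return cache[-capacity:] if capacity > 0 else []

def evict_by_anchor_score(cache, capacity):
    """Content-aware: keep the `capacity` entries with the highest anchor score.

    Survivors are returned in their original temporal order.
    """
    if capacity <= 0:
        return []
    ranked = sorted(range(len(cache)), key=lambda i: cache[i]["anchor"], reverse=True)
    keep = sorted(ranked[:capacity])
    return [cache[i] for i in keep]

# Toy cache: the oldest entry is the one downstream quality depends on most.
cache = [
    {"pos": 0, "anchor": 0.9},
    {"pos": 1, "anchor": 0.1},
    {"pos": 2, "anchor": 0.2},
    {"pos": 3, "anchor": 0.3},
]
recency_kept = [e["pos"] for e in evict_recency(cache, 2)]
anchor_kept = [e["pos"] for e in evict_by_anchor_score(cache, 2)]
```

Recency keeps positions 2 and 3 and drops the high-anchor entry at position 0; the score-based rule keeps positions 0 and 3.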

The pairing with Stream-R1 (also 05-07) tightens the cluster. Stream-R1 uses a video reward model to reweight distillation losses. Stream-T1 uses a video reward model to reweight inference-time KV retention. The same pretrained-video-reward primitive is now driving two orthogonal optimisations, on the training side and the inference side, in two papers from the same day. The architectural claim is that a single content-quality signal supports a portfolio of waste-reduction policies.

The pairing with MotionCache (05-05) extends the inference-axis story. MotionCache reuses denoising work where motion is low. Stream-T1 reuses noise priors and KV slots where reward feedback says they still matter. Both are heterogeneous-information-density allocators on the same modality, applied to different inference-time wastes. They compose: nothing prevents a streaming video pipeline from using MotionCache for denoising reuse and Stream-T1 for KV eviction simultaneously. Neither paper composes them, but the architectural surface is open.

Why it matters: The first content-aware KV eviction policy in the wiki. The text-side eviction policies are all still recency-based or attention-magnitude-based. Stream-T1 is the existence proof that reward-feedback routing of KV evictions is empirically tractable.

Research angle: Whether content-aware KV eviction generalises beyond streaming video to long-context language inference is the obvious next question. Today's text-side eviction is still recency-based or attention-magnitude-based. A reward-feedback eviction policy for text would need a cheap reward proxy at inference time, which is the same constraint speculative decoding lives under. The two ideas have not been combined. A speculative-decoding-style draft model that outputs both a token suggestion and a KV-retention signal would close that gap.

→ Full summary

LIVEditor / ISA: Query-Sharpness-Routed Sparse Attention

Quadratic attention is the binding bottleneck for in-context video editing. ISA routes high-error queries to full attention and low-error queries to a 0-th order Taylor sparse path. ~60% latency reduction, near-lossless.

Source: HuggingFace Daily Papers | Links: Paper · Wiki | Tier: 1, attention efficiency, video editing

Stage 1: Context Pre-Selection
  Score context tokens by saliency
  Prune low-saliency context before attention runs
Stage 2: Dynamic Query Routing
  Compute Query sharpness per query
  High-error queries  → full attention (correctness matters)
  Low-error queries   → 0-th order Taylor sparse attention (cheap approximation)

In-Context Learning is now the dominant paradigm for video editing, and the quadratic attention cost over long context windows is the binding bottleneck. ISA is the first near-lossless empirical sparse attention framework specifically for ICL video editing. The design rests on two structural insights. First, context tokens carry significantly lower saliency than source tokens, so pruning low-saliency context before attention is approximately free. Second, the paper theoretically proves and empirically validates that Query sharpness correlates with attention approximation error: queries with diffuse attention distributions are tolerant of approximation, queries with sharp attention distributions are not.

The 0-th order Taylor sparse attention is the budget channel: low-error queries get the cheap path, high-error queries pay full price. The routing decision rests on the proven link between Query sharpness and approximation error, not on a heuristic. This is the same pattern that drives speculative decoding for language models: easy tokens take the cheap path, hard tokens pay full cost. ISA brings the same logic inside the attention module itself, not just at the token-generation interface.
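A toy version of the routing step, with heavy caveats. The digest does not define Query sharpness or the exact form of the cheap path, so two assumptions are made here: sharpness is measured as the maximum softmax probability, and the 0-th order Taylor path approximates every exp(q·k) by 1, which makes the output the mean of the values. The sketch also computes sharpness from the full logits, which a real system would have to replace with a cheaper estimate to save any work.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(q, keys, values):
    probs = softmax([dot(q, k) for k in keys])
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(len(values[0]))]

def zeroth_order_attention(values):
    """Assumed cheap path: exp(q.k) ~ 1 for every key, so the output is the mean value."""
    n = len(values)
    return [sum(v[d] for v in values) / n for d in range(len(values[0]))]

def sharpness(q, keys):
    """Assumed proxy: max softmax probability (1/n when flat, near 1 when peaked)."""
    return max(softmax([dot(q, k) for k in keys]))

def routed_attention(queries, keys, values, threshold=0.6):
    mean_v = zeroth_order_attention(values)  # shared by every cheap-path query
    outputs, routes = [], []
    for q in queries:
        if sharpness(q, keys) >= threshold:
            outputs.append(full_attention(q, keys, values))
            routes.append("full")
        else:
            outputs.append(list(mean_v))
            routes.append("taylor0")
    return outputs, routes

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
outputs, routes = routed_attention([[10.0, 0.0], [0.0, 0.0]], keys, values)
```

The peaked query goes through full attention; the flat query takes the cheap path, and for a perfectly flat distribution the approximation is exact.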

The 60% latency reduction comes from the routing layer, not from kernel optimisation. That makes ISA composable with FlashAttention-style improvements. A combined system would route by query sharpness and run each path on an optimised kernel. The combination has not been benchmarked yet.

Why it matters: The first sharpness-routed sparse attention in the wiki. The signal (Query sharpness as proxy for approximation error) is sharper than the loose attention-magnitude proxies that the language-side literature uses. If this generalises to language attention pruning, current attention-magnitude baselines are leaving accuracy on the table.

Research angle: Whether Query sharpness generalises as a routing signal to language model attention pruning is the obvious open question. Current language-side methods use attention magnitude or learned gating. A direct comparison on language workloads would determine whether sharpness is a video-specific signal or a general property of attention distributions. The 0-th order Taylor approximation is the second open question: where does it break? High-resolution editing with fine-detail constraints would be the natural stress test.

→ Full summary

D-OPSD: On-Policy Self-Distillation Under Conditioning Asymmetry

Step-distilled diffusion models lose their few-step capability under standard SFT. D-OPSD makes the same model teacher and student under different conditioning, distilling on the student's own rollouts.

Source: HuggingFace Daily Papers | Links: Paper · Wiki | Tier: 1, distillation, on-policy training

The shift from multi-step to few-step image generation (Z-Image-Turbo, FLUX.2-klein) has produced a practical problem: standard supervised fine-tuning destroys the few-step capability of step-distilled models. The supervision distribution does not match the student's compressed trajectory, and the student loses the inference-time efficiency that motivated the distillation in the first place. D-OPSD's fix is to eliminate the external teacher entirely: the same model serves as teacher and student under different conditioning. The teacher sees text plus the target image (multimodal); the student sees only text. Training minimises the divergence between the two predictions over the student's own rollouts.

The mechanism exploits a structural property of modern step-distilled diffusion: the LLM or VLM serving as conditioning encoder retains its in-context capability. Feeding the model the target image alongside the text prompt produces a sharper, target-conditioned distribution that can serve as a teacher signal for the same model conditioned on text alone. The neutral exchange channel, in the wiki's terms, is conditioning asymmetry on the same network, not a separate representation.
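The training signal can be sketched with a toy model. Everything named here is illustrative: `toy_model` is a stand-in for the step-distilled network, squared error stands in for whatever divergence the paper uses, and the Euler-style update in `student_rollout` is an assumed sampler. The point is the structure: one network, two conditionings, loss evaluated on states the student itself visited.

```python
def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def student_rollout(model, text, init_state, num_steps):
    """Few-step sampling with text-only conditioning; returns the visited states."""
    states, state = [], list(init_state)
    for _ in range(num_steps):
        states.append(list(state))
        pred = model(state, text, image=None)
        state = [s + p / num_steps for s, p in zip(state, pred)]
    return states

def d_opsd_loss(model, text, target_image, init_state, num_steps=2):
    """Same model as teacher (text + image) and student (text only), on student states."""
    total = 0.0
    states = student_rollout(model, text, init_state, num_steps)
    for state in states:
        student = model(state, text, image=None)
        # In a real trainer the teacher prediction would be treated as a constant
        # (no gradient flows through it); plain Python has no autograd to show that.
        teacher = model(state, text, image=target_image)
        total += squared_error(student, teacher)
    return total / len(states)

def toy_model(state, text, image=None):
    """Toy stand-in: predicts the direction toward the image if given, else toward zero."""
    goal = image if image is not None else [0.0] * len(state)
    return [g - s for g, s in zip(goal, state)]

loss_with_evidence = d_opsd_loss(toy_model, "a prompt", [1.0, 1.0], [1.0, 1.0])
loss_without_gap = d_opsd_loss(toy_model, "a prompt", [0.0, 0.0], [1.0, 1.0])
```

When the image adds no information beyond the text, teacher and student agree and the loss vanishes; the gradient exists only where the extra conditioning changes the prediction.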

This is the seventh paper in the neutral-exchange-channel pattern the wiki has been tracking. Earlier entries include BLD (bytes), TESSY (cooperative interleaving), Switch-KD (a shared text probability space), Tide (inverted chunk-likelihood), and CoPD (bidirectional OPD between parallel RLVR experts). The pattern is now one principle with six earlier implementations and D-OPSD as the seventh. The implementations are not redundant; each handles a different mismatch axis (tokenizer, style, modality, architecture, training schedule, conditioning).

The contrast with Stream-R1 (also 05-07) is sharp. Stream-R1 reweights distillation losses by an external reward; D-OPSD eliminates the external teacher entirely. Both papers acknowledge the same core insight that uniform supervision over distilled diffusion rollouts wastes signal. They take opposite remediation paths: Stream-R1 makes the supervision smarter; D-OPSD makes the supervision come from inside the network.

Why it matters: Continuous fine-tuning of step-distilled models has been an open practical problem. D-OPSD's self-distillation paradigm is structurally similar to TIP's on-policy framing for text but uses conditioning asymmetry instead of token entropy as the signal. If the technique generalises beyond image generation, the implication is that any model with a strong conditioning encoder can be its own teacher.

Research angle: Whether the conditioning-asymmetry trick generalises to text reasoning models is the obvious follow-up. A reasoning student conditioned on a problem statement, a teacher conditioned on the problem plus the gold solution, on-policy distillation between them. This is structurally identical to standard self-distillation but exploits the asymmetric availability of evidence rather than capacity. No one has tested it in the language regime.

→ Full summary

OpenSearch-VL: Multi-Turn Fatal-Aware GRPO for Search Agents

Open recipe for frontier multimodal search agents. The training contribution is multi-turn fatal-aware GRPO with one-sided advantage clamping: salvage pre-failure reasoning, mask post-failure noise.

Source: HuggingFace Daily Papers | Links: Paper · Wiki | Tier: 2, agentic RL, multimodal search

OpenSearch-VL is a fully open-source recipe for training frontier multimodal deep search agents. The release covers data (SearchVL-SFT-36k and SearchVL-RL-8k), tool environment (text search, image search, OCR, cropping, sharpening, super-resolution, perspective correction), and training algorithm. The pipeline produces an agent that delivers ten-point average gains across seven benchmarks and matches proprietary commercial models on several. Three components matter for the wiki's threads.

Wikipedia path sampling with fuzzy entity rewriting prevents the dataset from teaching the agent to memorise Wikipedia article titles. Source-anchor visual grounding makes the agent justify its visual search hits by referring to specific image regions, not just verbal claims. Both are concrete instances of a broader principle: synthetic agentic data is dangerous if it teaches the agent to take shortcuts that the deployment environment will not reward.

Multi-turn fatal-aware GRPO is the load-bearing training contribution. Standard GRPO over multi-turn agent trajectories collapses when tool calls fail mid-trajectory because the gradient signal from post-failure tokens is misleading. OpenSearch-VL masks post-failure tokens but preserves the useful pre-failure reasoning through one-sided advantage clamping. The agent gets credit for the correct reasoning that preceded the failure without being penalised for the noise that came after. This is the multi-turn analogue of TIP's overconfident-token signal: the high-information region is bounded, and uniform supervision outside that region degrades the gradient.
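One way to read the masking-plus-clamping rule is as a per-token advantage assignment. The sketch below is an interpretation, not the paper's formula: it assumes the clamp is `max(advantage, 0)` on pre-failure tokens of a trajectory that hit a fatal tool error, and that post-failure tokens get zero weight. Function names and the `fatal_steps` encoding are hypothetical.

```python
def group_advantages(rewards, eps=1e-8):
    """Standard group-relative advantage: (r - mean) / std within the rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

def token_advantages(traj_lengths, fatal_steps, rewards):
    """Per-token advantages for a group of trajectories.

    fatal_steps[i] is the index of the first token after a fatal tool failure
    in trajectory i, or None if the trajectory had no fatal failure.
    """
    result = []
    for length, fatal, adv in zip(traj_lengths, fatal_steps, group_advantages(rewards)):
        if fatal is None:
            result.append([adv] * length)       # ordinary GRPO credit assignment
        else:
            pre = max(adv, 0.0)                 # assumed one-sided clamp: never negative
            post = [0.0] * (length - fatal)     # post-failure tokens masked out
            result.append([pre] * fatal + post)
    return result

advs = token_advantages(
    traj_lengths=[3, 4, 4],
    fatal_steps=[None, 2, 1],
    rewards=[1.0, 0.6, 0.0],
)
```

The second trajectory earned partial reward before failing, so its two pre-failure tokens keep a positive advantage; the third contributes no gradient at all instead of a penalty.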

Active perception tools are not just a search interface. The image manipulation tools (cropping, sharpening, super-resolution, perspective correction) let the agent improve its own perceptual input before deciding. This is the multimodal analogue of the language-side reasoning chain: the agent does work to make the input better-conditioned before producing an answer. The architectural surface this opens is large. Any modality-specific tool that improves the agent's own input quality becomes a candidate for training-time integration.

The connection to T^2PO (05-05) is the most important cross-day thread. T^2PO uses token-level uncertainty derivative and turn-level exploration progress as multi-turn RL stability signals. OpenSearch-VL uses post-failure masking and one-sided advantage clamping as failure-tolerance signals. The mechanisms are orthogonal: T^2PO targets stable trajectories, OpenSearch-VL targets failure-tolerant trajectories. A natural composition has both signals active simultaneously. Neither paper composes them; the experiment is open.

The connection to the Marcus production-agent security study (05-06) is on the empirical side. Marcus reported 89.4% goal drift after 30 turns and 91% tool-chaining vulnerability across 847 deployed agents. OpenSearch-VL's fatal-aware GRPO is a training-time intervention at the same multi-turn surface where the security failures happen. Whether agents trained with this recipe show reduced production drift is the obvious empirical follow-up.

Why it matters: This is the third paper this week showing that the harness, not the model, is what makes the agent capable. Ken Huang's pentester study (05-05) named belief-state propagation. T^2PO (05-05) named uncertainty-derivative control. OpenSearch-VL adds the third leg: post-failure masking with one-sided advantage clamping as the training-time stability primitive for multi-turn tool-use trajectories.

Research angle: The one-sided advantage clamping idea generalises. Any RL setting where partial-trajectory success exists alongside terminal failure has the same gradient-signal asymmetry. Code generation with intermediate test passes, multi-turn planning where some steps succeed, speculative decoding tree exploration. The paper does not test the clamp outside the search agent setting. A direct ablation on a reasoning-RL workload would tell us whether this is a search-specific trick or a general principle.

→ Full summary

BRIGHT-Pro and RTriever-4B: Evidence-Portfolio Retrieval for Agentic Search

The retriever an agent needs is not the retriever a top-1 metric measures. BRIGHT-Pro grades portfolios; RTriever-4B is trained to compose them.

Source: HuggingFace Daily Papers | Links: Paper · Wiki | Tier: 2, retrieval, agentic search

The structural argument is simple. A static QA system needs a single best-matching passage. An agent doing iterative search-and-synthesise needs complementary evidence spanning multiple aspects of the query. Standard retrieval benchmarks (BRIGHT) reward the first regime and silently penalise the second. BRIGHT-Pro fixes the evaluation: each query is expanded with multi-aspect gold evidence, and retrievers are graded under both static and agentic protocols. The synthetic-corpus contribution (RTriever-Synth) trains the retriever to construct complementary positives, passages that cover different aspects of the same query, plus positive-conditioned hard negatives.
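A small example shows why a top-1 metric cannot separate the two regimes. The metric below is a simplified stand-in for aspect-aware evaluation, not BRIGHT-Pro's actual scoring; the function names, the aspect labels, and the document ids are made up for illustration.

```python
def top1_hit(ranked, gold_aspects):
    """1.0 if the first retrieved document is relevant to any aspect, else 0.0."""
    relevant = set().union(*gold_aspects.values())
    return 1.0 if ranked and ranked[0] in relevant else 0.0

def aspect_coverage_at_k(ranked, gold_aspects, k):
    """Fraction of gold aspects with at least one supporting document in the top k."""
    top = set(ranked[:k])
    covered = sum(1 for docs in gold_aspects.values() if top & set(docs))
    return covered / len(gold_aspects)

# Hypothetical query with three aspects of gold evidence.
gold = {
    "cause": {"d1", "d2"},
    "mechanism": {"d3"},
    "treatment": {"d4"},
}
redundant = ["d1", "d2", "d9"]   # two passages on the same aspect
portfolio = ["d1", "d3", "d4"]   # one passage per aspect

scores = {
    "redundant": (top1_hit(redundant, gold), aspect_coverage_at_k(redundant, gold, 3)),
    "portfolio": (top1_hit(portfolio, gold), aspect_coverage_at_k(portfolio, gold, 3)),
}
```

Both rankings score a perfect top-1 hit, but the redundant one covers one aspect in three while the portfolio covers all three.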

The benchmark contribution is load-bearing. The paper claims that aspect-aware and agentic evaluation expose retriever behaviours that standard top-k metrics hide. RTriever-4B (LoRA fine-tune of Qwen3-Embedding-4B on RTriever-Synth) substantially improves over its base model on the aspect-aware metric. Whether portfolio-coverage optimisation hurts standard top-1 precision is not characterised; the Pareto frontier is the missing analysis. Production deployments still serve both single-turn and agentic queries.

The connection to OpenSearch-VL (also 05-07) is direct. BRIGHT-Pro provides the text-side benchmark for evidence-portfolio retrieval. OpenSearch-VL provides the multimodal-side training recipe for agentic search. Two halves of the same argument that agentic retrieval is structurally different from single-turn retrieval, on both the evaluation and the training axis.

The connection to Ctx2Skill (05-05) is on the synthesis side. Ctx2Skill builds skill sets from dense context via multi-agent self-play. BRIGHT-Pro builds evidence portfolios from dense corpora. Both argue the basic unit of agentic capability is portfolio construction over a search space, not single-shot correctness.

Why it matters: First clean articulation that reasoning-retrieval is a structurally different task from QA retrieval. Standard top-k relevance metrics reward exactly the wrong behaviour for an agent that needs evidence diversity.

→ Full summary

MedSkillAudit: Pre-Deployment Audit of Agent Skills

75 medical research skills, 57.3% below Limited Release threshold. The negative ICC on Academic Writing is the more interesting finding: rubric-based audit is structurally inadequate for open-ended generative tasks.

Source: HuggingFace Daily Papers | Links: Paper · Wiki | Tier: 2, agent skills, evaluation

Agent skills are now deployed as modular, reusable capability units. A bad skill scales linearly with deployment. MedSkillAudit is a layered, pre-deployment audit framework that scored 75 medical research skills across five categories against two human experts. System-expert agreement (ICC = 0.449) exceeded the human inter-rater baseline (0.300), and 57.3% of skills fell below the Limited Release threshold. Protocol Design showed the strongest agreement (ICC = 0.551). Academic Writing showed a negative ICC (-0.567), revealing a structural rubric-expert mismatch on open-ended generative tasks.
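For readers unfamiliar with how an ICC can go negative, here is a short worked sketch. It uses the one-way random-effects form, ICC(1,1); which ICC variant the paper reports is not stated in this digest, so the choice is an assumption made for brevity, and the ratings are invented.

```python
def icc_one_way(ratings):
    """ICC(1,1) for ratings[i][j] = score of subject i by rater j.

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB is the between-subject
    mean square and MSW the within-subject mean square.
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two subjects and two raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    means = [sum(row) / k for row in ratings]
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(ratings, means) for x in row) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    if denom == 0:
        raise ValueError("all ratings identical; ICC undefined")
    return (msb - msw) / denom

# Invented scores: (system, expert) for three skills.
agree = [(1, 1), (2, 2), (3, 3)]     # raters rank the skills the same way
oppose = [(1, 3), (2, 2), (3, 1)]    # raters rank the skills in opposite order

icc_agree = icc_one_way(agree)
icc_oppose = icc_one_way(oppose)
```

The coefficient turns negative when the two raters disagree about a skill more than the skills differ from each other, which is the signature of the rubric and the experts scoring different constructs.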

The negative-ICC finding is the more interesting result. When a rubric and human experts disagree systematically, the rubric is measuring a different construct than the experts are. For high-stakes generative tasks like academic writing, rubric-based audit may be structurally inadequate. The wiki has been tracking rubric-expert mismatch as a recurring failure mode at the high end of difficulty (ProgramBench 0%, AcademiClaw 55%, PhysicianBench 46%). MedSkillAudit's negative ICC on Academic Writing is consistent with that broader pattern, but it is the first paper to surface the mismatch as a named, quantified finding rather than a disagreement at the edge of a benchmark distribution.

The pairing with the Marcus production-agent security study (05-06) is structural. Marcus measured what goes wrong after deployment (91% tool-chaining vulnerability, 89.4% goal drift). MedSkillAudit measures what should be caught before deployment. Together they form a deployment-pipeline gate: pre-deployment audit (MedSkillAudit) plus post-deployment monitoring (the kind that the Marcus paper argues is missing).

Why it matters: The first audit framework specifically targeting agent skill release readiness. The 57.3% rejection rate at the Limited Release threshold is the data point that matters: without a pre-deployment gate, most of the audited skills would have shipped despite falling below it.

Research angle: A skill-audit framework that adapts its rubric to the negative-ICC categories is missing. The paper identifies the failure (Academic Writing rubric-expert mismatch) but does not propose a remediation. A natural next step: replace the rubric with a structured human-in-the-loop scoring protocol for the categories where rubric-expert agreement fails.

→ Full summary

Industry Pulse

OpenAI Deployment Safety Board structure exposed. (Twitter / @ns123abc citing Helen Toner) Helen Toner testified under oath that OpenAI's Deployment Safety Board (the formal review body for model deployment) had three Microsoft members and three OpenAI members, with majority-vote approval and tie-breaking that placed Sam Altman as one of the three OpenAI seats. With four votes needed and three Microsoft + Altman = four, deployment could be approved even if both other OpenAI safety members voted no. This is a procedural disclosure from the Musk trial, not a technical one, but it documents the governance structure under which every flagship OpenAI model since the Microsoft partnership was approved. Pairs with the wiki's tracking of the Musk trial through the 05-05 and 05-06 digests.
Colossus 1 → Anthropic lease economics. (Twitter / @ns123abc commentary) Independent commentary on the structure of Anthropic's reported $4B/yr, 220k-GPU lease of xAI's Colossus 1 facility, originally reported as the SpaceX Colossus-1 deal in the 05-06 digest. The commentary frames the deal in terms of xAI's Colossus 1 hitting 11% utilisation after frontier training moved to Colossus 2, with the lease payments serving as collateral for further xAI borrowing. Commentary, not reporting, but the unit economics it lays out are consistent with the published deal terms and worth filing for the inference-economics thread.
AI music platforms now produce 211k-song / 10k-hour training corpora. (APEX paper) The APEX paper trained on 211,000 songs from Suno and Udio, jointly predicting popularity and aesthetic quality for AI-generated music. The substantive industry signal is that AI-music platforms now generate enough consumption to support training corpora at language-model scale. This is the first wiki data point on AI-generated music as a measurable consumption surface.

Connecting the Dots

HETEROGENEOUS-INFORMATION-DENSITY: TWO MODALITIES, NINE PAPERS, ONE PRINCIPLE
───────────────────────────────────────────────────────────────────────────────
TIP (04-16, text)                  10% of tokens carry distillation signal
KV Packet (04-17, text)            cross-context KV reuse
TurboQuant (04-22, text KV)        2.5–3.5 bits per channel suffice
PrfaaS (04-22, text KV)            cross-datacenter transport
MotionCache (05-05, video)         skip denoising on low-motion pixels
Stream-R1 (05-07, video)           reweight DMD by reliability + saliency
Stream-T1 (05-07, video KV)        content-aware KV eviction
LIVEditor / ISA (05-07, video)     route attention by Query sharpness
D-OPSD (05-07, image)              self-distill under conditioning asymmetry

Shared principle: the iteration unit (token, KV row, KV bit, KV transfer,
                  denoising step, rollout, pixel, query, conditioning context)
                  has heterogeneous information density and should be
                  allocated proportionally.

Today's cluster is the most concentrated arrival of this pattern in the wiki's history. Four papers, one day, all attacking uniform-supervision waste in compressed video diffusion through different reweighting mechanisms. The architectural lesson is that a single content-quality signal (a small reward model, a Query sharpness score, a conditioning-asymmetry probe) can drive a portfolio of waste-reduction policies across training and inference simultaneously. Stream-R1 and Stream-T1 use the same pretrained-video-reward primitive: one for distillation reweighting, one for KV eviction. ISA uses Query sharpness for both context pruning and attention routing.

AGENTIC-SEARCH PORTFOLIO ARGUMENT
─────────────────────────────────────────────────
BRIGHT-Pro (05-07)         text retrieval     evidence-portfolio metric
OpenSearch-VL (05-07)      multimodal RL      multi-turn fatal-aware GRPO
Ctx2Skill (05-05)          synthesis          skill-portfolio extraction
Pentester (Huang, 05-05)   harness            belief-state portfolio

Common claim: agentic capability is portfolio construction over a search space,
              not single-shot correctness on a fixed input.

BRIGHT-Pro and OpenSearch-VL both arrive on 05-07 with the same critique: single-passage relevance metrics measure the wrong thing for agents. Together with Ctx2Skill (05-05) and the Ken Huang pentester study (05-05), the agentic-systems thread now has four papers in two weeks all making the same structural argument that portfolio construction is the load-bearing capability, not point answers.

Cross-day resolution. Yesterday's Worth Watching item was: "agent security vs harness design, 60-day window." OpenSearch-VL's multi-turn fatal-aware GRPO is the first concrete training-time intervention at the same multi-turn surface where the Marcus paper measured 89.4% goal drift and 91% tool-chaining vulnerability. Whether agents trained with this recipe show reduced drift in production is the obvious empirical question. Partial resolution: the harness-design side now has a training-time primitive that can be evaluated against the production-deployment baseline.
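
For readers who want the shape of the intervention, the sketch below shows one possible reading of fatal-aware credit assignment on top of group-normalised (GRPO-style) advantages. The exact OpenSearch-VL rule is not reproduced here. Everything about the clamp is an assumption: turns at or after the fatal turn are masked to zero, and turns before it have their advantage clamped from below at zero so pre-failure reasoning is never penalised for the later failure. `group_advantages`, `per_turn_advantages`, and `fatal_turn` are hypothetical names.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalised advantages: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_turn_advantages(advantage, num_turns, fatal_turn=None):
    """Spread one trajectory-level advantage over its turns.

    fatal_turn: index of the first unrecoverable turn, or None if the
                trajectory never hit a fatal failure (assumed annotation).
    """
    if fatal_turn is None:
        # Ordinary trajectory: every turn shares the trajectory advantage.
        return [advantage] * num_turns
    # Assumed one-sided clamp: keep positive credit, drop negative credit.
    clamped = max(advantage, 0.0)
    # Assumed mask: turns from the failure onward contribute no gradient.
    return [clamped if t < fatal_turn else 0.0 for t in range(num_turns)]


# A trajectory that scored above the group mean but failed at turn 2 of 4.
print(per_turn_advantages(0.5, 4, fatal_turn=2))   # [0.5, 0.5, 0.0, 0.0]
# A failed trajectory below the group mean contributes nothing under this reading.
print(per_turn_advantages(-1.0, 4, fatal_turn=2))  # [0.0, 0.0, 0.0, 0.0]
```

Under this reading the clamp is asymmetric by design: standard GRPO would push down every turn of a failed trajectory, including the reasoning that preceded the failure.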

Cross-day pattern. The wiki has been tracking the iteration-as-optimisation-unit principle since 04-16. As of 05-07, the pattern has nine instances spanning text, video, and image, training-time and inference-time, distillation and routing. At nine instances across three modalities the pattern is no longer a hypothesis. It is the dominant design principle for compute-efficient post-training and inference in mid-2026.

Worth Watching

Single content-signal driving multiple waste-reduction policies, 60-day window. Stream-R1 and Stream-T1 share a pretrained-video-reward model across distillation reweighting and KV eviction. LIVEditor uses Query sharpness for both context pruning and attention routing. The architectural pattern is: one content signal, multiple knobs. The first paper to compose this with text-side workloads (a single perplexity or reward signal driving distillation, attention pruning, and KV eviction simultaneously) closes the cross-modal loop. Falsifiable claim: by 2026-Q3, at least one production language-model serving stack will route attention or KV by a content-quality signal rather than recency or magnitude.
D-OPSD-style conditioning-asymmetry self-distillation in language models, 90-day window. The conditioning-asymmetry trick (teacher sees evidence, student does not, same network) is structurally identical for reasoning models. Falsifiable claim: by 2026-Q3, a language reasoning paper will replicate D-OPSD by conditioning the teacher on a problem plus gold solution and the student on the problem alone, and report gains over standard self-distillation.
Multi-turn fatal-aware GRPO outside search agents, 60-day window. OpenSearch-VL's one-sided advantage clamping should generalise to any RL setting with partial-trajectory success and terminal failure. Falsifiable claim: by 2026-Q3, a code-generation or planning RL paper will replicate the clamp and report stability gains relative to standard GRPO on multi-turn tool-use trajectories.
Content-aware KV eviction in language inference, 90-day window. Stream-T1 ships the first content-aware KV eviction policy the wiki has tracked, and it targets streaming video. Text-side eviction is still driven by recency or magnitude. The first text-side paper to route KV evictions by a reward-feedback or content-saliency signal would close the cross-modal symmetry. The constraint is a cheap inference-time signal, the same constraint that speculative decoding lives under, so the natural integration is a draft model that outputs both token suggestions and retention scores.
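
The contrast in the last item, recency eviction versus eviction by a retention score, can be sketched in a few lines. This is not Stream-T1's Memory Sinking mechanism and not any serving stack's API. `keep_recent`, `keep_by_score`, and the retention scores (imagined here as coming from a cheap side signal such as a draft model) are hypothetical.

```python
def keep_recent(positions, capacity):
    """Recency policy: keep the `capacity` most recent cache positions."""
    if capacity <= 0:
        return []
    return sorted(positions)[-capacity:]


def keep_by_score(scores, capacity):
    """Content-aware policy: keep the `capacity` positions with the highest
    retention score, breaking ties in favour of the more recent position.

    scores: dict mapping cache position -> retention score (assumed signal).
    """
    if capacity <= 0:
        return []
    ranked = sorted(scores, key=lambda p: (scores[p], p), reverse=True)
    # Return survivors in positional order, as a cache would store them.
    return sorted(ranked[:capacity])


# Position 0 is old but scored as important; recency alone would drop it.
retention = {0: 0.9, 1: 0.1, 2: 0.2, 3: 0.8, 4: 0.3}
print(keep_recent(retention, 2))    # [3, 4]
print(keep_by_score(retention, 2))  # [0, 3]
```

The policy itself is trivial. The open question the item raises is where a retention score cheap enough for inference time would come from.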

Quick Hits

JoyAI-Image / Awaking Spatial Intelligence. Unified multimodal foundation model coupling a spatially enhanced MLLM with a Multimodal Diffusion Transformer. The bidirectional loop between perception and generation is what makes spatial reasoning emerge. Tier 3 multimodal, but the unified-architecture claim is structurally adjacent to CoPD's parallel co-evolution framing. (Paper)

RLDX-1: VLA humanoid manipulation. Multi-Stream Action Transformer with modality-specific streams plus cross-modal joint self-attention. 86.8% on ALLEX humanoid tasks where pi0.5 and GR00T N1.6 land around 40%. The modality-stream-preservation architectural claim is the part worth filing. (Paper)

HERMES++: driving world model. Unifies 3D scene understanding and future geometry prediction in one model with BEV representation, LLM-enhanced world queries, and Joint Geometric Optimisation. Tier 4 by reader profile, but the semantic-guidance-of-geometric-prediction pattern generalises. (Paper)

PhysForge: physics-grounded 3D assets. Two-stage VLM-architects-then-diffusion-realises pipeline with PhysDB (150k assets, four-tier physical annotations) and KineVoxel Injection. The plan-then-realise factoring is clean. Tier 4. (Paper)

Multi-View Proficiency Estimation on Ego-Exo4D. SkillFormer, PATS, and ProfVLM (a generative-feedback variant) achieve state of the art with 20x fewer trainable parameters and 3x fewer epochs. The shift from closed-set classification to interpretable feedback generation matches the rubric-with-rationale trend. Tier 4. (Paper)

APEX: AI-generated music popularity. 211k songs / 10k hours from Suno and Udio. Joint prediction of engagement-based popularity and five aesthetic quality dimensions. Tier 4, but the corpus size is the industry signal: AI music is now a measurable consumption surface at language-model training scale. (Paper)

Sources ingested today: 13 papers + Twitter afternoon slot. Wiki pages updated: 13 summaries + 3 concept pages.