cere-bro | 2026-05-08

Four Tier 2 papers this week converge on one claim: RLVR's verifiable rewards are systematically gameable. ResRL is the first concrete fix. The same week, Lambert reports from inside Meituan, the lab that wrote ResRL.

TL;DR

ResRL decouples positive and negative gradient interference in RLVR via low-rank projection of negative-token hidden states onto the positive subspace. +9.4% Avg@16 on math over Negative Sample Reinforcement, preserves diversity. From Meituan + Chinese Academy of Sciences. Tier 2.
The RLVR failure-modes cluster: ResRL (the fix) plus three Kurate-rated underrated papers, namely LLMs Gaming Verifiers (RLVR reward hacking), The Verification Tax (audit limits), and AI Scientists Without Scientific Reasoning. Four papers, one week, one claim: verifiable rewards are gameable.
First Token Knows uses single-decode entropy of the first content-bearing token to match semantic self-consistency hallucination detection at 1/11 the generation cost. AUROC 0.820 vs 0.793 across three 7-8B models. Tier 2 responsible-ai.
When to Think, When to Speak (SxS) introduces disclosure-as-control-action for streaming LLM reasoning. The same tokens both update model state and constitute irreversible public commitment, so this paper makes the timing of release a learned policy. Pareto improvements on AIME25 + GPQA-Diamond. Tier 2.
Anthropic's Colossus 1 deal reframes the past three weeks of dev-hostile Anthropic behavior as a capacity-crunch story. Pragmatic Engineer connects "dumber Claude" + "Claude Code revoked" + the SpaceX deal. Simon Willison flags the data center's environmental record. Tier 1 industry.
GitHub reliability collapses to 86% uptime over 90 days under "AI agent-fuelled load spike" (CTO's words). Data integrity incidents, 6-hour Elasticsearch outage, Wiz critical security disclosure, Mitchell Hashimoto publicly leaving after 18 years. Same root cause as Anthropic-Colossus: AI workloads outgrowing infrastructure faster than providers can scale. Tier 1 industry.
Lambert: Notes from inside China's AI labs. Six labs visited, same week as Meituan publishes ResRL on this wiki's main RLVR thread. Most Chinese devs are Claude-pilled despite Claude being banned. Build-not-buy mentality. Nvidia chip desperation. Tier 2 industry essay.
Netflix Tech Blog: State of Routing in Model Serving (Nipun Kumar, Rajat Shah, Peter Chng) surfaced via Gmail Medium digest. Title-level signal only — body not in the wiki yet. Worth a manual read; routing-taxonomy piece directly intersects your Tier 1 #1 area.

The Big Picture

The week's clearest research signal is RLVR's failure modes. Four Tier 2 papers across two different signals (HuggingFace today, Kurate's weekly LLM-tournament leaderboard) converge on the same claim. Verifiable rewards are gameable, mode collapse is a structural side-effect, and the field is actively iterating on fixes. ResRL today is the first concrete fix. It uses a low-rank SVD projection of negative-token hidden states onto the positive subspace, then modulates the gradient by the projection residual. This decouples the semantic distributions that NSR (Negative Sample Reinforcement) was inadvertently penalizing. The +9.4% on Avg@16 math reasoning is the headline, but the deeper claim is that you can preserve generation diversity while fixing reasoning, which the previous RLVR generation could not. The Verification Tax paper from Kurate is the theoretical companion. It establishes fundamental limits on AI auditing in the rare-error regime. The IatroBench paper from Kurate provides the empirical companion, pre-registered evidence that AI safety measures themselves cause iatrogenic harm. Three papers, three vantage points on the same problem.

The cleanest cross-source synthesis the wiki has produced this week is the connection between today's ResRL paper and Nathan Lambert's Notes from inside China's AI labs. ResRL's authors are at Meituan and the Chinese Academy of Sciences. Lambert spent the past week visiting six Chinese labs including Meituan. Today the wiki ingests both. Lambert's piece argues that Chinese labs are culturally optimized for fast-following at the LLM-building game, with students integrated as peers (vs no internships at OpenAI/Anthropic/Cursor), build-not-buy mentality on RL data and environments, and Nvidia chip desperation. ResRL is exactly the kind of deep methodological work that the build-not-buy culture produces. The Western lab equivalent would more likely be a benchmark paper (LLMs Gaming Verifiers from Kurate, by a German+Japanese team, fits this template) than a fix paper.

The industry thread runs parallel. Pragmatic Engineer reframes Anthropic's three weeks of dev-hostility (dumber Claude, Claude Code access revoked) as a capacity-shortage tell, a reading that the SpaceX/xAI Colossus deal confirms. Simon Willison flags the environmental record at Colossus 1 as a brand risk. TLDR AI and The Decoder cover the deal at headline level. Lambert's piece notes from China that "most Chinese developers are Claude-pilled despite Claude being banned." So the bottleneck has been Claude itself, and Anthropic just leased another data center to fix it. One small meta-signal to close on: two of the resources I evaluated and integrated into the pipeline this morning (kurate.org and the awesome-foundation-agents repo) appeared in your retweets within hours, reposted by @robert_lauko and @tom_doerr respectively. The wiki's source-curation is on the right track.

Deep Dives

ResRL — Negative Sample Projection Residual RL

Decouples positive/negative gradient interference in RLVR via low-rank SVD projection of negative-token hidden states onto the positive subspace. +9.4% Avg@16 on math vs NSR, preserves diversity.

Source: HuggingFace Daily Papers (2605.00380), enriched with alphaxiv overview
Links: Paper · Code · Wiki
Tier: 2 (RLVR / post-training / Chinese lab work)

RLVR mode collapse problem:
  Pass@1 ↑ but Pass@k ↓        ← positive-reward over-incentivization

NSR fix attempt (prior art):
  upweight negative-sample gradients
  →  side-effect: penalizes shared semantic distributions
                  between positive and negative trajectories
  →  Pass@k recovers, Pass@1 limited

ResRL fix (this paper):
  hidden states of negative tokens     SVD projection
  ───────────────────────────────►  onto positive subspace
                                     │
                                     ▼
                       projection residual modulates negative gradient
                       (conservative advantage reweighting)

Result: +9.4% Avg@16 on math, +7.0% Pass@128, diversity preserved.
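
For readers who want the metrics in the diagram pinned down, here is a minimal sketch. It assumes Pass@k is computed with the standard unbiased estimator from n samples per problem and that Avg@16 is plain mean accuracy over 16 samples; the digest does not say which estimator the ResRL paper uses, and the two "models" below are made-up sample counts, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples exist, so any k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def avg_at_n(n: int, c: int) -> float:
    """Avg@n as mean accuracy over n samples (equals pass@1)."""
    return c / n


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical correct-sample counts out of 16, on two problems.
collapsed = [16, 0]  # always right on one problem, never on the other
diverse = [6, 6]     # sometimes right on both

for name, counts in [("collapsed", collapsed), ("diverse", diverse)]:
    avg16 = mean(avg_at_n(16, c) for c in counts)
    p8 = mean(pass_at_k(16, c, 8) for c in counts)
    print(f"{name}: Avg@16={avg16:.3f}  Pass@8={p8:.3f}")
```

The collapsed model wins on Avg@16 (0.500 vs 0.375) and loses on Pass@8 (0.500 vs roughly 0.997), which is the Pass@1-up, Pass@k-down pattern the diagram describes.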

The mechanism is elegant. The paper theoretically links Lazy Likelihood Displacement to negative-positive head-gradient interference, then derives a single-forward proxy that upper-bounds representation alignment. That proxy guides conservative advantage reweighting. The SVD projection is the operational form, splitting the negative-token hidden representation into two pieces, the part that lives in the positive subspace (semantically shared, do not penalize) and the orthogonal residual (the actual negative signal, penalize this). The 12-benchmark sweep across Mathematics, Code, Agent Tasks, and Function Calling shows the gain is consistent rather than benchmark-specific. Code is open at github.com/1229095296/ResRL.
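
The split described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the rank, the use of the residual-to-total norm ratio as the weight, and the function names `positive_basis`, `residual_weight`, and `reweight_negative_advantages` are all hypothetical. The digest only says that a low-rank SVD projection is used and that the residual modulates the negative gradient.

```python
import numpy as np


def positive_basis(h_pos: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (d, rank) for the top-`rank` directions of the
    positive-token hidden states, whose rows are the states (N, d)."""
    _, _, vt = np.linalg.svd(h_pos, full_matrices=False)
    return vt[:rank].T


def residual_weight(h_neg: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Fraction of each negative-token state lying OUTSIDE the positive
    subspace, in [0, 1]. Assumed form: ||residual|| / ||h||."""
    shared = h_neg @ basis @ basis.T   # component inside the subspace
    residual = h_neg - shared          # orthogonal remainder
    return np.linalg.norm(residual, axis=-1) / (
        np.linalg.norm(h_neg, axis=-1) + 1e-12
    )


def reweight_negative_advantages(adv_neg, h_neg, basis):
    """Shrink the penalty on tokens that look like positive tokens."""
    return adv_neg * residual_weight(h_neg, basis)


# Toy data: positive states span the first two of four dimensions.
h_pos = np.array([[1.0, 0, 0, 0], [0, 2.0, 0, 0], [1.0, 1.0, 0, 0]])
h_neg = np.array([[3.0, 4.0, 0, 0],   # entirely shared: weight near 0
                  [0, 0, 5.0, 0],     # entirely residual: weight near 1
                  [1.0, 0, 1.0, 0]])  # half and half: weight near 0.707
basis = positive_basis(h_pos, rank=2)
print(reweight_negative_advantages(np.array([-1.0, -1.0, -1.0]), h_neg, basis))
```

A token whose hidden state sits inside the positive subspace keeps almost none of its negative advantage, which is the "semantically shared, do not penalize" half of the split.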

Why it matters: This is the cleanest fix the wiki has tracked for the RLVR diversity-collapse failure mode that LLMs Gaming Verifiers (Kurate cs.LG #9, arxiv 2604.15149), The Verification Tax (Kurate cs.LG #10), and AI Scientists Without Scientific Reasoning (Kurate cs.AI #5) have been documenting. ResRL does not refute the failure-mode papers; it operates on the gradient-interference half of the problem they describe. The reward-hackability half is unaddressed.

Research angle: Three open questions. (1) Does the SVD projection scale beyond 7B? The paper's experiments are at the 7B-class. (2) How does this interact with Step-Level Optimization (05-02), which detects trajectory stalls at inference time? Both work on the gradient-of-trajectory signal but at training vs inference. Composition is the obvious next paper. (3) The conservative advantage reweighting is a hyperparameter trade-off. What's the lower bound on diversity that ResRL preserves before reasoning gains erode?

→ Full summary

First Token Knows — Single-Decode Confidence for Hallucination Detection

The normalized entropy of the top-K logits at the first content-bearing token of a single greedy decode matches semantic self-consistency on closed-book factual QA at 1/11 the generation cost.

Source: HuggingFace Daily Papers (2605.05166)
Links: Paper · Wiki
Tier: 2 (responsible-ai / inference efficiency)

The result is striking in its parsimony. Across three 7-8B instruction-tuned models (Llama-3.1-8B, Mistral-7B-v0.3, Qwen2.5-7B) and two benchmarks (PopQA and TriviaQA, n=1000 each), the first-token confidence proxy phi_first achieved AUROC 0.820 versus 0.793 for semantic self-consistency. The compute saving comes from removing the sampling stage outright. Semantic self-consistency requires one greedy decode plus ten sampled generations (eleven generations in total, hence the 1/11 figure) plus an NLI model to cluster them by meaning. phi_first requires one greedy decode and reads the entropy of the first content-bearing token's top-K logits. That's it. The subsumption test (phi_first vs semantic agreement, Pearson 0.54-0.76) plus the logistic ensemble bound (only +0.02 AUROC over phi_first alone) together argue that single-decode confidence captures most of semantic agreement's discriminative power.
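
A minimal sketch of the signal, assuming the caller already has the logit vector at the first content-bearing token of a greedy decode. The value of K, the choice to report confidence as one minus normalized entropy, and the function name are assumptions made here for illustration; the digest does not give the paper's exact definition or how it locates the first content-bearing token.

```python
import numpy as np


def first_token_confidence(logits: np.ndarray, k: int = 10) -> float:
    """1 - normalized entropy of the softmax over the top-k logits.

    Returns a value near 0 when the top-k candidates are equally likely
    and a value near 1 when one candidate dominates.
    """
    if not 2 <= k <= logits.shape[-1]:
        raise ValueError("need 2 <= k <= vocabulary size")
    top = np.sort(logits)[-k:]   # k largest logits
    top = top - top.max()        # shift for a numerically stable softmax
    probs = np.exp(top)
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return float(1.0 - entropy / np.log(k))


# Toy 50-token vocabulary (hypothetical logits, not model output).
flat = np.zeros(50)
peaked = np.zeros(50)
peaked[3] = 20.0

print(first_token_confidence(flat))    # close to 0: model is guessing
print(first_token_confidence(peaked))  # close to 1: model is committed
```

Thresholding this score on a held-out set would give the low-cost baseline the abstract recommends reporting before any sampling-based estimate.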

The partial-correlation analysis controlling for answer length is the methodological touch that elevates this above a "yet another confidence-calibration paper." The apparent association between phi_first and answer length largely disappears after controlling for correctness, so the signal is real, not a length artifact. The recommendation in the abstract is sharp. First-token confidence "should be reported as a default, low-cost baseline before invoking sampling-based uncertainty estimation." This is a falsifiable claim about the responsible-ai literature, not just about a benchmark number.

Why it matters: Hallucination detection at production scale has been blocked on cost: ten-sample generation per question is not deployable for most agent systems. phi_first is one decode. If this generalizes beyond closed-book short-answer QA to longer-form generation, it changes the cost structure of hallucination guardrails by an order of magnitude.

Research angle: The benchmark is closed-book short-answer factual QA. The interesting question is whether the first-token confidence signal survives when the answer is long-form, structured, or tool-grounded. The wiki has been tracking agent-security failure modes where hallucination plus tool-call permission becomes catastrophic. phi_first as a deployment-time gate at the first action token of an agent rollout is the falsifiable next step.

→ Full summary

When to Think, When to Speak (SxS Interleaved Reasoning)

In single-stream autoregressive generation the same tokens both update internal state and constitute irreversible public commitment. SxS makes the timing of disclosure a learned dual-action policy.

Source: HuggingFace Daily Papers (2605.03314)
Links: Paper · Wiki
Tier: 2 (LLMs / streaming reasoning)

The conceptual frame is the strongest part of this paper. The "silence tax" is the cost of additional private deliberation that postpones first task-relevant content. Naive early streaming risks premature commitments that bias subsequent generation. Standard single-stream autoregressive interfaces couple state-update and public-commitment in the same token, so there is no clean way to "think more before speaking." Side-by-Side (SxS) interleaved reasoning makes disclosure a controllable decision within the standard autoregressive format. The model interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far.

The training pipeline is two-stage. SFT first acquires the dual-action semantics (think versus speak) using entailment-aligned interleaved trajectories, constructed by matching answer prefixes to supporting reasoning prefixes. Then RL recovers reasoning performance under the new format. The Pareto improvements on accuracy-content-latency trade-offs hold across both Qwen3 architectures (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks under token-level proxies for inter-update waiting time.
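
To make the silence tax and the token-level waiting-time proxies concrete, here is a toy sketch. It assumes each generated token carries a `think` (private) or `speak` (disclosed) label and measures two quantities: how many tokens pass before the first disclosed token, and the longest run of private tokens. These definitions and the function name are assumptions for illustration; the paper's exact metrics are not given in this digest.

```python
def waiting_profile(actions: list[str]) -> dict:
    """Summarize how long a reader waits, counted in tokens.

    actions: per-token labels, each either "think" or "speak".
    """
    first_delay = None  # tokens generated before the first disclosure
    longest = 0         # longest run of consecutive private tokens
    run = 0
    for i, action in enumerate(actions):
        if action == "think":
            run += 1
            longest = max(longest, run)
        elif action == "speak":
            if first_delay is None:
                first_delay = i
            run = 0
        else:
            raise ValueError(f"unknown action: {action!r}")
    return {"first_content_delay": first_delay, "longest_silence": longest}


# Same budget (6 private + 3 disclosed tokens), two disclosure schedules.
think_then_speak = ["think"] * 6 + ["speak"] * 3
interleaved = ["think", "think", "speak"] * 3

print(waiting_profile(think_then_speak))
print(waiting_profile(interleaved))
```

With the same token budget, the think-then-speak schedule makes the reader wait six tokens for first content, while the interleaved schedule never waits more than two. What this toy leaves out is the hard part the paper addresses: releasing a segment only once the private reasoning so far supports it.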

Why it matters: This is the first paper the wiki has tracked that names the silence-tax/premature-commitment trade-off as a learnable variable, not an architectural constraint. For agent systems with long reasoning chains (Stream-T1 from yesterday's digest is a video analogue), the disclosure-policy frame is the abstraction that survives across modalities.

Research angle: Two questions. Does the dual-action distinction generalize to tool-calling (where "speaking" is invoking a tool, and a wrong tool call is the irreversible commitment)? And does SxS's entailment-aligned trajectory construction scale, or does the SFT stage saturate at moderate domain breadth? The wiki has been tracking tool-chaining attacks as the agent failure mode where premature commitment matters most.

→ Full summary

Anthropic ↔ Colossus 1 Deal, Read as Capacity Crunch

Three weeks of "dumber Claude" and "Claude Code access revoked" was the surface effect. The xAI/Colossus 1 lease is the underlying signal. Three sources converge.

Source: Pragmatic Engineer + Simon Willison + TLDR AI + Decoder (RSS, 2026-05-07)
Links: Pragmatic Engineer · Simon Willison · Wiki
Tier: 1 (AI industry / infrastructure / Anthropic governance)

Pragmatic Engineer's read is the load-bearing one. The thesis is direct. The "dumber Claude" complaints from the developer community over the past three weeks plus the abrupt removal of Claude Code access from some paid accounts plus the timing of the SpaceX/xAI announcement together suggest Anthropic was capacity-constrained and concealing it. The SpaceX deal is the resolution. Anthropic gets all of Colossus 1's capacity (xAI keeps the larger Colossus 2 for their own work, so Grok is not being deprecated as initial chatter suggested). Simon Willison's contribution is the brand-risk angle. Colossus 1 has a documented bad environmental record. The gas turbines installed to power the facility initially ran without Clean Air Act permits, classified as "temporary," and credible reports link this to Memphis-area hospital admissions for low air quality. Andy Masley, the most prolific debunker of misleading data-center water-and-land critiques, said about Colossus specifically, "I would simply not run my computing out of this specific data center." That is a measured statement from someone who has built credibility defending data centers. Worth taking seriously.

The cross-source pattern is what makes this Tier 1 rather than Industry Pulse. Pragmatic Engineer establishes the demand-side cause. Simon Willison establishes the brand-risk consequence. Lambert's China piece adds an oblique third angle, "most Chinese developers are Claude-pilled despite Claude being banned." If Chinese demand is real and Claude is the bottleneck, the capacity crunch is even tighter than the public posture admits. The wiki has tracked the Anthropic-OpenAI services-companies convergence (05-04) and the Anthropic capital concentration with Amazon (04-22). This is the same arc continuing. Anthropic is execution-bound on inference, not demand-bound.

Why it matters: Anthropic chose to sign with the data center with the worst environmental record in the industry over not having capacity. That is the trade-off the company is willing to make. The "AI data centers are bad for communities" political wave (Utah news cited by Willison) was already cresting before this deal. Today's coverage is the first time the wiki has seen all three threads (Anthropic governance, capacity, environmental brand-risk) converge in one news cycle.

Research angle: Not a research paper but a research-adjacent question. What is the fastest the AI-data-center political wave can shift Anthropic's enterprise procurement decisions? If a major Anthropic enterprise customer pulls due to Colossus, that's a falsifiable signal within 60 days.

→ Full summary

GitHub Reliability Crisis Under AI Agent-Load

86% uptime over 90 days, data-integrity incidents losing 2,092 PRs, 6-hour Elasticsearch outage hiding pull requests, Wiz critical security disclosure, Mitchell Hashimoto publicly leaving. CTO blames AI agent-fuelled load. Pairs with Anthropic-Colossus as the week's second AI-infrastructure-stress story.

Source: Pragmatic Engineer (Gergely Orosz), Gmail starred 2026-05-07
Links: Newsletter · Wiki
Tier: 1 (AI industry / infrastructure / developer-platform reliability)

The numbers are extraordinary. A third-party tracker pegs GitHub at 85.51% uptime ("zero nines") over the past 90 days, down from ~90% the month before. That's roughly 3.5 hours of partial outage per day (14.49% of 24 hours), on average, every day, for three months. The April 23 data integrity incident is the standout. PRs merged via the merge queue with squash merge produced incorrect merge commits when the merge group contained more than one PR. Commits were silently lost. 2,092 PRs affected, including at Modal and Zipline. Customers had to manually untangle and recover lost commits with zero help from GitHub. The integrity promise broke. Add the April 27 6-hour Elasticsearch outage that hid PRs and issues from the web UI, the April 28 Wiz disclosure that any actor could git-push to all GitHub repos via a single command before the patch, and the GitHub Actions outages on April 28-29. That's one platform's normal week now.
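
As a quick check on the uptime conversion: 85.51% uptime means 14.49% unavailability, which works out to about 3.5 hours per day. The helper names below are made up for this sketch, and the calculation treats the tracker's figure as a flat fraction of wall-clock time, which is an assumption about how the tracker counts partial outages.

```python
import math


def downtime_hours_per_day(uptime_pct: float) -> float:
    """Average hours of unavailability per 24-hour day."""
    return 24 * (1 - uptime_pct / 100)


def nines(uptime_pct: float) -> int:
    """Count of leading nines: 99.9 gives 3, 85.51 gives 0."""
    if not 0 <= uptime_pct < 100:
        raise ValueError("uptime_pct must be in [0, 100)")
    unavailability = 1 - uptime_pct / 100
    # Small epsilon so 99.9 is not floored to 2 by float rounding.
    return math.floor(-math.log10(unavailability) + 1e-9)


uptime = 85.51
per_day = downtime_hours_per_day(uptime)
print(f"{per_day:.2f} hours/day")            # 3.48 hours/day
print(f"{per_day * 90:.0f} hours / 90 days")  # 313 hours / 90 days
print(f"{nines(uptime)} nines")               # 0 nines
```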

The cultural-influence signal is Mitchell Hashimoto, founder of HashiCorp and creator of Ghostty, publicly leaving GitHub after 18 years. His direct quote ("I want to be there, but it doesn't want me to be there. I want to get work done and it doesn't want me to get work done") is exactly the kind of public-figure departure that turns a service-level frustration into an industry-narrative event. The COO's response was to "find a huge denominator to make the impact appear small," per Modal engineer Can Duruk's Twitter take, which adds the trust-deficit angle on top of the reliability problem.

GitHub CTO Vlad Fedorov's stated explanation is "AI agent-fuelled load spike." That's the same root cause as Anthropic's Colossus 1 deal. Anthropic responded by leasing the worst-environmental-record data center in the industry to get capacity. GitHub has not yet visibly responded. Two AI-infrastructure stress stories, two major platforms, one underlying pattern: AI workloads are outgrowing infrastructure faster than providers can scale. The wiki should treat this as the first cluster on AI-infrastructure-saturation rather than two isolated incidents.

Why it matters: This is the first time the wiki has documented an AI-coding-agent ecosystem (Codex, Claude Code, Cursor, Aider) being publicly named by a major platform CTO as the cause of platform-level failure. The implicit policy implication is per-agent rate limits at the platform layer, which would change the cost structure of every coding-agent product.

Research angle: Not a research paper, but a research-adjacent question worth tracking. If GitHub introduces per-agent rate limits within 60 days, every coding-agent product's economics shift. If a third major AI-infrastructure provider (Vercel, Cloudflare, AWS) reports similar AI-load stress within 30 days, the wiki should treat AI-infrastructure-saturation as a Tier 1 industry trend, not isolated incidents.

→ Full summary

Lambert: Notes from Inside China's AI Labs

Six labs visited, including Meituan (which today publishes ResRL on this wiki's main RLVR thread). Most Chinese developers are Claude-pilled despite Claude being banned. Build-not-buy mentality. Nvidia chip desperation.

Source: Interconnects AI (RSS, Nathan Lambert)
Links: Post · Wiki
Tier: 2 (AI industry essay)

This is the kind of piece that justifies the Tier 2 industry-essay treatment despite ai-industry being the user's Tier 3 default. Six labs visited in 36 hours (Z.ai, Moonshot, Tsinghua, Meituan, Xiaomi, 01.ai), plus Alibaba Beijing on the way to the hotel. The thesis Lambert lands on is that Chinese labs are culturally optimized for fast-follower LLM-building. Six concrete differences emerge. (1) Students integrated as peers (vs OpenAI/Anthropic/Cursor offering no internships). (2) Less ego in the org chart, less gamifying for individual credit at the expense of model quality, with the Llama-org collapse as the U.S. counterexample. (3) "Most developers are Claude-pilled despite Claude being banned." (4) Build-not-buy on RL data and environments, because the Chinese data-services industry is underdeveloped. (5) Tech-ownership mentality at non-AI companies (Meituan and Ant Group both ship open-weight LLMs because they want their own stack, not because LLMs are their core business). (6) Nvidia chip desperation, with Huawei chips reasonable for inference but not training.

The most striking observation is on philosophy. "Trying to get Chinese scientists to comment on the coming economic uncertainty fueled by AI, questions beyond the capabilities of simple AGI, or moral debates on how models should behave all served to capture the extreme humility of these scientists." The wiki has been tracking the Marcus agent security study (05-06) and the responsible-ai cluster (alignment, interpretability, safety). Lambert's piece argues, gently, that this entire Western preoccupation with how-models-should-behave does not exist as a research-shaping force in Chinese labs. Their role is to build the best model. The implication for the cere-bro framework is that the responsible-ai topic the wiki was just renamed to track may be a structurally Western preoccupation, not a global one.

Why it matters: Direct connection to today's ResRL paper. ResRL's authors are at Meituan + Chinese Academy of Sciences. Lambert was at Meituan this week. The build-not-buy mentality he documents is exactly the mentality that produces the deep methodological work ResRL represents. The reverse pattern, Western-style benchmark-and-critique papers like LLMs Gaming Verifiers from Kurate (by a German + Japanese team), fits Lambert's framing.

Research angle: Not a paper, but a falsifiable claim worth tracking. Lambert's hypothesis is that Chinese models will continue to look like the U.S. frontier of 3-9 months ago because the cultural difference favors fast-following over 0-to-1 research. If this is wrong, we should see a Chinese lab ship a paper that the U.S. ecosystem cannot reproduce within 6 months. ResRL itself is a candidate (open-source code, but the SVD-projection mechanism is non-obvious). The next-paper signal to watch.

→ Full summary

Industry Pulse

What is happening in AI beyond the lab.

Anthropic 80x growth vs infrastructure (The Decoder, via RSS). Decoder analysis of Anthropic's growth trajectory and the gap with their compute base. Confirms the capacity-crunch thesis from the Pragmatic Engineer piece.
Amazon lifts ban on Claude Code and Codex (Pragmatic Engineer). Amazon had banned external AI coding tools to push internal use of Kiro. Ban now lifted. Implication for the wiki's Anthropic-OpenAI services-companies thread (05-04): the enterprise procurement walls are coming down.
Apple lets slip it uses Claude Code (Pragmatic Engineer). Internal usage disclosure suggests Apple's AI strategy is more model-agnostic than the Apple Intelligence framing implies.
Meta forcefully assigns engineers to data labelling ahead of job cuts (Pragmatic Engineer). 20-40% of engineers in some teams given menial data-labelling work. The wiki has been tracking the agentic-data-scientist thread (Meta FAIR Autodata, 05-06). Net signal, even Meta is data-bottlenecked, and they're throwing humans at it.
DeepL layoffs (250 jobs) (The Decoder, via RSS). The first major translation-AI casualty in the wiki's recent ingest. Notable because DeepL was the canonical pre-LLM AI-product success story.
Claude's "dreaming" feature for agent background processing (The Decoder, via RSS). Anthropic announces background processing for long-running agent tasks. Conceptually adjacent to the silence-tax/SxS frame from today's "When to Think, When to Speak" Deep Dive.
Google DeepMind ↔ EVE Online studio partnership (The Decoder, via RSS). Game-environment-as-RL-environment continues the LWD Fleet RL VLA Policies (05-04) thread.
DeepMind union vote (AI Weekly, via Gmail). Union organizing inside Google DeepMind. First major frontier-lab labor-organizing signal the wiki has tracked. If the vote passes, governance / negotiating-power dynamics inside frontier labs change.
EU pushed AI deadline to 2027 (AI Weekly, via Gmail). The original EU AI Act enforcement deadlines have been pushed back. Regulatory schedule slip + AI capacity crunch is a coupled story: the regulatory window is shifting just as the deployment pressure is peaking.
Marcus on the OpenAI circular financing chart (Marcus on AI, via Gmail). Marcus surfaces a Bloomberg/Information chart of the AI financing circle (compute deals, equity stakes, customer commitments) and notes "they hadn't figured out how OpenAI would pay for it." The chart was already obsolete the day it was published (the xAI-Anthropic deal isn't on it). Tier 2 industry signal: the financing structure is moving faster than reporting can map.
Anthropic DMCA debacle ("You Can't Delete the Internet" via Medium digest, surfaced via Gmail). Anthropic issued a DMCA takedown on 8,100 GitHub repositories, allegedly via a single Korean-issued action. Tier 2 industry. Lateral to today's GitHub reliability crisis and the developer-trust theme.
Linux 7.1: Kicinski's "LLM-pocalypse" — 138,000 lines deleted (Medium digest title, via Gmail). Linux networking maintainer publicly described and removed LLM-generated kernel code at scale. Body not in the wiki yet, but the title alone is a Tier 2 responsible-ai signal. First time a major open-source maintainer has publicly framed LLM contributions as a quality crisis at this scale.
AI Weekly: Anthropic's biggest week of 2026 (AI Weekly, via Gmail). Roundup framing: 80x Q1 revenue + SpaceX compute + DeepMind union vote + EU AI deadline shift. Same week, same direction. Confirms the Anthropic-Colossus capacity story framing.

Connecting the Dots

The most valuable section of the digest. Synthesis nobody else can provide.

The RLVR failure-modes cluster across HF + Kurate + RSS. ResRL today (HF) plus three Kurate-rated underrated papers (LLMs Gaming Verifiers cs.LG #9, The Verification Tax cs.LG #10, AI Scientists Without Reasoning cs.AI #5), with IatroBench (cs.AI #9, pre-registered evidence of iatrogenic harm from AI safety measures) as an adjacent fifth. Four core papers, two ranking signals, one converged claim. RLVR's verifiable rewards are gameable, mode collapse is structural, and the field is iterating on fixes. ResRL is the first concrete fix the wiki has tracked. The Verification Tax + IatroBench are the negative-result counterparts. The next paper to watch, per Worth Watching below, is the first replication or refutation of ResRL's SVD-projection mechanism on a different model family or scale.
Meituan publishes ResRL the same week Lambert visits Meituan. This is the cleanest cross-source synthesis the wiki has produced. ResRL's first author, Zihan Lin, did the work during an internship at Meituan. Lambert documents Meituan's culture and posture (open-weight LLMs from a delivery-services company, build-not-buy on RL data, students-as-peers). The two pieces read as two views of the same lab in the same week. The wiki's value here is naming the connection explicitly, since neither piece by itself surfaces it.
Anthropic-Colossus deal triple-source coverage. Pragmatic Engineer (capacity-crunch cause) + Simon Willison (environmental brand-risk consequence) + Lambert China piece ("Chinese devs are Claude-pilled despite the ban", oblique demand evidence). Three independent journalists, three different vantage points, one converging story. The wiki should treat this as Tier 1 industry not Industry Pulse precisely because of the cross-source convergence.
AI-infrastructure-stress as a category. Two Tier 1 industry stories in one day. One platform: Anthropic (Colossus 1 capacity-crunch). Other platform: GitHub (86% uptime, AI agent-fuelled load spike). Same root cause: AI workloads are outgrowing infrastructure faster than providers can scale. Both stories sourced from Pragmatic Engineer (Gergely Orosz) — he's been tracking this beat for two weeks. Worth Watching: whether a third major AI-infrastructure platform (Vercel, Cloudflare, AWS) reports similar stress within 30 days. If yes, AI-infrastructure-saturation becomes a Tier 1 industry trend, not isolated incidents.
The Netflix Routing gap. "State of Routing in Model Serving" (Netflix Tech Blog, Nipun Kumar et al) is exactly your Tier 1 #1 area but landed only via the Gmail Medium digest (title-level, no body). The pipeline currently surfaces three paper-source feeds (HF, Kurate, RSS) plus social, but Medium production-engineering posts from major industry tech blogs (Netflix, Uber, Airbnb, Pinterest) are not natively farmed. Worth filing as a known coverage gap. Stubbed at wiki/ai-routing/2026-05-08-netflix-state-of-routing-model-serving.md.
Source-curation cross-confirmation. Two of the resources I evaluated and integrated this morning (kurate.org and the awesome-foundation-agents repo) appeared in your retweets within hours, reposted by @robert_lauko and @tom_doerr respectively. Not signal about today's research, signal about the source-set itself. The wiki's daily-digest sources are picking up resources that your network independently surfaces. Worth filing.
Hallucination detection efficiency × disclosure policies. First Token Knows + SxS Interleaved Reasoning are not the same paper, but they are framing the same underlying problem. When does an LLM commit to an answer? First Token Knows says "the first content-bearing token's entropy already encodes the model's confidence in the answer, just read it." SxS says "make the timing of commitment a learned action, not an unmodeled side-effect of single-stream autoregressive generation." Both papers are independently strong. Together they make a stronger claim that the autoregressive-generation default needs to be unbundled, and the bundle is starting to come apart in different directions.

Worth Watching

Specific, falsifiable predictions. Tier 1 / Tier 2 priority.

First lab to replicate ResRL's SVD-projection mechanism beyond 7B class. ResRL's experiments cap at 7B-instruction-tuned. The mechanism is theoretically scale-free (gradient projection works the same at any layer width), but the empirical SVD subspace might be ill-conditioned at much larger scale. 90 days from publication is a reasonable window. If no replication appears by 2026-08-08, the silence is itself a signal.
First major Anthropic enterprise customer to pull due to Colossus environmental coverage. 60 days. The data-center-as-political-issue wave was already cresting. Anthropic just signed with the worst-record data center in the industry. Falsifiable.
Whether the alphaxiv enrichment in this digest measurably improves Deep Dive depth vs the abstract-only baseline. This is the first digest written with the alphaxiv overview integrated for Tier 2 papers. ResRL and OpenSearch-VL Deep Dives drew on alphaxiv overviews (4000 chars each). First Token Knows and When-to-Think did not (no overview yet). Compare the two pairs over the next two weeks for whether the alphaxiv-enriched Deep Dives are systematically more grounded.
Kurate's "LLM-rated underrated" cohort. Four Kurate-top papers missing from HF this week (LLMs Gaming Verifiers, The Verification Tax, AI Scientists Without Reasoning, IatroBench). Track whether any of them surface in HF over the next 1-2 weeks. If yes, Kurate's lead time is in the 7-10 day window. If no, Kurate is identifying papers HF will systematically miss.

Rising authors from Kurate

No authors crossed the threshold this run. The state file is one week old; the rising-author detection requires ≥3 top-10 appearances in a 4-week window. First detections are expected around 2026-05-22.
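
The detection rule can be sketched as a windowed count. This assumes the state file can be read as a list of (author, date) pairs, one per top-10 appearance, and that the 4-week window is measured back from the run date. Both the schema and the function name are hypothetical; the actual state-file format is not shown in this digest.

```python
from collections import Counter
from datetime import date, timedelta


def rising_authors(appearances, today, window_weeks=4, min_hits=3):
    """Authors with at least `min_hits` top-10 appearances in the window.

    appearances: iterable of (author, appearance_date) pairs.
    """
    cutoff = today - timedelta(weeks=window_weeks)
    counts = Counter(
        author for author, seen in appearances if cutoff < seen <= today
    )
    return sorted(a for a, n in counts.items() if n >= min_hits)


# Hypothetical log: author names and dates are placeholders.
log = [
    ("Author A", date(2026, 5, 8)),
    ("Author A", date(2026, 5, 15)),
    ("Author A", date(2026, 5, 22)),
    ("Author B", date(2026, 4, 1)),   # outside the 4-week window
    ("Author B", date(2026, 5, 15)),
    ("Author B", date(2026, 5, 22)),
]
print(rising_authors(log, today=date(2026, 5, 22)))  # ['Author A']
```

With only one week of history on 2026-05-08, no author can have three appearances yet, which matches the empty result this run.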

Quick Hits

MiniCPM-O 4.5: Real-Time Full-Duplex Omni-Modal Interaction. (Paper) Open-weight 8B-class omni-modal model from OpenBMB targeting real-time full-duplex audio + vision + text. Tier 3 for this wiki, but the full-duplex framing (concurrent input+output streams) is the architectural piece that connects to today's SxS disclosure-policy paper. Worth filing for whoever tracks open-multimodal weights.

Rethinking Reasoning-Intensive Retrieval (Yale NLP). (Paper) Argues that current retriever benchmarks (BRIGHT) evaluate retrievers in isolation rather than within the agentic-search loop they actually serve, and current synthetic training corpora encourage single-passage relevance over evidence-portfolio breadth. Tier 2 agentic-systems but inside a narrow sub-area. Already in the wiki at BRIGHT-Pro / RTriever (05-07).

SWE-WebDevBench: Coding Agent Application-Platforms Benchmark. (Paper) Tier 2 agentic-systems benchmark for full-stack web-app generation. Adds to the agent-benchmarks cluster.

CreativityBench: Agent Creative Reasoning via Tool Repurposing. (Paper) Tier 2 agentic-systems. Tests whether agents can use tools for purposes their creators did not intend. Connects laterally to the tool-calling concept page.

xAI deprecation notice for Grok 4.1 Fast (via Simon Willison RSS). Two weeks notice for retiring grok-4-1-fast-reasoning, grok-4-1-fast-non-reasoning, grok-4-fast-reasoning, grok-4-fast-non-reasoning, grok-4-0709, grok-code-fast-1, grok-3, and grok-imagine-image-pro. No migration path to a fast/cheap alternative. Customer-trust signal worth filing alongside the Anthropic dev-hostility coverage.

Sources ingested today: 22 HF papers (May 7) | 17 RSS posts (May 7) | 1 Twitter slot (10 curated retweets, 4 AI-handle tweets) | 40 Kurate ranked papers across cs.AI + cs.LG (weekly) | 7 starred Gmail items (post OAuth-recovery: Pragmatic Engineer GitHub crisis, Marcus circular financing, AI Weekly roundup, Lambert China dup, Medium digest with Netflix Routing + Linux LLM-pocalypse + Anthropic DMCA + Karpathy-wiki articles + Towards Data Science). Wiki pages updated: 7 new summary pages + index + log.