2026-05-18-evening

Summary

Five of the six @bayesiansapien retweets are pure substance and worth reading. The headliner is SP-KV from Meta (Self-Pruned KV Attention), which trains a tiny per-head utility predictor to evict tokens from the persistent KV cache while keeping a local sliding window, claiming a 3 to 10x KV reduction. That sits directly on Tier 1. Two further retweets are field-shaping arguments rather than benchmarks: a Stanford paper using the Data Processing Inequality to argue that under equal reasoning budgets a single LLM beats coordinated multi-agent systems on multi-hop tasks, and dair.ai's "Evaluation Trap" / Epistematics paper claiming most agent leaderboards do not measure what we think they measure. Two older-but-good methodology pieces round out the curated set: the ICML 2025 result that SFT memorizes while RL generalizes, and a grokking-detection method using random matrix theory on weight spectra alone. The AI account feed is mostly noise. One blog worth the click is Vlad Feinberg's "How to land a frontier lab job" reposted by Sholto Douglas. Almost everything from @brivael (15 tweets, mostly French political polemics) and the @nvidia / @magicsilicon Dell-Tech-World promos can be skipped.

Posts

SP-KV: Self-Pruned KV Attention from Meta, 3 to 10x KV cache reduction (@TheTuringPost). For every token and every head, a 2-layer MLP predicts a utility score; old tokens get pruned, while a local sliding window stays fully available for short-range interactions. Hybrid attention, learned eviction, persistent cache. This is the cleanest Tier 1 KV-cache story this week: it composes naturally with Make Each Token Count's eviction policy and the KV-sharing / MHC compressed attention line Raschka surveyed yesterday.
Single LLM beats coordinated multi-agent under equal reasoning budgets (Stanford) (@rohanpaul_ai). A single agent keeps the whole problem inside one chain of thought; a multi-agent system has to slice it into messages, summaries, and handoffs, and every handoff is a compression step. The paper formalizes this via the Data Processing Inequality: once information is dropped at a handoff, downstream agents cannot recover it. Read alongside the LIFE survey on multi-agent collaboration failure and the multi-agent-systems concept page: a coherent argument is now building that the multi-agent default is overrated and orchestrator-monolith design wins on reasoning-bound tasks.
The Evaluation Trap / Epistematics: most agent leaderboards do not measure what you think (@dair_ai · arxiv 2605.14167). The paper introduces an audit procedure that derives evaluation criteria directly from a benchmark's technical capability claim, then checks whether the proposed test discriminates the claim from proxy behaviors that merely correlate with it. The worked example audits Dupoux et al. (2026) and shows that the benchmark reproduces the very assumption it claims to revise. If even half the agent benchmarks fail this audit, a lot of model-selection decisions are downstream of self-confirming proxies.
SFT memorizes, RL generalizes (ICML 2025) (@burkov · arxiv 2501.17161). Comparative study across rule-based textual and visual tasks: RL post-training produces transferable rule-following, SFT does not. Strong empirical companion to GFT: SFT as degenerate RL, which made the theoretical version of the same claim. Worth keeping in mind every time someone proposes SFT-first as a cheap shortcut for capabilities work.
Detecting overfitting during long-horizon grokking using Random Matrix Theory (@burkov · arxiv 2605.12394). The setup is the one practitioners actually face: you have weights, no training history, no test set, no idea whether the model truly generalized or got stuck in a fragile memorizing basin. The authors use RMT spectra of the weight matrices alone to discriminate the two regimes. If it holds up, this is a model-card-grade diagnostic, particularly useful when judging open-weights releases without trusting the lab's evals.
Sholto Douglas reposts "How to land a frontier lab job" by Vlad Feinberg (@_sholtodouglas · vladfeinberg.com). Endorsement from an Anthropic insider with "10/10 no notes." Career-side reading, not research, but high signal for the audience.
Claude Code at scale: best practices for monorepos, legacy systems, microservices (@ClaudeDevs · claude.com blog). Anthropic's own write-up on what works when Claude Code lands in million-line repos. Practitioner reading, useful if you have a real codebase to deploy against rather than a toy.
Grok Build beta first impressions (@brivael). One-paragraph hands-on note: UX is nice, model speed is "genuinely cool," and if hard-task quality lands at or near Opus 4.7 at that speed it would be a real shift in the IDE-agent market. Anecdotal, no benchmarks, but the speed claim is worth watching.
Opaque x.com/i/article reposts (@nyk_builderz via bayesiansapien). Long-form X article, content not fetchable. Click through to read.
NVIDIA at Dell Technologies World — Jensen on stage with Michael Dell (@nvidia keynote · @nvidia AI-and-routine-work clip) (cluster of 2). Keynote and clip both promotional, no new technical claim. Skip.
INTC on the NYSE floor, Lip-Bu Tan on Mad Money (@magicsilicon). Pure stock-promo post. Skip.
Scoble tries a consumer BCI from Mave Health (@Scobleizer). Founder demo of a consumer brain-computer-interface device that "makes your brain better" with a few minutes of daily wear. Consumer hardware pitch, no AI angle. Skip.
@brivael French-language political and meta-Twitter posts (cluster of 14) (thread root). Long sequence of polemics on capitalism, communism, French intellectual history, and Twitter drama. No AI content. Skip.
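
A rough sketch of the eviction rule in the SP-KV post above, for one attention head: a 2-layer MLP scores every token, the most recent tokens stay in the cache unconditionally, and older tokens survive only if their score clears a cutoff. Everything concrete here is an assumption for illustration, not Meta's implementation: the function names, the ReLU and sigmoid choices, the hard `threshold`, and the `window` size are all hypothetical, and the post does not say how the predictor is trained.

```python
import numpy as np

def utility_scores(hidden, w1, b1, w2, b2):
    """Hypothetical 2-layer MLP: hidden (T, d) -> one utility score per token, shape (T,)."""
    h = np.maximum(hidden @ w1 + b1, 0.0)   # ReLU hidden layer (assumed activation)
    logits = h @ w2 + b2                     # one logit per token
    return 1.0 / (1.0 + np.exp(-logits))     # squash to [0, 1] (assumed)

def keep_mask(scores, window, threshold):
    """Keep the last `window` tokens always; keep older tokens only if score >= threshold."""
    num_tokens = scores.shape[0]
    mask = scores >= threshold
    mask[max(0, num_tokens - window):] = True  # local sliding window is never pruned
    return mask

def prune_kv(keys, values, mask):
    """Drop evicted tokens from one head's persistent KV cache."""
    return keys[mask], values[mask]

# Toy usage: 6 cached tokens, head dimension 2, hand-written scores.
scores = np.array([0.9, 0.1, 0.2, 0.8, 0.05, 0.3])
mask = keep_mask(scores, window=2, threshold=0.5)
keys = np.arange(12.0).reshape(6, 2)
values = np.arange(12.0).reshape(6, 2)
kept_keys, kept_values = prune_kv(keys, values, mask)
print(f"kept {int(mask.sum())} of {len(scores)} tokens")
```

The last two tokens survive despite low scores because they sit inside the window; that is the "hybrid" part of the description. The 3 to 10x figure in the post would correspond to how aggressively the learned scores push old tokens under the cutoff, which this toy does not model.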
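
For the Stanford single-LLM post above, the handoff argument can be written with the textbook form of the Data Processing Inequality. The symbols are assumptions chosen here, not the paper's notation: $X$ is the full problem context held by the upstream agent, $M$ is the message or summary it hands off, and $Y$ is whatever the downstream agent produces from $M$ alone.

```latex
% Downstream output depends on the context only through the handoff message:
X \;\rightarrow\; M \;\rightarrow\; Y
% Data Processing Inequality for this Markov chain:
I(X; Y) \;\le\; I(X; M) \;\le\; H(M)
```

No processing of $M$ can raise $I(X;Y)$ above $I(X;M)$, so whatever the summary drops stays lost downstream, and a short message (small $H(M)$) caps how much of the context can get through at all.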
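
To give a feel for what "RMT spectra of the weight matrices alone" can mean in the grokking post above, here is a generic bulk-edge check, assuming nothing beyond standard random matrix theory. It is not the paper's method, which the post does not spell out: it compares a weight matrix's singular values to the largest value expected from pure i.i.d. noise of the same shape and variance (the Marchenko-Pastur edge) and counts the outliers. The function names and the `tol` slack are hypothetical, and this sketch says nothing about which regime (generalizing or memorizing) a given outlier count indicates.

```python
import numpy as np

def mp_upper_edge(shape, sigma):
    """Approximate largest singular value of an n x m matrix of i.i.d. noise with std sigma."""
    n, m = shape
    return sigma * (np.sqrt(n) + np.sqrt(m))

def spectral_outliers(weights, tol=0.1):
    """Count singular values above the noise bulk edge, with `tol` relative slack.

    Returns (number of outliers, edge used). sigma is estimated from the
    entries themselves, which is a crude assumption when structure is strong.
    """
    sigma = weights.std()
    singular_values = np.linalg.svd(weights, compute_uv=False)
    edge = (1.0 + tol) * mp_upper_edge(weights.shape, sigma)
    return int((singular_values > edge).sum()), edge

# Toy usage: pure noise versus noise plus one planted rank-1 direction.
rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 100))
u = rng.normal(size=200)
u /= np.linalg.norm(u)
v = rng.normal(size=100)
v /= np.linalg.norm(v)
spike_strength = 3.0 * mp_upper_edge(noise.shape, 1.0)  # well above the bulk
spiked = noise + spike_strength * np.outer(u, v)

print("noise outliers:", spectral_outliers(noise)[0])
print("spiked outliers:", spectral_outliers(spiked)[0])
```

The appeal matches the post's framing: the check needs only the weights, no training history and no test set. Whether spectral statistics of this kind actually separate grokked from memorizing models is the paper's claim to verify, not something this sketch demonstrates.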