Monday, May 18, 2026 · social stream

Media Live

daily roll-up

Summary

The day was clearly evening-heavy. The morning slot was empty (zero retweets, zero articles, the second consecutive Monday like that) and the afternoon offered only one Tier 1 nugget. The evening slot carried the substance: five of six @bayesiansapien retweets are worth reading, anchored by Meta's SP-KV (Self-Pruned KV Attention), which trains a per-head utility predictor for KV eviction and claims 3 to 10x cache reduction. Two field-shaping arguments land alongside it: a Stanford Data Processing Inequality paper arguing a single LLM beats coordinated multi-agent systems under equal reasoning budgets, and dair.ai's Epistematics paper claiming most agent leaderboards do not measure what they advertise. The afternoon's single nugget is Atlas Inference clocking Qwen3.6-35B at 200+ tok/s on a DGX Spark (GB10), roughly 3x what Codex and Claude pipelines hit on the same hardware class. Apart from one Grok Build hands-on note, everything from @brivael (30+ tweets across slots, mostly French-language polemics) is noise, as is the @nvidia / @magicsilicon Dell-Tech-World promo cluster.

Posts

SP-KV: Self-Pruned KV Attention from Meta, 3 to 10x KV cache reduction (@TheTuringPost) [evening]. Per-token per-head 2-layer MLP predicts utility, old tokens are pruned while a local sliding window stays full. Composes with Make Each Token Count's eviction policy and the KV-sharing / MHC line Raschka surveyed yesterday. Cleanest Tier 1 KV-cache story of the week.
Single LLM beats coordinated multi-agent under equal reasoning budgets (Stanford) (@rohanpaul_ai) [evening]. Formalizes the handoff-as-compression argument via the Data Processing Inequality. Reads as a coherent counter to the multi-agent default, alongside the LIFE survey on multi-agent collaboration failure and the multi-agent-systems concept page.
The Evaluation Trap / Epistematics: most agent leaderboards do not measure what you think (@dair_ai · arxiv 2605.14167) [evening]. Audit procedure that derives evaluation criteria from a benchmark's capability claim and checks whether the test discriminates the claim from proxy behaviors. Worked example shows Dupoux et al. (2026) reproducing the assumption it claims to revise.
SFT memorizes, RL generalizes (ICML 2025) (@burkov · arxiv 2501.17161) [evening]. Comparative study across rule-based textual and visual tasks. Empirical companion to GFT: SFT as degenerate RL, the theoretical version of the same claim.
Detecting overfitting during long-horizon grokking via Random Matrix Theory (@burkov · arxiv 2605.12394) [evening]. RMT spectra of weight matrices alone discriminate generalizing vs memorizing basins. Practitioner setting with no training history, no test set. Model-card-grade diagnostic if it holds up.
Atlas Inference clocks Qwen3.6-35B at 200+ tok/s on DGX Spark (GB10) (@Scobleizer reposting @AtlasInference) [afternoon]. Claim is roughly 3x Codex / Claude on the same hardware class. No paper or kernel detail in the post. Worth a follow-up if Atlas publishes methodology.
Hermes Agent Kanban: orchestrator auto-decomposition on triage (@Scobleizer · docs · PR #27572) [afternoon]. Orchestrator decomposes a triage prompt into subtasks and routes by specialization description, durable board in ~/.hermes/kanban.db, every worker an OS process. Adjacent to Claude Code vs Hermes permissions coverage.
Sholto Douglas reposts "How to land a frontier lab job" by Vlad Feinberg (@_sholtodouglas · vladfeinberg.com) [evening]. Anthropic-insider endorsement, career-side reading rather than research, high signal for the audience.
Claude Code at scale: best practices for monorepos, legacy systems, microservices (@ClaudeDevs · claude.com blog) [evening]. Anthropic's own write-up for million-line repos. Practitioner reading.
Grok Build beta first impressions (@brivael) [evening]. Hands-on note: speed is "genuinely cool", and if hard-task quality lands near Opus 4.7 at that speed it would be a real shift in the IDE-agent market. Anecdotal, no benchmarks.
Opaque x.com/i/article reposts (@nyk_builderz via bayesiansapien) [evening]. Content not fetchable. Click through to read.
@MillionInt productivity / math aphorisms (cluster of 2, @MillionInt) [morning]. Inspirational, no AI content. Skip.
@brivael French-language polemics on AI meritocracy, copycats, politics, Twitter drama (cluster of 30 across slots, @brivael) [afternoon + evening]. No links to research, no falsifiable claim. Skip.
Scoble personal feed and consumer biometric / BCI plugs (cluster of 6, @Scobleizer) [afternoon + evening]. Bill Gates time-value, Big Sur sunset, "AI is taking my job" essay, globaledentity.com vein-and-skeletal biometrics, Mave Health consumer BCI. No technical content. Skip.
NVIDIA at Dell Technologies World, Jensen on stage with Michael Dell (@nvidia keynote · @nvidia AI-and-routine-work clip) (cluster of 2) [evening]. Promotional. Skip.
INTC on the NYSE floor, Lip-Bu Tan on Mad Money (@magicsilicon) [evening]. Stock promo. Skip.
@BrettRatner Instagram reel (@BrettRatner) [afternoon]. Opaque link, no preview. Click through to read.
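
For readers who want the inequality behind the Stanford item above, here is the standard statement of the Data Processing Inequality applied to a chain of handoffs. The notation (X for the full problem context, M_i for the message at the i-th handoff) is an illustrative assumption, not taken from the paper, and the entropy bound assumes a discrete X.

```latex
% Illustrative notation (assumed, not the paper's):
%   X   = full problem context held by the first agent
%   M_i = message or summary passed at the i-th handoff
% If each handoff is produced only from the previous message,
% X -> M_1 -> ... -> M_k forms a Markov chain and the DPI applies.
\[
  X \to M_1 \to M_2 \to \cdots \to M_k
  \quad\Longrightarrow\quad
  I(X; M_k) \le I(X; M_{k-1}) \le \cdots \le I(X; M_1) \le H(X)
\]
```

Read this way, no downstream agent can hold more information about the original problem than the narrowest handoff upstream of it. A single model reasoning over X directly never passes through that bottleneck.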

slot detail

Evening

scraped 2026-05-18 22:00 IST · 27 tweets · 6 curated

Summary

Five of the six @bayesiansapien retweets are pure substance and worth reading. The headliner is SP-KV from Meta (Self-Pruned KV Attention), which trains a tiny per-head utility predictor to evict tokens from the persistent KV cache while keeping a local sliding window, claiming 3 to 10x KV reduction. That sits directly on Tier 1. Two further retweets are field-shaping arguments rather than benchmarks: a Stanford paper using the Data Processing Inequality to argue that under equal reasoning budgets a single LLM beats coordinated multi-agent systems on multi-hop tasks, and dair.ai's "Evaluation Trap" / Epistematics paper claiming most agent leaderboards do not measure what we think they measure. Two good methodology pieces round out the curated set: the ICML 2025 result that SFT memorizes while RL generalizes, and a grokking-detection method using random matrix theory on weight spectra alone. The AI account feed is mostly noise. One blog worth the click is Vlad Feinberg's "How to land a frontier lab job", reposted by Sholto Douglas. Apart from the Grok Build note, everything from @brivael (15 tweets, mostly French political polemics) is a skip, as are the @nvidia / @magicsilicon Dell-Tech-World promos.

Posts

SP-KV: Self-Pruned KV Attention from Meta, 3 to 10x KV cache reduction (@TheTuringPost). For every token and every head, a 2-layer MLP predicts a utility score; old tokens get pruned, while a local sliding window stays fully available for short-range interactions. Hybrid attention, learned eviction, persistent cache. This is the cleanest Tier 1 KV-cache story this week: it composes naturally with Make Each Token Count's eviction policy and the KV-sharing / MHC compressed attention line Raschka surveyed yesterday.
Single LLM beats coordinated multi-agent under equal reasoning budgets (Stanford) (@rohanpaul_ai). A single agent keeps the whole problem inside one chain of thought; a multi-agent system has to slice it into messages, summaries, and handoffs, and every handoff is a compression step. The paper formalizes this via the Data Processing Inequality: once information is dropped at a handoff, downstream agents cannot recover it. Read alongside the LIFE survey on multi-agent collaboration failure and the multi-agent-systems concept page — a coherent argument is now building that the multi-agent default is overrated and orchestrator-monolith design wins on reasoning-bound tasks.
The Evaluation Trap / Epistematics: most agent leaderboards do not measure what you think (@dair_ai · arxiv 2605.14167). The paper introduces an audit procedure that derives evaluation criteria directly from a benchmark's technical capability claim, then checks whether the proposed test discriminates the claim from proxy behaviors that merely correlate with it. Worked example audits Dupoux et al. (2026) and shows the benchmark reproduces the very assumption it claims to revise. If even half the agent benchmarks fail this audit, a lot of model-selection decisions are downstream of self-confirming proxies.
SFT memorizes, RL generalizes (ICML 2025) (@burkov · arxiv 2501.17161). Comparative study across rule-based textual and visual tasks: RL post-training produces transferable rule-following, SFT does not. Strong empirical companion to GFT: SFT as degenerate RL, which made the theoretical version of the same claim. Worth keeping in mind every time someone proposes SFT-first as a cheap shortcut for capabilities work.
Detecting overfitting during long-horizon grokking using Random Matrix Theory (@burkov · arxiv 2605.12394). The setup is the one practitioners actually face: you have weights, no training history, no test set, no idea whether the model truly generalized or got stuck in a fragile memorizing basin. The authors use RMT spectra of the weight matrices alone to discriminate the two regimes. If it holds up, this is a model-card-grade diagnostic — particularly useful when judging open-weights releases without trusting the lab's evals.
Sholto Douglas reposts "How to land a frontier lab job" by Vlad Feinberg (@_sholtodouglas · vladfeinberg.com). Endorsement from an Anthropic insider with "10/10 no notes." Career-side reading, not research, but high signal for the audience.
Claude Code at scale: best practices for monorepos, legacy systems, microservices (@ClaudeDevs · claude.com blog). Anthropic's own write-up on what works when Claude Code lands in million-line repos. Practitioner reading, useful if you have a real codebase to deploy against rather than a toy.
Grok Build beta first impressions (@brivael). One-paragraph hands-on note: UX is nice, model speed is "genuinely cool," and if hard-task quality lands at or near Opus 4.7 at that speed it would be a real shift in the IDE-agent market. Anecdotal, no benchmarks, but the speed claim is worth watching.
Opaque x.com/i/article reposts (@nyk_builderz via bayesiansapien). Long-form X article, content not fetchable. Click through to read.
NVIDIA at Dell Technologies World — Jensen on stage with Michael Dell (@nvidia keynote · @nvidia AI-and-routine-work clip) (cluster of 2). Keynote and clip both promotional, no new technical claim. Skip.
INTC on the NYSE floor, Lip-Bu Tan on Mad Money (@magicsilicon). Pure stock-promo post. Skip.
Scoble tries a consumer BCI from Mave Health (@Scobleizer). Founder demo of a consumer brain-computer-interface device that "makes your brain better" with a few minutes of daily wear. Consumer hardware pitch, no AI angle. Skip.
@brivael French-language political and meta-Twitter posts (cluster of 14) (thread root). Long sequence of polemics on capitalism, communism, French intellectual history, and Twitter drama. No AI content. Skip.
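
The SP-KV item above describes a per-token, per-head utility predictor combined with an always-kept sliding window. The sketch below is a minimal illustration of that idea for a single head, not Meta's implementation. Several details are assumptions made for the example: the MLP reads the cached key vector, the hidden layer uses ReLU, the output is a sigmoid, eviction uses a fixed threshold, and the weights are random rather than trained. All function names are hypothetical.

```python
import math
import random


def mlp_utility(x, w1, b1, w2, b2):
    """Tiny 2-layer MLP (assumed shape): ReLU hidden layer, sigmoid output in (0, 1)."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(h * w for h, w in zip(hidden, w2)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def keep_mask(scores, window, threshold):
    """Keep the last `window` tokens unconditionally; keep older ones only if score >= threshold."""
    n = len(scores)
    return [i >= n - window or s >= threshold for i, s in enumerate(scores)]


def prune_head_cache(keys, values, params, window=4, threshold=0.5):
    """Score every cached token for one head and drop low-utility tokens outside the window."""
    scores = [mlp_utility(k, *params) for k in keys]
    mask = keep_mask(scores, window, threshold)
    kept_keys = [k for k, m in zip(keys, mask) if m]
    kept_values = [v for v, m in zip(values, mask) if m]
    return kept_keys, kept_values, mask


# Toy data: random keys/values and random (untrained) predictor weights.
random.seed(0)
seq_len, head_dim, hidden_dim = 16, 8, 4
keys = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]
values = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(seq_len)]
params = (
    [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(hidden_dim)],  # w1
    [0.0] * hidden_dim,                                                          # b1
    [random.gauss(0, 1) for _ in range(hidden_dim)],                             # w2
    0.0,                                                                         # b2
)

kept_keys, kept_values, mask = prune_head_cache(keys, values, params)
print(f"kept {sum(mask)} of {seq_len} cached tokens for this head")
```

In a real system the predictor would be trained and run once per head, so different heads could retain different subsets of old tokens. The reported 3 to 10x reduction depends on details of that training and on the window size, neither of which is specified in the post.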

Afternoon

scraped 2026-05-18 15:00 IST · 25 tweets

Summary

Low-signal slot: two items worth attention are buried in a long tail of personal and French-language opinion threads. The single Tier 1 nugget is Scoble flagging an Atlas Inference benchmark of Qwen3.6-35B at 200+ tok/s on an NVIDIA DGX Spark (GB10), roughly 3x what Codex and Claude pipelines hit on the same class of box; if the number holds it is a real GPU-inference story. Scoble also surfaced the new Hermes Agent Kanban release where the orchestrator auto-decomposes a triage prompt into subtasks and routes them to typed agent profiles, which is the kind of multi-agent board pattern worth tracking. Everything else is filler: a cluster of 16 @brivael posts is a French-language meta-fight about AI, copycats, and intellectual authority with no technical content, and the rest of Scoble's feed is personal essays and a friend's identity-verification startup pitch.

Posts

Atlas Inference clocks Qwen3.6-35B at 200+ tok/s on DGX Spark (GB10) (@Scobleizer reposting @AtlasInference). Claim is roughly 3x what Codex and Claude get on the same hardware class. No paper, no kernel detail, no methodology in the post itself. Worth a follow-up read if Atlas publishes anything substantive on what they are doing differently on GB10.
Hermes Agent Kanban: orchestrator auto-decomposition on triage (@Scobleizer reposting Teknium · docs · PR #27572). Drop one prompt into triage, orchestrator agent decomposes it into subtasks and assigns each to a named agent profile based on specialization descriptions. Durable task board in ~/.hermes/kanban.db, every worker is its own OS process, agents drive via kanban_* tools. Adjacent to prior Claude-vs-Hermes coverage in agentic-systems.
@brivael French-language self-promotion and AI-meritocracy thread (cluster of 16, @brivael). Continuous self-replies and quote-tweets defending himself against copycat critiques, arguing AI is a meritocracy amplifier where talent x effort now gets multiplied by AI leverage, plus a side argument that French intellectuals are about to get eaten by builders. No links to research, no falsifiable claim. Skip.
Scoble personal feed: Bill Gates time-value, Big Sur sunset, "AI is taking my job" essay (cluster of 4, @Scobleizer). Anecdote-heavy reflections, no technical content. Skip.
Scoble plugs friend's vein-and-skeletal biometric startup globaledentity.com (@Scobleizer · globaledentity.com). TSA airport security application, multi-factor identity via vein and skeletal scans. Adjacent to identity and biometric AI rather than core wiki topics. Skip.
@BrettRatner Instagram reel (@BrettRatner). Opaque Instagram link with no preview text. Click through to read.

Morning

scraped 2026-05-18 09:12 IST · 2 tweets

Source: raw/twitter/2026-05-18-morning.json, raw/twitter/2026-05-18-morning.md
Scraped: 2026-05-18 09:12 IST | 24h lookback | 2 tweets | 0 retweets | 0 articles

Summary

The morning slot is the quietest of the week and the second consecutive Monday-morning slot with zero curated retweets and zero attached articles. The two tweets that did surface are inspirational aphorisms from @MillionInt with no AI-research content. There is no substance to synthesize.

If the pattern continues for a third week, it is worth raising as a config question: either the lookback window for @bayesiansapien's retweets needs to extend through the weekend on Monday mornings, or the Monday morning slot is genuinely low-signal and should be deprioritised in the cron schedule.

Signal review

@bayesiansapien retweets: none in the past 24h.
AI handle feed: 2 tweets, both from @MillionInt (Core Automation). Neither is on an AI topic the wiki tracks.
- "Mathematics is a science of transforming objects you're working with until the answer to the question you're trying to answer becomes trivial" (tweet)
- "Best productivity hack I know is organizing your work so that you enjoy it the most" (tweet)
Articles captured: 0.
Images captured: 0.

Action

None for today. Roll this slot's signal forward to the afternoon slot if curated retweets appear later in the day. Note in the Industry Pulse section of the 2026-05-18 daily digest that the Twitter morning slot was empty for the second consecutive Monday.