Monday, May 11, 2026 · social stream


daily roll-up

Summary

The day splits into two substantive signals and a long tail of low-value content. The morning slot is anchored by @burkov's retweet of the Sakana Conductor paper: a 7B RL orchestrator that routes between GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro at roughly 3 calls per question and beats every individual frontier model on GPQA-D, LiveCodeBench, and AIME25. The same paper is independently surfaced in today's DAIR.AI weekly Gmail, which makes it cross-source confirmed and the load-bearing item of the day for routing. The evening slot's only real signal is an Apple paper, surfaced via @omarsar0, on moving tool-call evaluation inside the execution loop with a reviewer agent and Helpfulness-Harmfulness metrics. The morning also carries a weak 4-post cluster on Jensen Huang's CMU commencement and six opaque x.com/i/article/ reposts, retweeted by @bayesiansapien, that the pipeline cannot resolve. The afternoon slot is empty.
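For reference, the "cross-source confirmed" label can be read as a simple rule: the same arXiv ID shows up in at least two distinct sources. A minimal sketch under that assumption follows; the `Mention` fields, the source labels, and the threshold of two are all hypothetical and are not taken from the pipeline itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One sighting of a paper in one feed (field names are hypothetical)."""
    arxiv_id: str
    source: str  # illustrative labels such as "x" or "dair_weekly"
    via: str     # handle or newsletter that surfaced the paper


def cross_source_confirmed(mentions, min_sources=2):
    """Return the arXiv IDs seen in at least `min_sources` distinct sources."""
    sources_by_paper = defaultdict(set)
    for m in mentions:
        sources_by_paper[m.arxiv_id].add(m.source)
    return sorted(
        paper for paper, sources in sources_by_paper.items()
        if len(sources) >= min_sources
    )


# The two papers named in today's roll-up, with assumed source labels.
mentions = [
    Mention("2512.04388", "x", "@burkov"),
    Mention("2512.04388", "dair_weekly", "DAIR.AI weekly"),
    Mention("2604.27233", "x", "@omarsar0"),
]
confirmed = cross_source_confirmed(mentions)
print(confirmed)
```

Counting distinct sources rather than raw mentions is the design choice that matters here: several retweets inside one feed would not count as confirmation.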

Posts

Sakana Conductor (cross-source confirmed, HF + DAIR.AI Gmail) (@burkov · arXiv 2512.04388 · ChapterPal summary) [morning]. 7B RL policy orchestrates GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro by writing NL subtasks and assigning workers, beats every individual model on GPQA-D, LiveCodeBench, and AIME25 at ~3 calls per question. See Conductor summary.
In-loop reviewer agent for tool-calling (Apple) (@omarsar0 · arXiv 2604.27233) [evening]. A reviewer agent inspects each provisional tool call before execution and injects feedback; the paper proposes Helpfulness and Harmfulness metrics to measure whether the reviewer fixes more errors than it creates. Reports +5.5% on BFCL irrelevance detection and +7.1% on Tau2-Bench multi-turn. Tracks against tool-calling.
Clarity-of-writing regression in long-form AI text (@MillionInt) [morning]. Argues clarity may have regressed from the o1/o3 era despite intelligence gains. Solo signal, no paper. Worth tracking as a falsifiable claim if anyone benchmarks it.
LLM Wikis + HTML artifacts as a workflow (@omarsar0) [evening]. Personal-workflow opinion piece arguing an LLM wiki is the durable state for agents, with HTML artifacts as the interactive surface. Directionally relevant because cere-bro is exactly this pattern.
Jensen Huang at CMU commencement (cluster of 4) (@nvidia 23:29 UTC, @nvidia 23:25 UTC, @magicsilicon 20:31 UTC, @magicsilicon 19:16 UTC · NVIDIA blog · CMU news) [morning]. Honorary Doctor of Science and Technology, keynote framed AI revolution alongside PC revolution. PR cycle, included as cluster.
Opaque x.com/i/article/ reposts (click through to read) — @neural_avb, @eng_khairallah1, @_avichawla, @akshay_pachaar, @ashwingop, @addyosmani [morning]. All retweeted by @bayesiansapien; pipeline cannot resolve the embedded article IDs. Treated as curated click-through items.
DAIR.AI Vibe Coding Claude Code course (landing page) [evening]. Promo bundled into the Apple-paper tweet. Skip.
Tesla Smart Summon FSD v14.3.2 clip (@Tesla) [evening]. Product demo, no model or training detail. Skip.
@bcherny — Clawd + umeshu (@bcherny) [morning]. Off-topic personal post from the Anthropic feed. Skip.

slot detail

Evening

scraped 2026-05-11 22:00 IST · 3 tweets · 2 curated

Summary

A thin evening slot dominated by one substantive signal: an Apple paper on moving tool-call evaluation inside the execution loop, surfaced via an @omarsar0 repost. The paper proposes a reviewer agent that inspects each provisional tool call before execution and introduces Helpfulness-Harmfulness metrics to quantify whether the reviewer fixes more errors than it creates. The rest of the slot is filler: an older repost on LLM Wikis plus HTML artifacts as a personal workflow primitive, a DAIR.AI course landing page bundled with the Apple tweet, and one Tesla Smart Summon clip that has no AI research content. Read the Apple paper, skip the rest.
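The in-loop idea described above can be sketched as a small control loop. This is an illustration of the pattern only, not the paper's implementation: `run_with_reviewer`, the three callables, the `max_revisions` cap, and the stub weather tool are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict


def run_with_reviewer(
    propose: Callable[[Optional[str]], ToolCall],
    review: Callable[[ToolCall], Optional[str]],
    execute: Callable[[ToolCall], str],
    max_revisions: int = 2,
) -> str:
    """Review a provisional tool call before it executes.

    propose(feedback) returns a call; feedback is None on the first attempt.
    review(call) returns feedback text, or None to approve the call.
    After max_revisions rounds the latest call runs without further review.
    """
    call = propose(None)
    for _ in range(max_revisions):
        feedback = review(call)
        if feedback is None:
            break
        call = propose(feedback)
    return execute(call)


# Stub agent: omits an argument at first, adds it once the reviewer objects.
def propose(feedback):
    if feedback is None:
        return ToolCall("get_weather", {"city": "Pune"})
    return ToolCall("get_weather", {"city": "Pune", "unit": "celsius"})


def review(call):
    return None if "unit" in call.args else "missing argument: unit"


def execute(call):
    return f"{call.name}({sorted(call.args.items())})"


result = run_with_reviewer(propose, review, execute)
print(result)
```

The contrast with post-hoc evaluation is that the reviewer's feedback arrives before `execute` runs, so a flawed call can be revised instead of merely scored afterwards.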

Posts

In-loop reviewer agent for tool-calling (Apple) (@omarsar0 · paper). Moves agent evaluation from post-hoc trajectory analysis to inference-time intervention: a reviewer agent inspects each provisional tool call, injects feedback when it spots an error, and the primary agent revises before the call ships. The authors introduce Helpfulness (percent of base errors corrected) and Harmfulness (percent of correct calls degraded) to make the reviewer-as-net-positive question measurable. Reports +5.5% on BFCL irrelevance detection and +7.1% on Tau2-Bench multi-turn. Worth tracking against tool-calling work.
LLM Wikis + HTML artifacts as a workflow (@omarsar0). Argues that an LLM wiki captures the durable state your agents need, and HTML artifacts on top turn that state into interactive surfaces that both you and the agents can act on. Personal-workflow opinion piece, not a paper; relevant directionally because cere-bro is exactly this pattern.
DAIR.AI Vibe Coding Claude Code course (landing page). Promo for a paid Claude Code course bundled into the Apple-paper tweet. Skip.
Tesla Smart Summon clip (FSD v14.3.2) (@Tesla). Owner video of Smart Summon working in heavy rain, pull-over behavior matching Robotaxi. Product demo, no model or training detail. Skip.
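The Helpfulness and Harmfulness definitions in the reviewer-agent item above reduce to two ratios. A minimal sketch, assuming each tool call is logged as a (correct without reviewer, correct with reviewer) pair; that logging format, the `net_fixed` field, and the sample data are illustrative assumptions, not figures from the paper.

```python
def reviewer_metrics(outcomes):
    """Compute reviewer metrics from (base_ok, final_ok) pairs, one per tool call.

    Helpfulness: percent of base errors that the reviewer corrected.
    Harmfulness: percent of base-correct calls that the reviewer degraded.
    """
    fixed = sum(1 for base_ok, final_ok in outcomes if not base_ok and final_ok)
    broken = sum(1 for base_ok, final_ok in outcomes if base_ok and not final_ok)
    base_errors = sum(1 for base_ok, _ in outcomes if not base_ok)
    base_correct = len(outcomes) - base_errors
    helpfulness = 100.0 * fixed / base_errors if base_errors else 0.0
    harmfulness = 100.0 * broken / base_correct if base_correct else 0.0
    return {
        "helpfulness": helpfulness,
        "harmfulness": harmfulness,
        "net_fixed": fixed - broken,  # positive means more errors fixed than created
    }


# Made-up outcomes: 2 base errors (1 fixed), 4 base-correct calls (1 degraded).
outcomes = [
    (False, True), (False, False),
    (True, True), (True, True), (True, False), (True, True),
]
metrics = reviewer_metrics(outcomes)
print(metrics)
```

The two percentages use different denominators, so a reviewer can look strong on Helpfulness and still be net negative when base-correct calls far outnumber base errors; the raw `net_fixed` count makes that visible.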

Afternoon

scraped 2026-05-11 15:12 IST · 0 tweets

Summary

Empty slot. No curated retweets from @bayesiansapien and no AI-relevant posts from the tracked handles in the past 24h window as of 15:12 IST. The morning slot already absorbed the day's signal, anchored by the Sakana Conductor paper (7B RL orchestrator beats individual frontier models at ~3 calls per question), cross-source confirmed against the DAIR.AI weekly. See the morning synthesis for the full picture, or wait for the evening slot.

Posts

No posts in this slot.

Morning

scraped 2026-05-11 09:00 IST · 13 tweets · 7 curated

Summary

Strongest signal of the slot is @burkov's retweet of the Sakana Conductor paper (7B RL orchestrator beats every individual frontier model at ~3 calls per question). The same paper appears in this morning's DAIR.AI weekly Gmail, which makes it cross-source confirmed and the load-bearing item of the day. A small cluster of 4 posts covers Jensen Huang's CMU commencement and honorary doctorate (@nvidia x2, @magicsilicon x2). The rest of the @bayesiansapien retweet feed is six opaque x.com/i/article/ reposts that the pipeline cannot resolve, framed as click-through-to-read. @MillionInt flags a clarity-of-writing regression in long-form AI text, which is the only non-cluster, non-Conductor signal worth pulling out.
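Flagging the opaque reposts mentioned above amounts to a URL-shape check. A minimal sketch, assuming the only test is host plus path prefix; the function name and the placeholder article ID are hypothetical, and how the real pipeline decides is not stated in this digest.

```python
from urllib.parse import urlparse


def is_opaque_article(url):
    """Return True for x.com/i/article/ links, which carry no readable slug."""
    if "://" not in url:
        url = "https://" + url  # urlparse only fills netloc when a scheme is present
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host == "x.com" and parsed.path.startswith("/i/article/")


links = [
    "https://x.com/i/article/123",       # placeholder ID, not a real article
    "https://arxiv.org/abs/2512.04388",
]
flags = [is_opaque_article(link) for link in links]
print(flags)
```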

Posts

Sakana Conductor (cross-source confirmed) (@burkov via @bayesiansapien · arXiv 2512.04388 · ChapterPal summary). 7B RL policy orchestrates GPT-5, Claude Sonnet 4, Gemini 2.5 Pro by writing NL subtasks and assigning workers, beats every individual model on GPQA-D, LiveCodeBench, AIME25 at ~3 calls per question. Also surfaced in today's DAIR.AI weekly Gmail. See wiki summary.
Jensen Huang at CMU commencement (cluster of 4) (@nvidia 23:29 UTC, @nvidia 23:25 UTC, @magicsilicon 20:31 UTC, @magicsilicon 19:16 UTC · NVIDIA blog · CMU news). Jensen received an honorary Doctor of Science and Technology, gave the keynote, framed AI revolution alongside PC revolution. PR-cycle event, included as cluster.
Clarity-of-writing regression in long-form AI text (@MillionInt). Argues clarity may have regressed from the o1/o3 era despite intelligence gains. Solo signal, no paper attached. Worth tracking as a falsifiable claim if anyone produces a measurement against an OpenAI Writing Quality benchmark.
Six opaque x.com/i/article/ reposts (click through to read) — @neural_avb, @eng_khairallah1, @_avichawla, @akshay_pachaar, @ashwingop, @addyosmani. All retweeted by @bayesiansapien, all link to x.com/i/article/ IDs the pipeline cannot resolve to readable URLs. Treated as curated click-through-to-read items.
@bcherny — Clawd + umeshu (@bcherny). Off-topic personal post from the Anthropic feed. Skip.
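To make the Conductor item above concrete, the orchestration pattern (natural-language subtasks assigned to named workers under a call budget) can be sketched as a toy. This is not the paper's implementation: the plan is hand-written here, whereas the source describes a learned 7B RL policy, and `run_plan`, the worker names, and the stub workers are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Worker = Callable[[str], str]
Plan = List[Tuple[str, str]]  # (worker name, natural-language subtask)


def run_plan(plan: Plan, workers: Dict[str, Worker], max_calls: int = 3) -> List[str]:
    """Run at most max_calls subtasks, passing earlier outputs along as context."""
    outputs: List[str] = []
    for worker_name, subtask in plan[:max_calls]:
        context = "\n".join(outputs)
        prompt = f"{subtask}\n\nPrior results:\n{context}" if context else subtask
        outputs.append(workers[worker_name](prompt))
    return outputs


# Stub workers standing in for real model APIs; each echoes its subtask line.
workers: Dict[str, Worker] = {
    "worker_a": lambda prompt: f"[a] {prompt.splitlines()[0]}",
    "worker_b": lambda prompt: f"[b] {prompt.splitlines()[0]}",
}
plan: Plan = [
    ("worker_a", "Outline a solution approach."),
    ("worker_b", "Check the outline for errors."),
    ("worker_a", "Write the final answer."),
    ("worker_b", "This fourth step exceeds the call budget."),
]
outputs = run_plan(plan, workers)
print(outputs)
```

The `max_calls` default of 3 mirrors the "~3 calls per question" figure quoted in the item; in the paper that number is reported as an observed average, not as a hard cap.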