Tuesday, May 12, 2026 · social stream

Media Live

daily roll-up

Summary

The day's strongest cross-slot cluster is Claude Code's internal architecture. The morning surfaced @bcherny flagging the launch of the Claude Code agent view (a single list of all in-flight sessions), the afternoon carried a public 5-layer field guide thread, and the evening closed with Gary Marcus reading Claude Code's 53 symbolic tools and ~500k lines of scaffolding as vindication for neurosymbolic AI. Three independent angles on the same codebase in one day is a real signal, not noise. The standout single-slot post is AutoTTS in the afternoon, with both the authors' thread and an analytical writeup arriving within a few hours of each other: $39.9 and ~160 minutes of agent search beat hand-crafted test-time scaling baselines. The wiki already has a page on it from yesterday, so this is cross-source confirmation. The rest of the afternoon is unusually dense for one slot, with PwC's clarification-timing paper, the tool-calling steerability probe, Thinking Machines' always-on Interaction Models, and the Microsoft + Salesforce 200K-conversation drift study all landing in the same window. Morning and evening are otherwise thin, with the @bcherny Cowork-books-flights cluster as the only other workflow-grade post, plus AWS Claude Platform GA in industry news. Several opaque x.com/i/article reposts are listed as bare links because the synthesis cannot expand them inline.

Posts

Claude Code agent view (research preview) (@bcherny · @claudeai launch) [morning]. Unified list of all in-flight Claude Code sessions instead of cycling between terminal tabs. Productizes the many-agents-per-user pattern.
Claude Code's 5 architectural layers (@NainsiDwiv50980 · wiki) [afternoon]. Field guide thread on CLAUDE.md as memory layer plus four further layers that go beyond prompting. Clean public summary of material already in the wiki's Claude Code pages.
Gary Marcus: Claude Code is the most neurosymbolic system he has ever seen (@GaryMarcus · ccunpacked.dev · wiki) [evening]. Reads Claude Code's 53 tools plus ~500k lines of orchestration around a frontier LLM as proof that progress is coming from classical-AI scaffolding, not pure scaling. The linked site is a source-level dissection of the agent loop and tool registry.
@bcherny Cowork + Opus 4.7 one-shots 8 flights and 5 hotels (cluster of 2) (@bcherny a, @bcherny b) [morning]. Flight preferences go into Cowork instructions, Opus opens a browser, navigates sites, books everything in parallel while the user does other Claude Code work. Frontier-agent browser use is crossing into real workflows.
AutoTTS, frontier LLMs design their own test-time scaling (cluster of 2) (@zhengtoong, @omarsar0 · arxiv · wiki) [afternoon]. Environment-driven discovery framework: humans design the search environment, coding agents discover the width-depth TTS controller. Discovery cost $39.9 and ~160 minutes, results generalize across held-out benchmarks and model scales. Second day of independent signal.
Clarification timing in long-horizon agents (PwC) (@dair_ai · arxiv) [afternoon]. Forced-injection framework across 4 frontier models, 84 tasks, 6,000+ runs. Goal clarification loses almost all value after 10% of execution. Deferring past mid-trajectory is worse than never asking. Empirical brake on the "always ask early" prior.
Tool calling is linearly readable and steerable (@tldr_ai_papers · arxiv · wiki) [afternoon]. Probes 12 instruction-tuned models (270M to 27B). Adding the mean activation difference between two tools flips the chosen tool with 77-100% accuracy and the JSON arguments autoregressively conform to the new schema. Small set of mid- and late-layer attention heads localized via patching.
Thinking Machines Interaction Models (TML-Interaction-Small) (@rohanpaul_ai · blog) [afternoon]. 276B MoE, 12B active. Replaces walkie-talkie turn-taking with always-present AI: audio, video, and text sliced into 200ms micro-turns, model listens, watches, speaks, acts, and tool-calls while the interaction is still happening. Trained from scratch with a multi-stream micro-turn design.
RAO: Recursive Agent Optimization (@apurvasgandhi) [afternoon]. End-to-end RL for training LLMs to spawn, delegate to, and coordinate with recursive copies of themselves. Sub-agents as inference-time scaling primitives. Adjacent to the Sakana Conductor and AutoTTS thread on learned orchestration.
Anthropic memory plus "Dreaming" continual-learning preview (@daniel_mac8) [afternoon]. Reports on a recent talk framing memory as the next first-class agent primitive after MCP, Skills, and harnesses: writable shared context, provenance, review, background consolidation. "Dreaming" is described as recursive self-improvement at the agent-system level.
GEPA explainer on long-horizon agent RL (@blc_16) [afternoon]. Walkthrough of why sparse rewards throw away trajectory information and how GEPA learns from the trajectory itself via textual critiques, prompt edits, and Pareto-frontier selection.
Microsoft + Salesforce: 200K conversations, 39% average accuracy degradation (@HowToAI_) [afternoon]. ChatGPT 96.6% to 72.6%, Gemini 97.4% to 68.1% as conversations lengthen. Attributed to an anchoring trap. Mechanism overlaps with the PwC clarification-timing paper in the same slot.
Curved geometry of LLM activations (@che_shr_cat) [afternoon]. Argues the Linear Representation Hypothesis is a useful lie that breaks down fast: straight-line steering produces teleportation and diversity collapse. Conceptually opposite to the tool-steerability probe in the same slot.
Nature Neuroscience: brains do not predict every word uniformly (@ValerioCapraro) [afternoon]. Zou, Poeppel, Ding: brain activity tracks word surprisal LLM-style inside phrases but the match weakens across major phrase boundaries. Counterweight to the "humans are just next-word predictors" frame.
Claude Platform on AWS GA (@mattsgarman · AWS blog) [morning]. Anthropic's native Claude Platform, including Managed Agents, Agent Skills, MCP connector, code execution, and files API, accessible directly from AWS accounts. AWS is the first cloud provider to offer it natively. Also in today's Industry Pulse.
NVIDIA at Dell Technologies World (cluster of 2) (@nvidia a, @nvidia b · event) [morning]. Jensen Huang and Michael Dell co-keynote on AI-accelerated enterprise compute, May 18-21 Las Vegas. PR-cycle event.
Opaque x.com/i/article reposts (click through) (@AmarSVS, @AlphaSignalAI, @neural_avb, @ns123abc) [afternoon + evening]. Bare X-native long-form article links the synthesis cannot expand inline.
@magicsilicon "Whoa" (@magicsilicon) [afternoon]. Reaction post, no content. Skip.
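
A quick arithmetic check on the Microsoft + Salesforce entry above, as a minimal sketch. It only restates the two cited before/after pairs as point drops and relative drops; the post does not say how the 39% average was computed, so the comparison against that figure is an assumption of this note, and the helper name `drop` is made up for illustration.

```python
# Accuracy pairs as cited in the digest entry: (short conversation, long conversation).
cited = {"ChatGPT": (96.6, 72.6), "Gemini": (97.4, 68.1)}


def drop(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point drop, relative drop in percent), rounded to 1 decimal."""
    points = before - after
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


drops = {name: drop(before, after) for name, (before, after) in cited.items()}

for name, (points, relative) in drops.items():
    print(f"{name}: -{points} points, -{relative}% relative")
```

Both cited pairs come out below 39% on either measure (roughly 24.8% and 30.1% relative), so the headline average presumably covers more models or tasks than the two quoted here. That is an inference from the arithmetic, not something the post states.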

slot detail

Evening

scraped 2026-05-12 22:00 IST · 2 tweets · 1 curated

Summary

A very thin evening slot with one signal post and one opaque link. Gary Marcus reposted (via @bayesiansapien) a victory-lap take on Claude Code: 53 symbolic tools, ~500k lines of symbolic code wrapped around an LLM, which he frames as vindication for neurosymbolic AI rather than pure-LLM scaling. The pointer is to ccunpacked.dev, a source-level walkthrough of Claude Code's agent loop, tool system, and multi-agent orchestration. Everything else in the slot is one undecodable x.com article link.

Posts

Gary Marcus: Claude Code is the most neurosymbolic system he has ever seen (@GaryMarcus · ccunpacked.dev). Marcus reads Claude Code's 53 tools plus ~500k lines of orchestration code around a frontier LLM as proof that progress is coming from borrowing classical-AI and CS scaffolding, not from scaling LLMs alone. The linked site is a source-level dissection of the agent loop, tool registry, and unreleased features. Worth a click for the architecture explorer alone. See Claude Code architecture (04-17) and the 04-19 follow-up for prior wiki notes on the same codebase.
Opaque article repost (@ns123abc). Bare x.com/i/article/ link with no extracted content. Click through to read.

Afternoon

scraped 2026-05-12 15:00 IST · 16 tweets · 15 curated

Summary

Dense afternoon, almost entirely from @bayesiansapien curated reposts (15 RTs, only one weak post from the AI handle feed). The single strongest signal is AutoTTS (cluster of 2): @zhengtoong's authors' thread plus @omarsar0's analytical writeup of the same paper, both framing it as the moment where humans stop hand-tuning test-time scaling and instead build environments where LLM agents discover the strategy ($39.9, ~160 min to beat hand-crafted baselines). The wiki already has a page on it from yesterday, so this is cross-source confirmation rather than a new signal. Beyond AutoTTS, the slot carries a Thinking Machines release of "interaction models" (TML-Interaction-Small, 276B MoE, 12B active, micro-turn always-on conversation), a PwC paper that breaks the "ask early" intuition for agent clarification (goal-clarifications lose nearly all value past 10% of trajectory), a steerability result showing tool-calling decisions are linearly readable in 12 instruction-tuned models, an Anthropic talk teasing memory + a "Dreaming" feature as the next first-class primitive, and a Microsoft/Salesforce 200K-conversation analysis claiming an average 39% accuracy drop as conversations lengthen. Three reposts are opaque x.com/i/article/ links the reader has to click through to read.
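
The PwC timing result can be read as a simple gating rule, sketched below. The two thresholds are the ones reported in this slot's PwC entry (goal clarification loses nearly all value after 10% of execution, input clarification holds through about 50%); the function name, the hard cut-offs, and the idea of gating on a progress fraction are illustrative assumptions, not the paper's method.

```python
# Thresholds taken from the digest's summary of the PwC paper; treating them
# as hard cut-offs is a simplification made for this sketch.
GOAL_WINDOW = 0.10   # goal clarification: value mostly gone after ~10% of execution
INPUT_WINDOW = 0.50  # input clarification: holds through roughly 50%

WINDOWS = {"goal": GOAL_WINDOW, "input": INPUT_WINDOW}


def worth_asking(kind: str, progress: float) -> bool:
    """Return True if a clarification of this kind still falls inside its window.

    kind:     "goal" or "input"
    progress: fraction of the trajectory already executed, in [0.0, 1.0]
    """
    if kind not in WINDOWS:
        raise ValueError(f"unknown clarification kind: {kind!r}")
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0.0 and 1.0")
    return progress <= WINDOWS[kind]


print(worth_asking("goal", 0.05))   # inside the goal window
print(worth_asking("goal", 0.30))   # past the goal window
print(worth_asking("input", 0.30))  # still inside the input window
```

The point of the sketch is that the window depends on what is being clarified: a single "ask early" rule is too coarse once goal and input clarifications are separated.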

Posts

AutoTTS — frontier LLMs design their own test-time scaling strategies (cluster of 2) (@zhengtoong 01:38 UTC, @omarsar0 23:19 UTC · arxiv · wiki). Environment-driven discovery framework where humans design the search environment and coding agents discover the width-depth TTS controller. Total discovery cost $39.9 and ~160 minutes, results generalize to held-out benchmarks and model scales. Second day of independent signal on this paper.
Clarification timing in long-horizon agents (PwC) (@dair_ai · arxiv). Forced-injection framework across 4 frontier models, 84 task variants, 6,000+ runs. Goal clarification loses nearly all value after 10% of execution (pass@3 drops from 0.78 to baseline); input clarification holds through ~50%; deferring past mid-trajectory is worse than never asking. No current frontier model asks inside the empirically optimal window. Empirical brake on the "always ask early" prior.
Tool calling is linearly readable and steerable (@tldr_ai_papers · arxiv). Probes 12 instruction-tuned models (Gemma 3, Qwen 3, Qwen 2.5, Llama 3.1, 270M to 27B). Adding the difference between two tools' mean activations flips the chosen tool with 77-100% accuracy on name-only prompts (93-100% at 4B+), and the JSON arguments autoregressively conform to the new tool's schema. Small set of mid- and late-layer attention heads localized via patching. Mechanistic handle on tool-selection failure modes, directly relevant to tool calling.
Thinking Machines — Interaction Models (TML-Interaction-Small) (@rohanpaul_ai · blog). 276B MoE, 12B active. Replaces walkie-talkie turn-taking with always-present AI: audio/video/text sliced into 200ms micro-turns, model listens-watches-speaks-acts-tool-calls while the interaction is still happening. Trained from scratch with a multi-stream micro-turn design. First production research preview of "interactivity scales alongside intelligence" as a thesis.
RAO: Recursive Agent Optimization (@apurvasgandhi). End-to-end RL for training LLMs to spawn, delegate to, and coordinate with recursive copies of themselves. Frames sub-agents as an inference-time scaling primitive (working memory, parallel decomposition) and the question as how to train the parent to exploit them. Closely adjacent to the Sakana Conductor + AutoTTS thread on learned orchestration.
Anthropic memory + "Dreaming" (continual learning preview) (@daniel_mac8). Reports on a recent Anthropic talk framing memory as the next first-class agent primitive after MCP, Skills, and harnesses: writable shared context, provenance, review, background consolidation. "Dreaming" is described as recursive self-improvement at the agent-system level, an early form of continual learning that becomes load-bearing once infinite context arrives.
GEPA explainer — RL struggles with long-horizon agents (@blc_16). Walkthrough of why sparse rewards throw away trajectory information and how GEPA learns from the trajectory itself via textual critiques, prompt edits, and Pareto-frontier selection across exploration and exploitation. Useful framing of the prompt-optimizer-as-credit-assigner direction.
Microsoft + Salesforce — 200K conversations, 39% average degradation (@HowToAI_). Cited numbers: ChatGPT 96.6% to 72.6%, Gemini 97.4% to 68.1% as conversations lengthen. Attributed to an "anchoring trap": models commit to wrong assumptions early and cannot recover. Popular framing of the multi-turn drift problem; the underlying mechanism overlaps with the PwC clarification-timing paper in the same slot.
Curved geometry of LLM activations (@che_shr_cat). Argues the Linear Representation Hypothesis is a useful lie that breaks down fast: straight-line steering in flat Euclidean space produces "teleportation" and diversity collapse, and the real geometry is curved. Conceptually adjacent to the tool-steerability paper above but pointing in the opposite direction on whether linear edits suffice.
Nature Neuroscience — brains do not predict every word uniformly (@ValerioCapraro). Zou, Poeppel, Ding: brain activity tracks word surprisal LLM-style inside phrases but the match weakens across major phrase boundaries. Prediction is constrained by linguistic structure. A counterweight to the "humans are just next-word predictors" frame.
Claude Code's 5 architectural layers (@NainsiDwiv50980). Field guide thread covering CLAUDE.md as memory layer, plus four further layers that have nothing to do with prompting. Most of the substance is already captured in the wiki's Claude Code architecture pages, but the framing is a clean public summary.
Opaque x.com/i/article reposts (click through to read) (@AmarSVS, @AlphaSignalAI, @neural_avb). Curated retweets pointing to X-native long-form articles the synthesis cannot expand inline.
@magicsilicon "Whoa 😳" (@magicsilicon). Reaction post, no content. Skip.
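
The mean-difference steering described in the tool-calling entry above can be illustrated with a toy example. This is a minimal sketch under loud assumptions: the 3-dimensional activations, the tool names, and the dot-product `readout` are all invented stand-ins, and nothing here is the paper's code or a real model's internals. It only shows the mechanics of "add mean(target) minus mean(source) to an activation and the linear readout flips".

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(acts_source, acts_target):
    """Difference of class means: mean(target) - mean(source)."""
    mu_source = mean_vector(acts_source)
    mu_target = mean_vector(acts_target)
    return [t - s for s, t in zip(mu_source, mu_target)]


def readout(activation, tool_directions):
    """Toy linear readout: pick the tool whose direction has the largest dot product."""
    scores = {
        name: sum(a * d for a, d in zip(activation, direction))
        for name, direction in tool_directions.items()
    }
    return max(scores, key=scores.get)


# Hypothetical activations collected while the model picks each tool.
acts_search = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.1, 0.2, 0.0]]
acts_calculator = [[0.1, 1.0, 0.0], [0.0, 0.9, 0.1], [0.2, 1.1, 0.0]]

tool_directions = {
    "search": mean_vector(acts_search),
    "calculator": mean_vector(acts_calculator),
}

# Steering vector that moves "search"-like activations toward "calculator".
delta = steering_vector(acts_search, acts_calculator)

x = [1.0, 0.1, 0.05]                              # an activation that reads as "search"
steered = [xi + di for xi, di in zip(x, delta)]   # same activation after steering

print(readout(x, tool_directions), "->", readout(steered, tool_directions))
```

In the reported result the interesting part is what happens downstream of the flip (the JSON arguments following the new tool's schema); a toy readout like this cannot show that and is only meant to make the vector arithmetic concrete.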

Morning

scraped 2026-05-12 09:00 IST · 6 tweets

Summary

Quiet morning slot, no @bayesiansapien retweets. The signal cluster is @bcherny (Anthropic) reporting that Claude Cowork with Opus 4.7 one-shot booked 8 flights and 5 hotels from natural-language preferences (cluster of 2 posts on the same workflow). Same handle also flags the new Claude Code "agent view" research preview, which is the productized version of the multi-agent orchestration thread the wiki has been tracking since Sakana Conductor (2026-05-11). Beyond that: AWS announces Claude Platform GA on AWS (first cloud provider to offer Anthropic's native platform through customer accounts), and a small NVIDIA-Dell PR cluster around Jensen + Michael Dell at Dell Technologies World. Reader can stop here for the gist.

Posts

@bcherny on Cowork + Opus 4.7 booking flights end-to-end (cluster of 2) (@bcherny 00:22 UTC, @bcherny 00:22 UTC). Flight preferences go into Cowork instructions; Opus opens a browser, navigates sites, and books 8 flights and 5 hotels in parallel while the user does other work in Claude Code. Frames Opus 4.7 as the first version to one-shot the booking task. Practitioner confirmation that frontier-agent browser use is crossing into real workflows, not just demos.
Claude Code agent view (research preview) (@bcherny 23:35 UTC · referencing @claudeai launch tweet). One unified list of all in-flight Claude Code sessions instead of cycling between terminal tabs. Productizes the many-agents-per-user pattern. The mobile/desktop-agent surface is consolidating: paired with X-OmniClaw in today's digest.
Claude Platform on AWS (GA, AWS = first cloud) (@mattsgarman 19:11 UTC · AWS blog). Anthropic's native Claude Platform, including Managed Agents, Agent Skills, MCP connector, code execution, files API, accessible directly from AWS accounts. No separate credentials. Tracked in today's Industry Pulse.
NVIDIA at Dell Technologies World (cluster of 2) (@nvidia 17:49 UTC link, @nvidia 17:49 UTC keynote · NVIDIA event page). Jensen Huang and Michael Dell co-keynote on AI-accelerated enterprise compute, May 18-21 Las Vegas. PR-cycle event, included as cluster.