2026-05-12-afternoon

Summary

Dense afternoon, almost entirely from @bayesiansapien curated reposts (15 RTs, only one weak post from the AI handle feed). The single strongest signal is AutoTTS (cluster of 2): the authors' thread from @zhengtoong plus @omarsar0's analytical writeup of the same paper. Both frame it as the moment where humans stop hand-tuning test-time scaling and instead build environments where LLM agents discover the strategy (total discovery cost of $39.9 and ~160 minutes to beat hand-crafted baselines). The wiki already has a page on it from yesterday, so this is cross-source confirmation rather than a new signal. Beyond AutoTTS, the slot carries a Thinking Machines release of "interaction models" (TML-Interaction-Small, 276B MoE, 12B active, micro-turn always-on conversation), a PwC paper that breaks the "ask early" intuition for agent clarification (goal clarifications lose nearly all value past 10% of the trajectory), a steerability result showing tool-calling decisions are linearly readable and steerable across 12 instruction-tuned models, an Anthropic talk teasing memory plus a "Dreaming" feature as the next first-class primitive, and a Microsoft/Salesforce 200K-conversation analysis claiming an average 39% accuracy drop as conversations lengthen. Three reposts are opaque x.com/i/article/ links that have to be clicked through to read.

Posts

AutoTTS — frontier LLMs design their own test-time scaling strategies (cluster of 2) (@zhengtoong 01:38 UTC, @omarsar0 23:19 UTC · arxiv · wiki). Environment-driven discovery framework where humans design the search environment and coding agents discover the width-depth TTS controller. Total discovery cost $39.9 and ~160 minutes, results generalize to held-out benchmarks and model scales. Second day of independent signal on this paper.
Clarification timing in long-horizon agents (PwC) (@dair_ai · arxiv). Forced-injection framework across 4 frontier models, 84 task variants, 6,000+ runs. Goal clarification loses nearly all value after 10% of execution (pass@3 drops from 0.78 to baseline); input clarification holds through ~50%; deferring past mid-trajectory is worse than never asking. No current frontier model asks inside the empirically optimal window. Empirical brake on the "always ask early" prior.
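The two windows reported in the clarification-timing entry above can be written down as a simple gate. This is a minimal sketch, not the paper's framework: the function name, the `kind` labels, and the hard cutoffs are assumptions for illustration, and the paper reports a gradual loss of value rather than a step function.

```python
# Hypothetical gate built from the cutoffs quoted in the digest entry.
GOAL_CUTOFF = 0.10   # goal clarification loses nearly all value after ~10%
INPUT_CUTOFF = 0.50  # input clarification holds through ~50%


def worth_asking(kind: str, progress: float) -> bool:
    """Return True if a clarification of this kind is still inside its window.

    progress is the fraction of the trajectory already executed (0.0 to 1.0).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0.0 and 1.0")
    if kind == "goal":
        return progress <= GOAL_CUTOFF
    if kind == "input":
        return progress <= INPUT_CUTOFF
    raise ValueError(f"unknown clarification kind: {kind!r}")


if __name__ == "__main__":
    for kind, progress in [("goal", 0.05), ("goal", 0.30), ("input", 0.30), ("input", 0.80)]:
        print(kind, progress, worth_asking(kind, progress))
```

Past the mid-trajectory mark the gate returns False for both kinds, matching the entry's claim that deferring that long is worse than never asking.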
Tool calling is linearly readable and steerable (@tldr_ai_papers · arxiv). Probes 12 instruction-tuned models (Gemma 3, Qwen 3, Qwen 2.5, Llama 3.1, 270M to 27B). Adding the difference between two tools' mean activations (a mean-difference steering vector) flips the chosen tool with 77-100% accuracy on name-only prompts (93-100% at 4B+), and the JSON arguments autoregressively conform to the new tool's schema. Activation patching localizes the effect to a small set of mid- and late-layer attention heads. Gives a mechanistic handle on tool-selection failure modes in tool calling.
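The mean-difference steering described in the tool-calling entry above can be illustrated on toy vectors. This is a sketch under stated assumptions: the three-dimensional "activations", the tool names, and the nearest-centroid readout are all made up for illustration, and a real experiment would hook a transformer's residual stream rather than use hand-written lists.

```python
from statistics import fmean


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(source_acts, target_acts):
    """Mean-difference direction pointing from the source tool to the target tool."""
    source_mean = mean_vector(source_acts)
    target_mean = mean_vector(target_acts)
    return [t - s for s, t in zip(source_mean, target_mean)]


def steer(activation, direction, alpha=1.0):
    """Add the scaled steering direction to a single activation vector."""
    return [x + alpha * d for x, d in zip(activation, direction)]


def readout(activation, centroids):
    """Nearest-centroid readout, standing in for a linear probe."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda name: squared_distance(activation, centroids[name]))


# Toy activations (hypothetical): three samples per tool.
search_acts = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1], [1.2, 0.1, -0.1]]
calculator_acts = [[-1.0, 0.1, 0.9], [-0.9, 0.0, 1.1], [-1.1, 0.2, 1.0]]

centroids = {
    "search": mean_vector(search_acts),
    "calculator": mean_vector(calculator_acts),
}
direction = steering_vector(search_acts, calculator_acts)

h = [0.9, 0.1, 0.05]  # an activation that initially reads as "search"
steered = steer(h, direction)

if __name__ == "__main__":
    print(readout(h, centroids), "->", readout(steered, centroids))
```

The `alpha` scale is the usual knob in this kind of intervention; the entry does not say what scaling the paper uses.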
Thinking Machines — Interaction Models (TML-Interaction-Small) (@rohanpaul_ai · blog). 276B MoE, 12B active. Replaces walkie-talkie turn-taking with always-present AI: audio/video/text sliced into 200ms micro-turns, model listens-watches-speaks-acts-tool-calls while the interaction is still happening. Trained from scratch with a multi-stream micro-turn design. First production research preview of "interactivity scales alongside intelligence" as a thesis.
RAO: Recursive Agent Optimization (@apurvasgandhi). End-to-end RL for training LLMs to spawn, delegate to, and coordinate with recursive copies of themselves. Frames sub-agents as an inference-time scaling primitive (working memory, parallel decomposition) and the question as how to train the parent to exploit them. Closely adjacent to the Sakana Conductor + AutoTTS thread on learned orchestration.
Anthropic memory + "Dreaming" (continual learning preview) (@daniel_mac8). Reports on a recent Anthropic talk framing memory as the next first-class agent primitive after MCP, Skills, and harnesses: writable shared context, provenance, review, background consolidation. "Dreaming" is described as recursive self-improvement at the agent-system level, an early form of continual learning that becomes load-bearing once infinite context arrives.
GEPA explainer — RL struggles with long-horizon agents (@blc_16). Walkthrough of why sparse rewards throw away trajectory information and how GEPA learns from the trajectory itself via textual critiques, prompt edits, and Pareto-frontier selection across exploration and exploitation. Useful framing of the prompt-optimizer-as-credit-assigner direction.
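Pareto-frontier selection, mentioned in the GEPA entry above, keeps every candidate that no other candidate beats on all objectives at once. The following is a generic non-dominated filter, not GEPA's implementation; the prompt names and the two score columns are hypothetical.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
    }


# Hypothetical per-task scores for four candidate prompts (higher is better).
scores = {
    "prompt_a": (0.9, 0.2),
    "prompt_b": (0.5, 0.7),
    "prompt_c": (0.4, 0.6),  # dominated by prompt_b on both tasks
    "prompt_d": (0.2, 0.9),
}
front = pareto_front(scores)

if __name__ == "__main__":
    print(sorted(front))
```

Keeping the whole front instead of the single best average preserves specialists such as `prompt_a` and `prompt_d`, which is one way to retain exploration while still discarding strictly worse candidates.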
Microsoft + Salesforce — 200K conversations, 39% average degradation (@HowToAI_). Cited numbers: ChatGPT 96.6% to 72.6%, Gemini 97.4% to 68.1% as conversations lengthen. Attributed to an "anchoring trap": models commit to wrong assumptions early and cannot recover. Popular framing of the multi-turn drift problem; the underlying mechanism overlaps with the PwC clarification-timing paper in the same slot.
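As a quick arithmetic check on the figures cited in the Microsoft + Salesforce entry above, the snippet below converts the two before/after accuracies into relative drops. The function name is hypothetical; the numbers are the ones quoted in the entry.

```python
def relative_drop(before: float, after: float) -> float:
    """Relative accuracy drop in percent, given before/after accuracies in percent."""
    return (before - after) / before * 100.0


cited = {
    "ChatGPT": (96.6, 72.6),
    "Gemini": (97.4, 68.1),
}

if __name__ == "__main__":
    for model, (before, after) in cited.items():
        points = before - after
        print(f"{model}: {points:.1f} points, {relative_drop(before, after):.1f}% relative")
```

Both cited pairs come out below 39% (roughly 24.8% and 30.1% relative), so the 39% average presumably aggregates over more models or tasks than the two quoted here; the post as summarized does not say which.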
Curved geometry of LLM activations (@che_shr_cat). Argues the Linear Representation Hypothesis is a useful lie that breaks down fast: straight-line steering in flat Euclidean space produces "teleportation" and diversity collapse, and the real geometry is curved. Conceptually adjacent to the tool-steerability paper above but pointing in the opposite direction on whether linear edits suffice.
Nature Neuroscience — brains do not predict every word uniformly (@ValerioCapraro). Zou, Poeppel, Ding: brain activity tracks word surprisal LLM-style inside phrases but the match weakens across major phrase boundaries. Prediction is constrained by linguistic structure. A counterweight to the "humans are just next-word predictors" frame.
Claude Code's 5 architectural layers (@NainsiDwiv50980). Field guide thread covering CLAUDE.md as memory layer, plus four further layers that have nothing to do with prompting. Most of the substance is already captured in the wiki's Claude Code architecture pages, but the framing is a clean public summary.
Opaque x.com/i/article reposts (click through to read) (@AmarSVS, @AlphaSignalAI, @neural_avb). Curated retweets pointing to X-native long-form articles the synthesis cannot expand inline.
@magicsilicon "Whoa 😳" (@magicsilicon). Reaction post, no content. Skip.