social-stream · 2026-05-11

2026-05-11-evening

Summary

A thin evening slot dominated by one substantive signal: an Apple paper on moving tool-call evaluation inside the execution loop, surfaced via an omarsar0 repost. The paper proposes a reviewer agent that inspects each provisional tool call before execution, and it introduces Helpfulness and Harmfulness metrics to quantify whether the reviewer fixes more errors than it creates. The rest of the slot is filler: an older repost on LLM Wikis plus HTML artifacts as a personal workflow primitive, a DAIR.AI course landing page bundled with the Apple tweet, and one Tesla Smart Summon clip with no AI research content. Read the Apple paper; skip the rest.

Posts

  • In-loop reviewer agent for tool-calling (Apple) (@omarsar0 · paper). Moves agent evaluation from post-hoc trajectory analysis to inference-time intervention: a reviewer agent inspects each provisional tool call, injects feedback when it spots an error, and the primary agent revises before the call executes. The authors introduce Helpfulness (percent of base errors corrected) and Harmfulness (percent of correct calls degraded) so that the question of whether the reviewer is a net positive becomes measurable. Reports +5.5% on BFCL irrelevance detection and +7.1% on Tau2-Bench multi-turn. Worth tracking alongside other tool-calling work.
  • LLM Wikis + HTML artifacts as a workflow (@omarsar0). Argues that an LLM wiki captures the durable state your agents need, and HTML artifacts on top turn that state into interactive surfaces that both you and the agents can act on. Personal-workflow opinion piece, not a paper; relevant directionally because cere-bro is exactly this pattern.
  • DAIR.AI Vibe Coding Claude Code course (landing page). Promo for a paid Claude Code course bundled into the Apple-paper tweet. Skip.
  • Tesla Smart Summon clip (FSD v14.3.2) (@Tesla). Owner video of Smart Summon working in heavy rain, with pull-over behavior matching Robotaxi. Product demo, no model or training detail. Skip.
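
Note on the Apple post's metrics: the two one-line definitions above (Helpfulness as percent of base errors corrected, Harmfulness as percent of correct calls degraded) can be made concrete with a small sketch. This is an illustration of those glosses only, not the paper's code; the paper's exact definitions may differ. The names CallOutcome, helpfulness, harmfulness, and net_fixed_calls are hypothetical, and the sketch assumes each tool call has a correctness label both before and after review.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CallOutcome:
    """Hypothetical record for one tool call (not from the paper)."""
    base_correct: bool      # was the primary agent's provisional call correct?
    reviewed_correct: bool  # was the final call correct after reviewer feedback?


def helpfulness(outcomes: Sequence[CallOutcome]) -> float:
    """Percent of initially wrong calls that end up correct after review."""
    base_errors = [o for o in outcomes if not o.base_correct]
    if not base_errors:
        return 0.0  # assumption: report 0 when there was nothing to fix
    fixed = sum(o.reviewed_correct for o in base_errors)
    return 100.0 * fixed / len(base_errors)


def harmfulness(outcomes: Sequence[CallOutcome]) -> float:
    """Percent of initially correct calls that end up wrong after review."""
    base_correct = [o for o in outcomes if o.base_correct]
    if not base_correct:
        return 0.0  # assumption: report 0 when there was nothing to break
    degraded = sum(not o.reviewed_correct for o in base_correct)
    return 100.0 * degraded / len(base_correct)


def net_fixed_calls(outcomes: Sequence[CallOutcome]) -> int:
    """Calls fixed minus calls broken; positive means the reviewer helped overall."""
    fixed = sum((not o.base_correct) and o.reviewed_correct for o in outcomes)
    broken = sum(o.base_correct and (not o.reviewed_correct) for o in outcomes)
    return fixed - broken


if __name__ == "__main__":
    # Made-up labels for illustration: 4 base errors (2 fixed), 6 correct calls (1 degraded).
    sample = (
        [CallOutcome(False, True)] * 2
        + [CallOutcome(False, False)] * 2
        + [CallOutcome(True, True)] * 5
        + [CallOutcome(True, False)]
    )
    print(f"helpfulness: {helpfulness(sample):.1f}%")
    print(f"harmfulness: {harmfulness(sample):.1f}%")
    print(f"net fixed calls: {net_fixed_calls(sample)}")
```

The two percentages use different denominators (base errors versus base-correct calls), so they cannot be subtracted directly to judge net benefit; that is why the sketch also counts fixed minus broken calls.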