May 6, 2026 · daily digest

cere-bro | 2026-05-06

847 real deployed agents, 91% compromised by tool-chaining attacks. The week's agent benchmark cluster said capability is limited. Today's Marcus paper says the infrastructure underneath production agents is already broken.


TL;DR


The Big Picture

The week has been building a two-sided argument and today's sources close it. One side is capability limits: PhysicianBench (46%), AcademiClaw (55%), ProgramBench (0%). The other side is Clark's Import AI timeline (05-05): fully autonomous AI R&D by end of 2028, 60% probability. Both sides of this argument are right, and they point at the same thing. Current agents are brittle at complex tasks. The benchmarks measure that. Clark's argument is that the engineering iteration loop — not creativity — is what gets automated first, and the engineering loop does not require solving PhysicianBench. It requires solving the repetitive 80% of coding, testing, and evaluation tasks where today's models are already close. The capability ceiling papers are measuring the wrong thing if you think Clark is right.

The Marcus agent security paper is a third thread that intersects both. It is not a capability study. It is a deployment study: 847 real agents, in production, across regulated industries. 91% were compromised by tool-chaining attacks, where the agent's own tool permissions become the attack surface. 89.4% showed goal drift after roughly 30 steps. 94% of agents using memory augmentation were poisonable via memory injection. Section 9 of the paper documents the OpenClaw/Moltbook incident: 770,000 live agents simultaneously compromised via a single database exploit, each with privileged access to its owner's machine, email, and files. This is not a red-team exercise. This is a documented incident. The wiki has been tracking T^2PO and Step-Level Optimization as the training-side and inference-side stability solutions for agent trajectories. They address the capability problem. The security problem is structurally different, and neither paper touches it.

The Musk trial has produced something important for the wiki's technical threads. Under cross-examination, Musk acknowledged that xAI "partly" trained on OpenAI models via distillation, calling it "standard practice." Lambert's Distillation Panic piece (05-05) spent three pages arguing everyone does this. Now there is sworn testimony from the most prominent critic of OpenAI confirming it for his own lab. That is the clearest data point so far that distillation from closed models is not a fringe abuse pattern but a structural industry practice, and any legislation that treats it as an attack surface will land on everyone.


Deep Dives


Autonomous Agents are a Shitshow — Gary Marcus + Stanford/MIT/CMU Study

91% of 847 production agent deployments vulnerable to tool-chaining attacks. 89.4% goal drift after 30 steps. 770,000 agents simultaneously compromised in a single documented incident.

Source: Marcus on AI (Substack) + underlying paper from Stanford, MIT CSAIL, CMU, ITU Copenhagen, NVIDIA, Elloe AI Labs
Links: Post
Tier: 1 (agent architecture, security, deployment)

ATTACK SURFACE TAXONOMY (from paper)
─────────────────────────────────────────────
Tool-chaining attacks      91% of agents vulnerable
  (agent's own tool permissions become the weapon)

Goal drift                 89.4% of agents after ~30 steps
  (trajectory deviates from original task without detection)

Memory poisoning           94% of memory-augmented agents
  (adversarial state injected via past context)

Stateless LLMs             substantially less vulnerable
  (no accumulated state to poison)

The paper's key insight is that agentic deployments are categorically more vulnerable than stateless LLMs. The reason is structural, not parametric. An agent accumulates tool permissions, memory state, and execution context across turns. Each of those is an attack surface that does not exist in a single-turn call. Tool-chaining attacks exploit the permission chain: the attacker does not need to compromise the model, only to construct a task that causes the agent to invoke tools in a sequence that produces the attacker's desired outcome. The agent never deviates from its stated objective; it follows it into a trap.
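The permission-chain idea can be made concrete with a small sketch. Everything named here is an illustrative assumption and not taken from the paper: the tool names, the capability labels, and the "private read followed by external send" rule. The point is only that each call can be individually permitted while the sequence is not.

```python
from dataclasses import dataclass

# Hypothetical capability labels; the paper does not define these.
READS_PRIVATE = "reads_private"
SENDS_EXTERNAL = "sends_external"


@dataclass(frozen=True)
class Tool:
    name: str
    capabilities: frozenset


# Illustrative tool registry (all names are made up for this sketch).
TOOLS = {
    "read_inbox": Tool("read_inbox", frozenset({READS_PRIVATE})),
    "summarize": Tool("summarize", frozenset()),
    "http_post": Tool("http_post", frozenset({SENDS_EXTERNAL})),
}


def find_risky_chain(calls):
    """Return (i, j) for the first private read followed by an external send, else None.

    Every call may be individually allowed; the risk is in the ordering.
    """
    first_private_read = None
    for index, name in enumerate(calls):
        caps = TOOLS[name].capabilities
        if READS_PRIVATE in caps and first_private_read is None:
            first_private_read = index
        if SENDS_EXTERNAL in caps and first_private_read is not None:
            return (first_private_read, index)
    return None


benign = ["summarize", "http_post"]
chained = ["read_inbox", "summarize", "http_post"]
print(find_risky_chain(benign))   # None
print(find_risky_chain(chained))  # (0, 2)
```

A per-call allowlist would pass both sequences. Only a check over the whole trajectory distinguishes them, which is why the attack surface grows with accumulated permissions.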

The 89.4% drift figure is the one that intersects the wiki's training-side work. Goal drift at step 30 means the agent is still generating coherent-looking outputs while no longer pursuing the user's goal. T^2PO (05-05) addresses training-time instability by detecting low-information action chains. Step-Level Optimization (05-02) addresses inference-time drift by detecting trajectory stalls. Neither was evaluated on the adversarial injection cases the Marcus paper describes. The open question is whether an agent trained with T^2PO and monitored with Step-Level Optimization drifts less under adversarial memory injection. The two mechanisms operate on the same signal (trajectory information stalls) but have not been composed on adversarial inputs.

The OpenClaw/Moltbook incident (section 9) is the part of the paper that matters most for the digest's broader narrative. 770,000 live agents, compromised simultaneously via a single database exploit, each with privileged machine, email, and file access. The first author calls this "the first real-world empirical validation of the agentic threat model at scale." The wiki's agent benchmark cluster (AcademiClaw, PhysicianBench, ProgramBench) measures capability on well-formed tasks. Section 9 is the reminder that the deployment environment is not well-formed.

The paper also connects to the Defense Trilemma (04-30) and the reward-hacking-grows-with-tools result (Wang/Huang, 04-30). The trilemma showed that complete safety guarantees across capability, alignment, and robustness are NP-hard. The Marcus paper provides the first empirical population estimate of what partial robustness looks like at production scale: at the current state of practice, 9% of agents are robust to tool-chaining and 6% of memory-augmented agents are robust to memory poisoning. Those numbers will be worth tracking as harness design improves.

Why it matters: This is the first large-scale empirical audit of production agent deployments. The capability papers tell you agents fail at hard tasks. This paper tells you agents fail at safe operation on ordinary tasks when an adversary is present. Under Clark's automated AI R&D timeline, both failure modes have to be solved before the timeline closes.

Research angle: (1) T^2PO + Step-Level Optimization composition under adversarial memory injection: does training-time uncertainty control reduce inference-time susceptibility to poisoning? The mechanisms are adjacent but untested together in adversarial settings. (2) Tool permission minimization as a security primitive: what is the minimum tool set that allows task completion without creating exploitable permission chains? No current benchmark measures this. (3) Formal drift bound under adversarial context: the 30-step drift number needs a mechanistic explanation. Is it token accumulation, attention dilution, or context-window position effects?


Musk v. Altman, Week 2 — Last Week in AI #340

Musk testifies xAI "partly" distilled from OpenAI models. Brockman reveals $30B stake. OpenAI IPO floated. Trial enters week 2 with Sam Altman and Shivon Zilis still to testify.

Source: Last Week in AI #340 + Gary Marcus Gmail
Links: Newsletter
Tier: 1 (AI industry, governance, distillation)

The most technically important admission of the trial came during cross-examination. OpenAI's lead counsel asked Musk whether xAI trained on OpenAI's outputs. Musk acknowledged "partly," adding "that's standard practice." That two-word qualifier — "standard practice" — is load-bearing. Lambert's Distillation Panic piece (05-05) spent three pages arguing exactly this. Now there is sworn testimony from the plaintiff confirming it for his own lab. If the court had been inclined to draw a clean line between legitimate distillation and abusive distillation, Musk's admission makes the line harder to draw: the most prominent critic of OpenAI used the technique he is suing over.

The financial disclosures are the second substantive thread. Brockman confirmed he owns close to $30B in OpenAI shares (which would make him one of the world's wealthiest people) plus $471M in Stripe shares. He confirmed OpenAI is exploring an IPO at its current $850B private valuation. The context matters for the wiki's industry thread: Anthropic and OpenAI both announced services arms this week (05-05 digest, Industry Pulse). OpenAI's fundraise for its "Deployment Company" looks different if an IPO is on the table by end of 2026 or early 2027. The services arm, the Microsoft deal restructuring, and the IPO signal are three moves pointing at the same transition: OpenAI is shifting from a product company to a platform company in preparation for public markets.

Musk's "you and Sam will be the most hated men in America" text to Brockman, sent two days before the trial started, is the kind of detail that only matters if you are tracking the interpersonal dynamics. The wiki is not. What matters is the timeline: Karpathy's Tesla-OpenAI merger suggestion and Musk's early push for control are now on the public record. The founding documents will be discoverable throughout the trial. Sam Altman and Shivon Zilis are scheduled to testify in the second half of the month.

Why it matters: The trial is slowly building a public record of the early decisions that shaped how OpenAI operates. The distillation admission is the most technically relevant piece so far. Everything else is context for how the industry's dominant lab got to where it is.


DeepSeek V4 Preview — Pro and Flash

1.6T parameters / 49B active (Pro). 1M-token context. Open-sourced weights. Claims to close the gap with frontier models on reasoning benchmarks.

Source: Last Week in AI #340
Links: Newsletter · Weights
Tier: 1 (foundation models, MoE, open-source)

DeepSeek V4 Pro is a 1.6T parameter MoE with 49B active parameters and a 1M-token context window. V4 Flash is 284B total / 13B active. Both are text-only, both are fully open-sourced on Hugging Face with a detailed tech report. The key claim is major efficiency and performance improvements over V3.2.

The architecture context matters. V3.2 was already the most capable open-weight model at its release. The MoE profile for V4 Pro (1.6T total / 49B active, roughly a 3% activation ratio) is more aggressive sparsity than V3 (671B total / 37B active). The 1M-token context is a direct response to Gemini 1.5's long-context position.
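The sparsity comparison is simple arithmetic on the parameter counts quoted in this digest. A short check, using only those numbers:

```python
# Total and active parameter counts in billions, as quoted in the digest.
MODELS = {
    "V3": (671, 37),
    "V4 Pro": (1600, 49),
    "V4 Flash": (284, 13),
}


def activation_ratio(total_b, active_b):
    """Fraction of parameters active per token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    print(f"{name}: {activation_ratio(total_b, active_b):.1%}")
# V3: 5.5%
# V4 Pro: 3.1%
# V4 Flash: 4.6%
```

By this measure V4 Pro activates a little over half the fraction of its weights per token that V3 did.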

For the wiki's inference-efficiency thread, the scale jump is the thing to track. Running V4 Pro inference at reasonable latency with 49B active parameters requires a multi-GPU setup. The quantization and distillation questions that the wiki tracked with V3 (TurboQuant, PrfaaS, BLD) apply directly to V4. The 49B active parameter count is within reach of the compression techniques the wiki has been tracking, which means the open-weight V4 ecosystem will produce a wave of quantized and distilled variants within weeks. The Musk trial admission that xAI distilled from OpenAI makes V4's open weights the most obvious distillation target in the near term.
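A back-of-the-envelope estimate shows why the multi-GPU requirement follows from the total parameter count and not from the active one. This sketch counts weights only, ignoring KV cache and activations. It assumes all experts stay resident in device memory with no offloading, and it uses an 80 GB accelerator as an illustrative assumption. None of these deployment details come from the tech report.

```python
import math

TOTAL_PARAMS = 1.6e12   # V4 Pro total parameters, from the digest
GPU_MEMORY_GB = 80      # assumption: one 80 GB accelerator


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus(num_params, bits_per_param, gpu_gb=GPU_MEMORY_GB):
    """Lower bound on devices needed just to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / gpu_gb)


for bits in (16, 8, 4):
    gb = weight_memory_gb(TOTAL_PARAMS, bits)
    print(f"{bits}-bit: {gb:.0f} GB of weights, at least {min_gpus(TOTAL_PARAMS, bits)} GPUs")
# 16-bit: 3200 GB of weights, at least 40 GPUs
# 8-bit: 1600 GB of weights, at least 20 GPUs
# 4-bit: 800 GB of weights, at least 10 GPUs
```

The 49B active count governs compute per token. The memory floor is set by all 1.6T parameters, which is where weight quantization matters most.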

Why it matters: V4 Pro is the first open MoE that plausibly competes with GPT-5-class models on reasoning. Open weights at this scale change the inference deployment calculus for anyone not running on closed APIs.

Research angle: (1) V4 Pro + TurboQuant: 2.5-3.5 bit KV compression was demonstrated on smaller MoEs. The activation sparsity pattern in V4 Pro is different. What is the quality-compression tradeoff at 49B active? (2) Distillation from V4 Pro: open weights enable white-box distillation that was not possible with V3. The first paper to characterize V4 Pro's internal representations will define the next wave of specialist models.


Ctx2Skill — Self-Evolving Skill Extraction for Context Learning

A multi-agent self-play loop that autonomously discovers, refines, and selects context-specific skills without human supervision or external feedback. Challenger generates probing tasks. Reasoner attempts solutions guided by an evolving skill set. Judge provides binary feedback.

Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 2 (agent architecture, skill extraction, context learning)

CTX2SKILL SELF-PLAY LOOP
──────────────────────────────────────────────────────
Challenger  →  generates probing tasks + rubrics from context
Reasoner    →  attempts tasks using evolving skill set
Judge       →  binary feedback (success / failure)

On failure:
  Proposer  →  analyzes failure, generates skill update proposals
  Generator →  synthesizes proposals into targeted skill updates

Cross-time Replay mechanism:
  Selects skill set with best balance across representative cases
  (prevents adversarial collapse from increasingly extreme Challenger tasks)
──────────────────────────────────────────────────────
Output: portable skills pluggable into any LM for better context learning

The core problem Ctx2Skill attacks is that real-world tasks often depend on long, technically dense contexts where the relevant knowledge cannot be memorized parametrically. The solution — extracting reusable "skills" from that context — is intuitive, but building those skills automatically is hard. Manual annotation is expensive, and automated pipelines need feedback. Ctx2Skill's self-play loop sidesteps both problems by having agents generate their own probing tasks, fail on them, and synthesize the failures into skill updates.

The Cross-time Replay mechanism is the part worth keeping. Without it, the Challenger generates increasingly pathological probing tasks as the skill set improves, and both agents overfit to an adversarial game rather than generalizing. Replay forces skill selection to stay grounded across a representative sample of earlier cases. This is the same role that a held-out evaluation set plays in standard training, implemented as a within-loop mechanism.
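The control flow of the loop and the replay step can be sketched with stub roles. This is a toy under stated assumptions, not the paper's algorithm. Every function body below is a hypothetical stand-in: the Reasoner "succeeds" if a skill name matches the task topic, and Proposer and Generator are collapsed into one function. Only the ordering of roles follows the diagram above.

```python
def challenger(context, round_idx):
    """Stub: emit a probing task drawn from the context."""
    return {"topic": context[round_idx % len(context)]}


def reasoner(task, skills):
    """Stub: the attempt succeeds only if a skill covers the task's topic."""
    return task["topic"] in skills


def judge(attempt_ok):
    """Binary feedback, as in the digest's diagram."""
    return bool(attempt_ok)


def propose_and_generate(task, skills):
    """Stub for Proposer + Generator: add a skill for the failed topic."""
    return skills | {task["topic"]}


def replay_score(skills, replay_cases):
    """Fraction of earlier representative cases the skill set still solves."""
    if not replay_cases:
        return 0.0
    return sum(reasoner(t, skills) for t in replay_cases) / len(replay_cases)


def self_play(context, rounds):
    skills, replay_cases, snapshots = frozenset(), [], []
    for r in range(rounds):
        task = challenger(context, r)
        if not judge(reasoner(task, skills)):
            skills = propose_and_generate(task, skills)
        replay_cases.append(task)
        snapshots.append(skills)
    # Cross-time replay (simplified): pick the snapshot that does best across
    # all stored cases, not simply the latest one.
    return max(snapshots, key=lambda s: replay_score(s, replay_cases))


best = self_play(["units", "schema", "api"], rounds=5)
print(sorted(best))  # ['api', 'schema', 'units']
```

In this toy the skill updates only ever add coverage, so the best snapshot is also the latest. Selection over snapshots only starts to matter once an update can regress on earlier cases, which is the failure mode the replay mechanism is described as guarding against.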

The connection to the wiki's agent harness thread is direct. Ken Huang's pentester study (05-05) showed that belief-state propagation — accumulating structured intermediate conclusions rather than raw tool outputs — is what separates capable agents from pattern-matching wrappers. Ctx2Skill is building that belief state at a different granularity: not per-step evidence accumulation but per-context skill crystallization. The Reasoner's skill set is a belief state about what this specific context requires. The two architectures are complementary and have not been combined.

Why it matters: Ctx2Skill is the first clean self-supervised approach to skill extraction that includes a stability mechanism (Cross-time Replay) to prevent adversarial collapse. On CL-Bench, it consistently improves solving rates across backbone models. The plug-in portability means it can be layered onto any existing agent without retraining.


Industry Pulse


Connecting the Dots

The agent security paper and the benchmark cluster compose into a single argument this week.

BENCHMARK CLUSTER (capability ceiling, this week)
────────────────────────────────────────────────
PhysicianBench (05-05)      46%   clinical EHR workflows
AcademiClaw (05-05)         55%   academic-level multi-step tasks
ProgramBench (05-06)         0%   recreate real programs from scratch
T^2PO eval suite (05-05)   ~60%   WebShop / ALFWorld / Search QA

SECURITY AUDIT (production deployment, 05-06)
──────────────────────────────────────────────
Tool-chaining attacks     91% vulnerable
Goal drift at step 30     89.4%
Memory poisoning          94% (memory-augmented agents)
Real incident             770K agents compromised simultaneously

Clark's Import AI timeline (05-05): 60% P(full auto AI R&D) by end 2028

The capability papers say agents cannot do hard tasks reliably. The security paper says agents in production are already being compromised on easy tasks. Clark's timeline says the engineering iteration loop — the easy part, not the hard part — is what closes first. These three arguments are not in tension. They are a coherent picture: the engineering-schlep automation Clark describes does not require solving PhysicianBench. It requires not getting tool-chain-attacked while running the repetitive parts of the eval loop. The security paper is the more pressing constraint on the timeline, not the capability papers.

The Musk distillation admission connects directly to Lambert's Distillation Panic (05-05). Lambert argued that "distillation attack" is a misnomer that will entangle legitimate practice. The trial just produced sworn testimony from the plaintiff that xAI itself uses the technique. Worth Watching from 05-05 predicted that the trial would surface more technical admissions. It did, one week in.

Cross-day thread: T^2PO (05-05) addresses training-time instability in multi-turn agents by detecting low-information action chains. The Marcus paper (05-06) shows that 89.4% of production agents exhibit goal drift after 30 steps. The mechanisms are adjacent: T^2PO's per-turn exploration signal is exactly what would need to be active to detect the drift the Marcus paper measures. Whether T^2PO-trained agents show reduced drift in production is the obvious empirical question. No one has tested it.


Worth Watching


Quick Hits

Ctx2Skill across backbones. Ctx2Skill's self-evolving skill loop improves solving rates on CL-Bench across all tested backbones. The plug-in portability without retraining is the practical hook. (Paper)

PFlowNet: RL-grounded visual reasoning. Perceptual Flow Network separates perception from reasoning, combining multi-dimensional rewards with geometric shaping via variational RL. New state-of-the-art on V* Bench (90.6%) and MME-RealWorld-lite (67.0%). Tier 3 multimodal, but the variational RL mechanism is adjacent to T^2PO's uncertainty-based intervention. (Paper)

OceanPile. A large-scale multimodal ocean corpus with sonar data, underwater imagery, and marine text, built around an Ocean Concept Knowledge Graph for instruction alignment. Scientific domain MLLMs are a useful existence proof that multimodal fine-tuning generalizes far outside the standard vision-language benchmarks. Tier 4 otherwise. (Paper)

ComboStoc. Diffusion models trained on high-dimensional data with structured attributes underfit the combinatorial complexity. ComboStoc introduces stochastic processes over combinatorial structures, enabling asynchronous timestep generation per dimension. Training speedup demonstrated on images and 3D shapes. Tier 4 for this wiki but the combinatorial-structure framing is interesting for anyone working on structured generation. (Paper)

Designing Data-Intensive Applications, 2nd edition. The Pragmatic Engineer ran an excerpt from Martin Kleppmann and Chris Riccomini's update to the 2016 classic. Chapter 1 covers cloud vs. self-hosting tradeoffs. Not AI-specific, but the build-vs-buy framing applies directly to the inference infrastructure decisions labs are making now. Worth reading if you work on deployment.


Sources ingested today: 16 | Wiki pages updated: 7