cere-bro | 2026-05-06
847 real deployed agents, 91% vulnerable to tool-chaining attacks. The week's agent benchmark cluster said capability is limited. Today's Marcus paper says the infrastructure underneath production agents is already broken.
TL;DR
- Agent Security Study — 847 deployed agents across healthcare, finance, customer service: 91% vulnerable to tool-chaining attacks, 89.4% drift after 30 steps, 94% of memory-augmented agents poisonable. The first large-scale empirical security audit of production agentic deployments. Tier 1.
- Musk admits xAI distilled from OpenAI — under cross-examination at week 2 of the trial. Brockman reveals $30B stake and IPO plans for the company now valued at $850B. Musk's earlier texts: "by the end of this week, you and Sam will be the most hated men in America."
- DeepSeek V4 — 1.6T parameters / 49B active (Pro), 1M-token context, open-sourced weights. Preview models claim to close the gap with frontier. First open MoE that seriously challenges the top tier on reasoning.
- OpenAI/Microsoft renegotiation — Microsoft's open-ended exclusivity is replaced with a nonexclusive license through 2032. OpenAI can now serve all clouds. AWS Bedrock gets OpenAI models.
- ProgramBench — a new benchmark every current LLM gets 0% on: recreate real programs (ffmpeg, SQLite, ripgrep) from scratch, no internet. Complements AcademiClaw (55%) and PhysicianBench (46%) from yesterday. The capability ceiling cluster is forming.
- Ctx2Skill — self-evolving multi-agent loop that extracts reusable skills from dense context without human supervision. Challenger/Reasoner/Judge with cross-time replay. Tier 2 agent architecture.
The Big Picture
The week has been building a two-sided argument and today's sources close it. One side is capability limits: PhysicianBench (46%), AcademiClaw (55%), ProgramBench (0%). The other side is Clark's Import AI timeline (05-05): fully autonomous AI R&D by end of 2028, 60% probability. Both sides of this argument are right, and they point at the same thing. Current agents are brittle at complex tasks. The benchmarks measure that. Clark's argument is that the engineering iteration loop — not creativity — is what gets automated first, and the engineering loop does not require solving PhysicianBench. It requires solving the repetitive 80% of coding, testing, and evaluation tasks where today's models are already close. The capability ceiling papers are measuring the wrong thing if you think Clark is right.
The Marcus agent security paper is a third thread that intersects both. It is not a capability study. It is a deployment study: 847 real agents, in production, across regulated industries. 91% were vulnerable to tool-chaining attacks, in which the agent's own tool permissions become the attack surface. 89.4% showed goal drift after roughly 30 steps. 94% of agents using memory augmentation were poisonable via memory injection. Section 9 of the paper documents the OpenClaw/Moltbook incident, in which 770,000 live agents were compromised simultaneously via a single database exploit, each with privileged access to its owner's machine, email, and files. This is not a red-team exercise. This is a documented incident. The wiki has been tracking T^2PO and Step-Level Optimization as the training and inference-side stability solutions for agent trajectories. They address the capability problem. The security problem is structurally different and neither paper touches it.
The Musk trial has produced something important for the wiki's technical threads. Under cross-examination, Musk acknowledged that xAI "partly" trained on OpenAI models via distillation, calling it "standard practice." Lambert's Distillation Panic piece (05-05) spent three pages arguing everyone does this. Now there is sworn testimony from the most prominent critic of OpenAI confirming it for his own lab. That is the clearest data point so far that distillation from closed models is not a fringe abuse pattern but a structural industry practice, and any legislation that treats it as an attack surface will land on everyone.
Deep Dives
Autonomous Agents are a Shitshow — Gary Marcus + Stanford/MIT/CMU Study
91% of 847 production agent deployments vulnerable to tool-chaining attacks. 89.4% goal drift after 30 steps. 770,000 agents simultaneously compromised in a single documented incident.
Source: Marcus on AI (Substack) + underlying paper from Stanford, MIT CSAIL, CMU, ITU Copenhagen, NVIDIA, Elloe AI Labs
Links: Post
Tier: 1 (agent architecture, security, deployment)
ATTACK SURFACE TAXONOMY (from paper)
─────────────────────────────────────────────
Tool-chaining attacks    91% of agents vulnerable
                         (agent's own tool permissions become the weapon)
Goal drift               89.4% of agents after ~30 steps
                         (trajectory deviates from original task without detection)
Memory poisoning         94% of memory-augmented agents
                         (adversarial state injected via past context)
Stateless LLMs           substantially less vulnerable
                         (no accumulated state to poison)
The paper's key insight is that agentic deployments are categorically more vulnerable than stateless LLMs. The reason is structural, not parametric. An agent accumulates tool permissions, memory state, and execution context across turns. Each of those is an attack surface that does not exist in a single-turn call. Tool-chaining attacks exploit the permission chain: the attacker does not need to compromise the model, only to construct a task that causes the agent to invoke tools in a sequence that produces the attacker's desired outcome. The agent never deviates from its stated objective; it follows it into a trap.
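A minimal sketch of the permission-chain idea, assuming a hypothetical model in which every tool is tagged with the data labels it consumes and produces. Nothing here is taken from the paper: the `Tool` type, the labels, the tool names, and `risky_chains` are all illustrative. The point it shows is that each tool can look harmless in isolation while a sequence of them moves sensitive data to an external channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    consumes: frozenset  # labels the tool accepts as input (hypothetical tagging)
    produces: frozenset  # labels the tool emits as output


def risky_chains(tools, sensitive, external, max_len=4):
    """List tool sequences whose data flow starts at a sensitive label
    and ends at an external label."""
    chains = []

    def extend(path):
        last = path[-1]
        if last.produces & external:
            chains.append([t.name for t in path])
            return
        if len(path) == max_len:
            return
        for nxt in tools:
            # A tool can follow another when it accepts what the previous one emits.
            if nxt not in path and last.produces & nxt.consumes:
                extend(path + [nxt])

    for tool in tools:
        if tool.consumes & sensitive:
            extend([tool])
    return chains


tools = [
    Tool("read_inbox", frozenset({"mailbox"}), frozenset({"text"})),
    Tool("summarize", frozenset({"text"}), frozenset({"text"})),
    Tool("http_post", frozenset({"text"}), frozenset({"network"})),
    Tool("calendar_lookup", frozenset({"calendar"}), frozenset({"event"})),
]

found = risky_chains(
    tools, sensitive=frozenset({"mailbox"}), external=frozenset({"network"})
)
for chain in found:
    print(" -> ".join(chain))
```

In this toy registry no single tool both reads the mailbox and writes to the network, yet two chains do. A real deployment would need far richer labels than this sketch assumes.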
The 89.4% drift figure is the one that intersects the wiki's training-side work. Goal drift at step 30 means the agent is still generating coherent-looking outputs while no longer pursuing the user's goal. T^2PO (05-05) addresses training-time instability by detecting low-information action chains. Step-Level Optimization (05-02) addresses inference-time drift by detecting trajectory stalls. Neither was evaluated on the adversarial injection cases the Marcus paper describes. The open question is whether an agent trained with T^2PO and monitored with Step-Level Optimization drifts less under adversarial memory injection. The mechanisms operate on the same signal (trajectory information stalls) but have not been composed on adversarial inputs.
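To make "coherent-looking but off-goal" concrete, here is a deliberately crude drift monitor. It is an assumption-heavy sketch, not the paper's measurement and not T^2PO or Step-Level Optimization: it compares the goal text with a sliding window of recent actions using Jaccard token overlap. The `DriftMonitor` class, the window size, and the threshold are all hypothetical.

```python
from collections import deque


def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class DriftMonitor:
    """Flags drift when recent actions share too few tokens with the goal."""

    def __init__(self, goal, window=5, threshold=0.1):
        self.goal = tokens(goal)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, action):
        """Record one action; return True when the full window looks off-goal."""
        self.recent.append(tokens(action))
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history yet
        window_tokens = set().union(*self.recent)
        return jaccard(self.goal, window_tokens) < self.threshold


monitor = DriftMonitor("refund order 1234 for customer", window=3, threshold=0.1)
steps = [
    "lookup order 1234",
    "check refund policy for order",
    "issue refund to customer",
    "read notes file",
    "export contacts list",
    "upload archive to server",
]
flags = [monitor.observe(step) for step in steps]
print(flags)
```

Only the last step is flagged, once the whole window has stopped mentioning anything in the goal. Token overlap is trivially easy for an adversary to game, which is part of why the composition question above stays open.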
The OpenClaw/Moltbook incident (section 9) is the part of the paper that matters most for the digest's broader narrative. 770,000 live agents, compromised simultaneously via a single database exploit, each with privileged machine, email, and file access. The first author calls this "the first real-world empirical validation of the agentic threat model at scale." The wiki's agent benchmark cluster (AcademiClaw, PhysicianBench, ProgramBench) measures capability on well-formed tasks. Section 9 is the reminder that the deployment environment is not well-formed.
This connects to the Defense Trilemma (04-30) and the reward-hacking-grows-with-tools result (Wang/Huang, 04-30). The trilemma showed that complete safety guarantees across capability, alignment, and robustness are NP-hard. The Marcus paper provides the first empirical population estimate of what partial robustness looks like at production scale: 9% of agents robust to tool-chaining and 6% of memory-augmented agents robust to memory poisoning at the current state of practice. Those numbers will be worth tracking as harness design improves.
Why it matters: This is the first large-scale empirical audit of production agent deployments. The capability papers tell you agents fail at hard tasks. This paper tells you agents fail at safe operation on ordinary tasks when an adversary is present. Under Clark's automated AI R&D timeline, both failure modes have to be solved before the timeline closes.
Research angle: (1) T^2PO + Step-Level Optimization composition under adversarial memory injection: does training-time uncertainty control reduce inference-time susceptibility to poisoning? The mechanisms are adjacent but untested together in adversarial settings. (2) Tool permission minimization as a security primitive: what is the minimum tool set that allows task completion without creating exploitable permission chains? No current benchmark measures this. (3) Formal drift bound under adversarial context: the 30-step drift number needs a mechanistic explanation. Is it token accumulation, attention dilution, or context-window position effects?
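Research angle (2) can be phrased as a small search problem. The sketch below assumes a hypothetical registry that maps each tool to the capabilities it grants, then picks the subset that covers a task's required capabilities with the fewest surplus capabilities, and the fewest tools as a tie-breaker. The scoring rule, the tool names, and `least_privilege_tools` are illustrative assumptions. The sketch also ignores exploitable chains between tools, which is the harder part of the question.

```python
from itertools import combinations


def least_privilege_tools(required, tools):
    """Exhaustively search tool subsets that cover `required`.

    Prefers the fewest surplus capabilities, then the fewest tools.
    Exponential in the number of tools, so only suitable for small registries.
    """
    required = set(required)
    names = sorted(tools)
    best = None
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            granted = set().union(*(tools[name] for name in subset))
            if not required <= granted:
                continue  # this subset cannot complete the task
            score = (len(granted - required), size)
            if best is None or score < best[0]:
                best = (score, list(subset))
    if best is None:
        raise ValueError("required capabilities cannot be covered")
    return best[1]


tools = {
    "crm_read": {"read_customer"},
    "crm_admin": {"read_customer", "write_customer", "delete_customer"},
    "mailer": {"send_email"},
    "shell": {"read_customer", "send_email", "run_command"},
}
chosen_tools = least_privilege_tools({"read_customer", "send_email"}, tools)
print(chosen_tools)
```

The single `shell` tool would cover the task with one grant, but it carries `run_command` as surplus, so the search prefers the two narrower tools.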
Musk v. Altman, Week 2 — Last Week in AI #340
Musk testifies xAI "partly" distilled from OpenAI models. Brockman reveals $30B stake. OpenAI IPO floated. Trial enters week 2 with Sam Altman and Shivon Zilis still to testify.
Source: Last Week in AI #340 + Gary Marcus Gmail
Links: Newsletter
Tier: 1 (AI industry, governance, distillation)
The most technically important admission of the trial came during cross-examination. OpenAI's lead counsel asked Musk whether xAI trained on OpenAI's outputs. Musk acknowledged "partly," adding "that's standard practice." That qualifier, "standard practice," is load-bearing. Lambert's Distillation Panic piece (05-05) spent three pages arguing exactly this. Now there is sworn testimony from the plaintiff confirming it for his own lab. If the court had been inclined to draw a clean line between legitimate distillation and abusive distillation, Musk's admission makes the line harder to draw: the most prominent critic of OpenAI built on OpenAI's models himself.
The financial disclosures are the second substantive thread. Brockman confirmed he owns close to $30B in OpenAI shares (which would make him one of the world's wealthiest people) plus $471M in Stripe shares. He confirmed OpenAI is exploring an IPO at its current $850B private valuation. The context matters for the wiki's industry thread: Anthropic and OpenAI both announced services arms this week (05-05 digest, Industry Pulse). OpenAI's fundraise for its "Deployment Company" looks different if an IPO is on the table by end of 2026 or early 2027. The services arm, the Microsoft deal restructuring, and the IPO signal are three moves pointing at the same transition: OpenAI is shifting from a product company to a platform company in preparation for public markets.
Musk's "you and Sam will be the most hated men in America" text to Brockman, sent two days before the trial started, is the kind of detail that only matters if you are tracking the interpersonal dynamics. The wiki is not. What matters is the timeline: Karpathy's Tesla-OpenAI merger suggestion and Musk's early push for control are now on the public record. The founding documents will be discoverable throughout the trial. Sam Altman and Shivon Zilis are scheduled to testify in the second half of the month.
Why it matters: The trial is slowly building a public record of the early decisions that shaped how OpenAI operates. The distillation admission is the most technically relevant piece so far. Everything else is context for how the industry's dominant lab got to where it is.
DeepSeek V4 Preview — Pro and Flash
1.6T parameters / 49B active (Pro). 1M-token context. Open-sourced weights. Claims to close the gap with frontier models on reasoning benchmarks.
Source: Last Week in AI #340
Links: Newsletter · Weights
Tier: 1 (foundation models, MoE, open-source)
DeepSeek V4 Pro is a 1.6T parameter MoE with 49B active parameters and a 1M-token context window. V4 Flash is 284B total / 13B active. Both are text-only, both are fully open-sourced on Hugging Face with a detailed tech report. The key claim is major efficiency and performance improvements over V3.2.
The architecture context matters. V3.2 was already the most capable open-weight model at its release. The MoE profile for V4 Pro (1.6T total / 49B active, roughly a 3% activation ratio) is sparser than V3's (671B total / 37B active, roughly 5.5%). The 1M-token context is a direct response to Gemini 1.5's long-context position.
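The sparsity comparison is plain arithmetic on the parameter counts quoted in this digest. A short check, with the figures in billions:

```python
# (total, active) parameter counts in billions, as quoted in the digest.
models = {
    "V3": (671, 37),
    "V4 Flash": (284, 13),
    "V4 Pro": (1600, 49),
}

ratios = {name: active / total for name, (total, active) in models.items()}
for name, ratio in ratios.items():
    print(f"{name:9s} {ratio:.1%} of parameters active per token")
```

By this measure V4 Pro is the sparsest of the three, with V4 Flash sitting between it and V3.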
For the wiki's inference-efficiency thread, the scale jump is the thing to track. Only 49B parameters are active per token, but serving V4 Pro at reasonable latency still requires a multi-GPU setup, because the full 1.6T parameters typically have to stay loaded for expert routing. The quantization and distillation questions that the wiki tracked with V3 (TurboQuant, PrfaaS, BLD) apply directly to V4. The 49B active parameter count is within reach of the compression techniques the wiki has been tracking, which means the open-weight V4 ecosystem will produce a wave of quantized and distilled variants within weeks. The Musk trial admission that xAI distilled from OpenAI makes V4's open weights the most obvious distillation target in the near term.
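A back-of-envelope estimate of weight memory shows why the setup is multi-GPU. The parameter counts come from this digest. The 80 GB accelerator size is an assumption, and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead.

```python
import math

TOTAL_PARAMS = 1.6e12   # V4 Pro total parameters, from the digest
ACTIVE_PARAMS = 49e9    # parameters used per token, from the digest
GPU_MEMORY_GB = 80      # assumption: one 80 GB accelerator


def weight_gb(params, bits):
    """Gigabytes needed to hold `params` weights at `bits` per weight."""
    return params * bits / 8 / 1e9


for bits in (16, 8, 4):
    total = weight_gb(TOTAL_PARAMS, bits)
    active = weight_gb(ACTIVE_PARAMS, bits)
    min_gpus = math.ceil(total / GPU_MEMORY_GB)
    print(
        f"{bits:2d}-bit: all experts {total:6.0f} GB, "
        f"active path {active:5.1f} GB, >= {min_gpus} GPUs for weights alone"
    )
```

Even at 4 bits per weight the full expert set is far larger than a single device under this assumption, while the active path on its own would fit. That gap is what makes the compression work above relevant.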
Why it matters: V4 Pro is the first open MoE that plausibly competes with GPT-5 class models on reasoning. Open weights at this scale change the inference deployment calculus for anyone not running on closed APIs.
Research angle: (1) V4 Pro + TurboQuant: 2.5-3.5 bit KV compression was demonstrated on smaller MoEs. The activation sparsity pattern in V4 Pro is different. What is the quality-compression tradeoff at 49B active? (2) Distillation from V4 Pro: open weights enable white-box distillation, as they did for V3, but now from a 1.6T-parameter teacher. The first paper to characterize V4 Pro's internal representations will define the next wave of specialist models.
Ctx2Skill — Self-Evolving Skill Extraction for Context Learning
A multi-agent self-play loop that autonomously discovers, refines, and selects context-specific skills without human supervision or external feedback. Challenger generates probing tasks. Reasoner attempts solutions guided by an evolving skill set. Judge provides binary feedback.
Source: HuggingFace Daily Papers
Links: Paper · Wiki
Tier: 2 (agent architecture, skill extraction, context learning)
CTX2SKILL SELF-PLAY LOOP
──────────────────────────────────────────────────────
Challenger  → generates probing tasks + rubrics from context
Reasoner    → attempts tasks using evolving skill set
Judge       → binary feedback (success / failure)
On failure:
  Proposer  → analyzes failure, generates skill update proposals
  Generator → synthesizes proposals into targeted skill updates
Cross-time Replay mechanism:
  Selects skill set with best balance across representative cases
  (prevents adversarial collapse from increasingly extreme Challenger tasks)
──────────────────────────────────────────────────────
Output: portable skills pluggable into any LM for better context learning
The core problem Ctx2Skill attacks is that real-world tasks often depend on long, technically dense contexts where the relevant knowledge cannot be memorized parametrically. The solution — extracting reusable "skills" from that context — is intuitive, but building those skills automatically is hard. Manual annotation is expensive, and automated pipelines need feedback. Ctx2Skill's self-play loop sidesteps both problems by having agents generate their own probing tasks, fail on them, and synthesize the failures into skill updates.
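The control flow of one self-play round can be sketched in a few lines. This is a skeleton under stated assumptions, not the paper's implementation: the five roles are passed in as plain callables, and the toy stand-ins below (a unit-conversion task and a skill set represented as strings) are hypothetical and exist only so the loop runs end to end.

```python
def self_play_round(context, skills, challenger, reasoner, judge, proposer, generator):
    """One round: probe, attempt, judge, and update skills only on failure."""
    task, rubric = challenger(context, skills)
    answer = reasoner(context, task, skills)
    if judge(task, rubric, answer):
        return skills, True
    proposals = proposer(task, answer, skills)
    return generator(skills, proposals), False


# Toy stand-ins for the five roles (all hypothetical).
def challenger(context, skills):
    return "convert 3 km to m", "1 km = 1000 m"


def reasoner(context, task, skills):
    return "3000 m" if "unit-conversion" in skills else "3 m"


def judge(task, rubric, answer):
    return answer == "3000 m"  # binary feedback


def proposer(task, answer, skills):
    return ["unit-conversion"]


def generator(skills, proposals):
    return skills | set(proposals)


skills = set()
history = []
for _ in range(2):
    skills, passed = self_play_round(
        "spec text", skills, challenger, reasoner, judge, proposer, generator
    )
    history.append(passed)
print(history, skills)
```

The first round fails and adds a skill. The second round passes with the updated skill set, which is the failure-driven update the diagram describes.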
The Cross-time Replay mechanism is the part worth keeping. Without it, the Challenger generates increasingly pathological probing tasks as the skill set improves, and both agents overfit to an adversarial game rather than generalizing. Replay forces skill selection to stay grounded across a representative sample of earlier cases. This is the same role that a held-out evaluation set plays in standard training, implemented as a within-loop mechanism.
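The digest does not say how the paper scores "best balance," so the following is one possible reading rather than the paper's criterion. It assumes replayed cases are grouped by when they were generated and picks the skill-set snapshot with the highest worst-group pass rate, breaking ties by mean pass rate. `select_skill_set`, the snapshot labels, and the `solves` predicate are hypothetical, and every group is assumed to be non-empty.

```python
def select_skill_set(snapshots, replay_cases, solves):
    """Pick the snapshot with the best worst-group pass rate on replayed cases.

    snapshots:    {label: skill_set}
    replay_cases: {group: [case, ...]} sampled from earlier rounds (non-empty)
    solves:       callable (skill_set, case) -> bool
    """
    def score(skill_set):
        rates = [
            sum(solves(skill_set, case) for case in cases) / len(cases)
            for cases in replay_cases.values()
        ]
        return (min(rates), sum(rates) / len(rates))

    return max(snapshots, key=lambda label: score(snapshots[label]))


snapshots = {
    "round_3": {"units", "dates"},
    "round_9": {"edge_case_hacks", "units_obscure"},
}
replay_cases = {
    "early": ["units", "dates"],
    "late": ["edge_case_hacks", "units_obscure", "units"],
}


def solves(skill_set, case):
    return case in skill_set  # toy predicate: a case is solved by a same-named skill


chosen = select_skill_set(snapshots, replay_cases, solves)
print(chosen)
```

The later snapshot does better on the late, more extreme cases but solves none of the early ones, so the worst-group criterion keeps the earlier, more balanced skill set.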
The connection to the wiki's agent harness thread is direct. Ken Huang's pentester study (05-05) showed that belief-state propagation — accumulating structured intermediate conclusions rather than raw tool outputs — is what separates capable agents from pattern-matching wrappers. Ctx2Skill is building that belief state at a different granularity: not per-step evidence accumulation but per-context skill crystallization. The Reasoner's skill set is a belief state about what this specific context requires. The two architectures are complementary and have not been combined.
Why it matters: Ctx2Skill is the first clean self-supervised approach to skill extraction that includes a stability mechanism (Cross-time Replay) to prevent adversarial collapse. On CL-Bench, it consistently improves solving rates across backbone models. The plug-in portability means it can be layered onto any existing agent without retraining.
Industry Pulse
OpenAI/Microsoft renegotiation closed. (Last Week in AI) Microsoft's open-ended exclusivity is replaced with a nonexclusive license through 2032. OpenAI can now deploy across AWS, Google Cloud, and any other provider. Microsoft stops paying OpenAI a revenue share; OpenAI continues paying Microsoft through 2030 subject to a cap. Microsoft retains roughly 27% equity. AWS Bedrock gets OpenAI models and the upcoming Stateful Runtime Environment, the infrastructure layer for long-running agents. This is the deal that makes the Anthropic/AWS JV and OpenAI/AWS move structurally parallel: both frontier labs are now multi-cloud, and OpenAI is no longer Microsoft-exclusive.
TobyPhln's three-year xAI retrospective. (Twitter/@TobyPhln) Toby Pohlen, founding xAI engineer and former DeepMind researcher, published a candid post-mortem on his first three years at xAI. Key admissions: he built the API as the first product because of technical interest, not strategic logic; he wishes he had been more vocal on production reliability and security roadmaps; and he acknowledges tension between engineering-focused thinking and founder-focused decision-making. The post is primarily an organizational reflection, not a technical disclosure. Worth noting because it is the clearest inside view of xAI's early engineering culture from someone who has now stepped back to assess.
Grok Imagine adds aspect ratio editing. (Twitter/@imagine) xAI's image product can now change the aspect ratio of any uploaded photo. Minor product update but confirms xAI's image product roadmap is active alongside the language model track.
ProgramBench: 0% for every tested LLM. (Twitter/@deedydas) The creators of SWE-Bench dropped a new benchmark: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet access? Every tested model scores 0%. This is not a reasoning benchmark. It tests long-form code generation where the output must be coherent enough to actually run. The wiki's benchmark cluster now has four data points in one week: PhysicianBench (46%), AcademiClaw (55%), ProgramBench (0%), and the six-domain capability gaps in T^2PO's evaluation suite. Each benchmark probes a different failure mode; together they map the terrain.
Connecting the Dots
The agent security paper and the benchmark cluster compose into a single argument this week.
BENCHMARK CLUSTER (capability ceiling, this week)
────────────────────────────────────────────────
PhysicianBench (05-05)     46%    clinical EHR workflows
AcademiClaw (05-05)        55%    academic-level multi-step tasks
ProgramBench (05-06)       0%     recreate real programs from scratch
T^2PO eval suite (05-05)   ~60%   WebShop / ALFWorld / Search QA
SECURITY AUDIT (production deployment, 05-06)
──────────────────────────────────────────────
Tool-chaining attacks    91% vulnerable
Goal drift at step 30    89.4%
Memory poisoning         94% (memory-augmented agents)
Real incident            770K agents compromised simultaneously
Clark's Import AI timeline (05-05): 60% P(full auto AI R&D) by end 2028
The capability papers say agents cannot do hard tasks reliably. The security paper says agents in production are already being compromised on easy tasks. Clark's timeline says the engineering iteration loop — the easy part, not the hard part — is what closes first. These three arguments are not in tension. They are a coherent picture: the engineering-schlep automation Clark describes does not require solving PhysicianBench. It requires not getting tool-chain-attacked while running the repetitive parts of the eval loop. The security paper is the more pressing constraint on the timeline, not the capability papers.
The Musk distillation admission connects directly to Lambert's Distillation Panic (05-05). Lambert argued that "distillation attack" is a misnomer that will entangle legitimate practice. The trial just produced sworn testimony from the plaintiff that xAI itself uses the technique. Worth Watching from 05-05 predicted that the trial would surface more technical admissions. It did, one week in.
Cross-day thread: T^2PO (05-05) addresses training-time instability in multi-turn agents by detecting low-information action chains. The Marcus paper (05-06) shows that 89.4% of production agents exhibit goal drift after 30 steps. The mechanisms are adjacent: T^2PO's per-turn exploration signal is exactly what would need to be active to detect the drift the Marcus paper measures. Whether T^2PO-trained agents show reduced drift in production is the obvious empirical question. No one has tested it.
Worth Watching
Agent security vs. harness design, 60-day window. The Marcus paper established 91% tool-chaining vulnerability as the production baseline. The harness design papers (Ken Huang pentester 05-05, T^2PO 05-05, Step-Level Optimization 05-02) all address capability without touching adversarial settings. The first paper that tests T^2PO or Step-Level Optimization under tool-chaining attack conditions will either close this gap or confirm it is structural. If structural, Clark's 2028 timeline needs a security addendum.
DeepSeek V4 Pro + TurboQuant/BLD distillation wave. Open weights at 1.6T/49B active will produce compressed variants within 4-6 weeks. The question is whether the 3% activation ratio in V4 Pro changes the compression tradeoff relative to denser architectures. First paper to characterize V4 Pro's internal representations at scale sets the next wave.
OpenAI IPO signal. Brockman's testimony floated an IPO at $850B valuation. Combined with the Microsoft deal restructuring and the Deployment Company fundraise, the signals point at a public offering within 12-18 months. If OpenAI goes public, Anthropic is the only frontier lab with no stated path to liquidity. Watch for Anthropic's next funding announcement or any hint of a SPAC/strategic combination.
Musk v. Altman, week 3. Altman and Shivon Zilis are scheduled to testify. The founding documents are the key evidence. The question is whether the nonprofit charter language is tight enough to establish breach, or whether the for-profit conversion was legally defensible. The distillation admission has already produced one technically important datapoint. More will follow.
Quick Hits
Ctx2Skill across backbones. Ctx2Skill's self-evolving skill loop improves solving rates on CL-Bench across all tested backbones. The plug-in portability without retraining is the practical hook. (Paper)
PFlowNet: RL-grounded visual reasoning. Perceptual Flow Network separates perception from reasoning, combining multi-dimensional rewards with geometric shaping via variational RL. New state-of-the-art on V* Bench (90.6%) and MME-RealWorld-lite (67.0%). Tier 3 multimodal, but the variational RL mechanism is adjacent to T^2PO's uncertainty-based intervention. (Paper)
OceanPile. A large-scale multimodal ocean corpus with sonar data, underwater imagery, and marine text, built around an Ocean Concept Knowledge Graph for instruction alignment. Scientific domain MLLMs are a useful existence proof that multimodal fine-tuning generalizes far outside the standard vision-language benchmarks. Tier 4 otherwise. (Paper)
ComboStoc. Diffusion models trained on high-dimensional data with structured attributes underfit the combinatorial complexity. ComboStoc introduces stochastic processes over combinatorial structures, enabling asynchronous timestep generation per dimension. Training speedup demonstrated on images and 3D shapes. Tier 4 for this wiki but the combinatorial-structure framing is interesting for anyone working on structured generation. (Paper)
Designing Data-Intensive Applications, 2nd edition. The Pragmatic Engineer ran an excerpt from Martin Kleppmann and Chris Riccomini's update to the 2017 classic. Chapter 1 covers cloud vs. self-hosting tradeoffs. Not AI-specific, but the build-vs-buy framing applies directly to the inference infrastructure decisions labs are making now. Worth reading if you work on deployment.
Sources ingested today: 16 | Wiki pages updated: 7