agentic-systems · 2026-05-04 · Tier 1

Why Your Agentic AI Pentester Is Probably Just a Fancy Scanner — Ken Huang

Source: Agentic AI / Ken Huang Substack, 2026-05-04 · Post
Raw: raw/rss/2026-05-04-agentic-ai-why-your-agentic-ai-pentester-is-probably-just-a-fancy.md
Tier: 1 (agent architecture, security tooling)

TL;DR

Ken Huang dissects a Ridge Security benchmark of three agentic pentesters (RidgeGen, Shannon, Strix) on OWASP Juice Shop. All three used the same Gemini 3 Flash model, so the variable under test is system architecture, not model capability. The numbers expose three architectural failure modes:

  • Belief state amnesia. Shannon and Strix treat each tool call as independent. RidgeGen maintains persistent belief state: when JWT alg:none is confirmed, the system updates its model of the application's authentication and reprioritizes its testing. Result: a cascade of 12 IDOR findings, mass assignment, and vertical privilege escalation, all from one initial JWT bypass.
  • Evidence validation as architectural invariant vs best-effort output. RidgeGen produced 55 findings, all evidence-backed, for a 0% hallucination rate. Shannon produced 27 findings, of which 17 were unconfirmed (template descriptions of vulnerability classes), a 63% hallucination rate (17/27). The architecture either gates output on evidence collection or it does not.
  • Semantic reasoning vs syntactic pattern matching. Only RidgeGen found the negative-quantity-basket race condition (finding it requires understanding the application's financial transaction model). Pattern-matching tools find SQL injection; semantic tools find business-logic violations.
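
The belief-state bullet can be made concrete with a small sketch. The public post does not describe RidgeGen's actual data structures, so everything here is hypothetical: the `BeliefState` class, the `FOLLOW_UPS` table, the test names, and the priority weights are illustrative assumptions. The only point is the mechanism: a confirmed finding changes what gets tested next.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BeliefState:
    """Hypothetical persistent model of the target, shared across tool calls."""
    confirmed: set = field(default_factory=set)
    # Pending test name -> priority (higher runs sooner).
    pending: dict = field(default_factory=dict)

    def confirm(self, finding: str, boosts: dict) -> None:
        """Record a confirmed finding and raise the priority of follow-up tests."""
        self.confirmed.add(finding)
        for test, delta in boosts.items():
            self.pending[test] = self.pending.get(test, 0) + delta

    def next_test(self) -> Optional[str]:
        """Remove and return the highest-priority pending test, or None if empty."""
        if not self.pending:
            return None
        test = max(self.pending, key=self.pending.get)
        del self.pending[test]
        return test


# Illustrative follow-up table; names and weights are assumptions, not from the post.
FOLLOW_UPS = {
    "jwt_alg_none": {"idor_sweep": 10, "mass_assignment": 8, "vertical_privesc": 7},
}

state = BeliefState(pending={"sqli_login": 5, "idor_sweep": 1, "xss_search": 3})
state.confirm("jwt_alg_none", FOLLOW_UPS["jwt_alg_none"])

order = []
while (test := state.next_test()) is not None:
    order.append(test)

print(order)
# ['idor_sweep', 'mass_assignment', 'vertical_privesc', 'sqli_login', 'xss_search']
```

A stateless harness would have run `sqli_login` first and left `idor_sweep` for last. With the state carried across calls, the authentication bypass pulls the access-control tests to the front of the queue.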

Token efficiency: Shannon 2138K tokens per finding (with 63% requiring manual validation); RidgeGen 846K tokens per confirmed finding.
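
The headline ratios can be checked from the quoted figures. One assumption is mine and not stated in the post: Shannon's 2138K is read as an average over all 27 reported findings, so the cost per confirmed finding is scaled up by 27/10. If the post computed it differently, the last two numbers change.

```python
# Figures quoted in this summary.
shannon_reported = 27
shannon_unconfirmed = 17
ridgegen_confirmed = 55
shannon_k_tokens_per_reported = 2138   # thousand tokens per reported finding
ridgegen_k_tokens_per_confirmed = 846  # thousand tokens per confirmed finding

shannon_confirmed = shannon_reported - shannon_unconfirmed
hallucination_rate = shannon_unconfirmed / shannon_reported
confirmed_ratio = ridgegen_confirmed / shannon_confirmed

# Assumption: 2138K is averaged over all reported findings, so the total token
# spend is 2138K * 27, which is then spread over the 10 confirmed findings.
shannon_k_tokens_per_confirmed = (
    shannon_k_tokens_per_reported * shannon_reported / shannon_confirmed
)
cost_ratio = shannon_k_tokens_per_confirmed / ridgegen_k_tokens_per_confirmed

print(f"Shannon confirmed findings: {shannon_confirmed}")                       # 10
print(f"Shannon hallucination rate: {hallucination_rate:.0%}")                  # 63%
print(f"Confirmed-findings ratio:   {confirmed_ratio:.1f}x")                    # 5.5x
print(f"Shannon K tokens/confirmed: {shannon_k_tokens_per_confirmed:.1f}")      # 5772.6
print(f"Cost ratio per confirmed:   {cost_ratio:.1f}x")                         # 6.8x
```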

Why it matters

This is the cleanest empirical demonstration the wiki has of the "harness > model" claim. GTA-2 (04-20) named the principle abstractly. The benchmark Huang analyzes instantiates it concretely: three architectures, one model held constant, and a gap of more than 5x in evidence-backed findings (55 for RidgeGen versus 10 confirmed for Shannon, i.e. 27 reported minus 17 unconfirmed).

For Tier 1 routing and agent design, the implications are direct:

  • Belief state is the missing routing input. Step-Level Optimization (05-02) routes based on trajectory state (Stuck/Milestone monitors); these monitors require a belief state to evaluate. Most production routers do not maintain it explicitly.
  • Evidence-validation-as-invariant is what AHE (05-04) calls a contract: the architecture's output is gated on a structural property, not a probabilistic check. AHE gates harness decisions on benchmark contracts; RidgeGen gates findings on execution evidence. Same architectural primitive.
  • Cascading exploitation is the security analog of Step-Level Optimization's escalation: confirm a vulnerability → reprioritize the search. Routing to escalate vs routing to expand are dual operations on the same trajectory state.
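
The difference between a structural gate and a probabilistic check can be sketched in a few lines. This is a minimal illustration, not RidgeGen's or AHE's implementation: the `Evidence` and `Finding` types and the `report` helper are hypothetical names. The invariant lives in the constructor, so a finding without evidence cannot exist at all, rather than being filtered out later by a scoring step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record of one executed request and what came back."""
    request: str
    response_excerpt: str


@dataclass(frozen=True)
class Finding:
    title: str
    evidence: tuple

    def __post_init__(self) -> None:
        # Structural gate: constructing a finding without evidence is an error.
        if not self.evidence:
            raise ValueError(f"finding {self.title!r} has no execution evidence")


def report(candidates):
    """Split (title, evidence_list) pairs into valid findings and rejected titles."""
    findings, rejected = [], []
    for title, evidence in candidates:
        try:
            findings.append(Finding(title, tuple(evidence)))
        except ValueError:
            rejected.append(title)
    return findings, rejected


candidates = [
    ("JWT alg:none accepted", [Evidence("GET /rest/user/whoami", "200 OK")]),
    ("Generic XSS description", []),  # template text, nothing was executed
]
findings, rejected = report(candidates)
print([f.title for f in findings])  # ['JWT alg:none accepted']
print(rejected)                     # ['Generic XSS description']
```

The request path and response text are made-up placeholders. The design choice being illustrated is that the reporting path has no branch through which an unevidenced finding can reach the output.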

Connections

  • GTA-2 (2026-04-20): execution harness dominates model capability. RidgeGen vs Shannon vs Strix with the model held constant is the cleanest experimental confirmation.
  • AHE (2026-05-04): contract-based decisions. Evidence-validation-as-invariant is the security instantiation.
  • Defense Trilemma (2026-05-04): the Layer 2 (Agent Orchestration Layer) failure modes are exactly belief-state amnesia and the absence of trust propagation. The trilemma argues no single defense is complete; this article shows that even on the offense side, single-architecture systems miss compound vulnerabilities.
  • Step-Level Optimization (2026-05-02): trajectory-aware routing in computer-use agents. The trust-propagation pattern Huang describes is structurally identical: confirm an event, update belief state, reprioritize. Two domains, one mechanism.
  • Ken Huang Ch 14 routing + Ch 15 structured output (2026-05-01/04): the same author's harness architecture series. This piece applies those harness principles to a security domain. The Ch 14 fallback chain and Ch 15 schema-identity caching are infrastructure-level architectural invariants; RidgeGen's belief-state propagation is the application-level architectural invariant.

Research angle (Tier 1)

  1. Belief state representation as a measurable harness property. Today every harness either has it or does not, but no public standard exists for "what counts as belief state." A formal definition (data structures, propagation rules, query interface) would let researchers compare harnesses on this dimension directly.
  2. Trust propagation in non-security agents. The cascading exploitation pattern works in pentesting because vulnerabilities compose. Whether the same propagation pattern transfers to non-security domains (debug-then-test, read-then-edit, plan-then-act) is largely unmeasured.
  3. Architecture-vs-model decomposition methodology. Ridge Security's same-model-different-architecture methodology is the right experimental design for harness research; it should become standard. Most agent benchmarks today vary both axes simultaneously.

Open questions

  • The benchmark is single-run on a single target (OWASP Juice Shop). Variance and generalization to real-world targets are unaddressed.
  • Ridge Security, which ran the benchmark, sponsored the post. The disclosure is upfront, but the comparative numbers should be replicated by independent evaluators before they become canonical.
  • Whether RidgeGen's architecture transfers without the benchmark's specific affordances (Docker sandbox, isolated network, stable API surface) is open.
  • Belief state representation is described conceptually but not in code; the load-bearing technical details (data structure, propagation rules) are not in the public post.