Why Your Agentic AI Pentester Is Probably Just a Fancy Scanner — Ken Huang

Source: Agentic AI / Ken Huang Substack, 2026-05-04 · Post
Raw: raw/rss/2026-05-04-agentic-ai-why-your-agentic-ai-pentester-is-probably-just-a-fancy.md
Tier: 1 (agent architecture, security tooling)

TL;DR

Ken Huang dissects a Ridge Security benchmark of three agentic pentesters (RidgeGen, Shannon, Strix) on OWASP Juice Shop. All three used the same Gemini 3 Flash model, so the variable under test is system architecture, not model capability. The numbers expose three architectural failure modes:

Belief state amnesia. Shannon and Strix treat each tool call as independent. RidgeGen maintains persistent belief state — when JWT alg:none is confirmed, the system updates its model of the application's authentication and reprioritizes its testing. Result: a cascade of 12 IDOR findings, mass assignment, vertical privilege escalation, all from one initial JWT bypass.
Evidence validation as architectural invariant vs best-effort output. RidgeGen produced 55 findings, all evidence-backed, a 0% hallucination rate. Shannon produced 27 findings, of which 17 were unconfirmed (template descriptions of vulnerability classes), a 63% hallucination rate. The architecture either gates output on evidence collection or it does not.
Semantic reasoning vs syntactic pattern matching. Only RidgeGen found the negative-quantity-basket race condition, which requires the system to understand the application's financial transaction model. Pattern-matching tools find SQL injection; semantic tools find business-logic violations.
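The first two failure modes can be made concrete with a minimal sketch. The post does not publish RidgeGen's data structures, so everything here is an assumption for illustration: the `BeliefState` and `Finding` classes, the test names, and the priority numbers are all hypothetical. The sketch shows only the shape of the idea: a confirmed fact reprioritizes pending tests, and a finding without evidence never reaches the report.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Finding:
    title: str
    # Execution evidence, e.g. captured request/response pairs (hypothetical format).
    evidence: List[str] = field(default_factory=list)


@dataclass
class BeliefState:
    """Hypothetical persistent model of the target, shared across tool calls."""
    confirmed: Set[str] = field(default_factory=set)
    priorities: Dict[str, float] = field(default_factory=dict)

    def confirm(self, fact: str, boosts: Dict[str, float]) -> None:
        # Record the fact, then raise the priority of every test it makes more promising.
        self.confirmed.add(fact)
        for test, delta in boosts.items():
            self.priorities[test] = self.priorities.get(test, 0.0) + delta

    def next_test(self) -> Optional[str]:
        # Return the highest-priority pending test, or None if nothing is queued.
        if not self.priorities:
            return None
        return max(self.priorities, key=self.priorities.get)


def gate_findings(candidates: List[Finding]) -> List[Finding]:
    # Evidence gate: the report is built only from findings that carry evidence.
    return [f for f in candidates if f.evidence]


# Before the JWT bypass is confirmed, SQL injection is the top-ranked test.
state = BeliefState(priorities={"sqli_login": 0.5, "idor_basket": 0.2, "mass_assignment": 0.1})
before = state.next_test()

# Confirming alg:none shifts attention toward authorization flaws.
state.confirm("jwt_alg_none", boosts={"idor_basket": 0.6, "mass_assignment": 0.5})
after = state.next_test()

reported = gate_findings([
    Finding("JWT alg:none accepted", evidence=["HTTP 200 with forged token"]),
    Finding("Possible XSS (template description only)"),
])
print(before, "->", after, "| reported:", [f.title for f in reported])
```

An architecture without the shared `BeliefState` would keep testing in its original order, and one without `gate_findings` would report both candidates.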

Token efficiency: Shannon 2138K tokens per finding (with 63% requiring manual validation); RidgeGen 846K tokens per confirmed finding.
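The headline ratios follow from the quoted counts, and a short calculation makes the derivation explicit. One step rests on an assumption the summary does not confirm: that Shannon's 2138K figure is averaged over all 27 findings rather than only the confirmed ones. Under that assumption, the cost per confirmed Shannon finding can be estimated as below; treat it as a rough derived number, not a figure from the benchmark.

```python
# Figures quoted in this summary.
shannon_findings = 27
shannon_unconfirmed = 17
ridgegen_confirmed = 55
shannon_ktokens_per_finding = 2138
ridgegen_ktokens_per_confirmed = 846

# Directly derivable from the counts.
shannon_confirmed = shannon_findings - shannon_unconfirmed
hallucination_rate = shannon_unconfirmed / shannon_findings
confirmed_ratio = ridgegen_confirmed / shannon_confirmed

# ASSUMPTION: 2138K tokens is an average over all 27 Shannon findings.
shannon_total_ktokens = shannon_ktokens_per_finding * shannon_findings
shannon_ktokens_per_confirmed = shannon_total_ktokens / shannon_confirmed
cost_ratio = shannon_ktokens_per_confirmed / ridgegen_ktokens_per_confirmed

print(f"Shannon confirmed findings: {shannon_confirmed}")
print(f"Shannon hallucination rate: {hallucination_rate:.0%}")
print(f"RidgeGen / Shannon confirmed findings: {confirmed_ratio:.1f}x")
print(f"Estimated Shannon Ktokens per confirmed finding: {shannon_ktokens_per_confirmed:.1f}")
print(f"Estimated cost ratio per confirmed finding: {cost_ratio:.1f}x")
```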

Why it matters

This is the cleanest empirical demonstration the wiki has of the "harness > model" claim. GTA-2 (04-20) named the principle abstractly. Ken Huang's comparison instantiates it concretely: three architectures, the same model held constant, and a measurable gap of more than 5x in evidence-backed findings (55 for RidgeGen against 10 confirmed for Shannon).

For Tier 1 routing and agent design, the implications are direct:

Belief state is the missing routing input. Step-Level Optimization (05-02) routes based on trajectory state (Stuck/Milestone monitors); these monitors require a belief state to evaluate. Most production routers do not maintain it explicitly.
Evidence-validation-as-invariant is what AHE (05-04) calls a contract: the architecture's output is gated on a structural property, not a probabilistic check. AHE gates harness decisions on benchmark contracts; RidgeGen gates findings on execution evidence. Same architectural primitive.
Cascading exploitation is the security analog of Step-Level Optimization's escalation: confirm a vulnerability → reprioritize the search. Routing to escalate vs routing to expand are dual operations on the same trajectory state.
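The "dual operations" framing can be sketched as a single router that reads one trajectory state and returns one of two moves. This is a hypothetical illustration, not the mechanism of Step-Level Optimization or RidgeGen: the `TrajectoryState` fields, the `stuck_after` threshold, and the mapping of monitors to actions are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    ESCALATE = "escalate"  # hand the step to a stronger policy when progress stalls
    EXPAND = "expand"      # widen the search around a newly confirmed result


@dataclass
class TrajectoryState:
    steps_without_progress: int  # what a hypothetical "Stuck" monitor would read
    new_confirmations: int       # what a hypothetical "Milestone" monitor would read


def route(state: TrajectoryState, stuck_after: int = 3) -> Action:
    # A fresh confirmation takes precedence: follow the cascade it opens up.
    if state.new_confirmations > 0:
        return Action.EXPAND
    # No confirmations and no progress for a while: escalate.
    if state.steps_without_progress >= stuck_after:
        return Action.ESCALATE
    return Action.CONTINUE


decisions = [
    route(TrajectoryState(steps_without_progress=0, new_confirmations=1)),
    route(TrajectoryState(steps_without_progress=4, new_confirmations=0)),
    route(TrajectoryState(steps_without_progress=1, new_confirmations=0)),
]
print([d.value for d in decisions])
```

Both branches read the same state object, which is the point of the comparison: without a maintained state, neither monitor has anything to evaluate.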

Connections

GTA-2 (2026-04-20) — execution harness dominates model capability. RidgeGen vs Shannon vs Strix at constant model is the cleanest experimental confirmation.
AHE (2026-05-04) — contract-based decisions. Evidence-validation-as-invariant is the security instantiation.
Defense Trilemma (2026-05-04) — Layer 2 (Agent Orchestration Layer) failure modes are exactly belief-state amnesia and absence of trust propagation. The trilemma argues no single defense is complete; this article shows that even on the offense side, single-architecture systems miss compound vulnerabilities.
Step-Level Optimization (2026-05-02) — trajectory-aware routing in computer-use agents. The trust-propagation pattern Huang describes is structurally identical: confirm an event, update belief state, reprioritize. Two domains, one mechanism.
Ken Huang Ch 14 routing + Ch 15 structured output (2026-05-01/04) — the same author's harness architecture series. This piece is the application of those harness principles to a security domain. The Ch 14 fallback chain and Ch 15 schema-identity caching are infrastructure-level architectural invariants; RidgeGen's belief-state propagation is the application-level architectural invariant.

Research angle (Tier 1)

Belief state representation as a measurable harness property. Today every harness either has it or does not, but no public standard exists for "what counts as belief state." A formal definition (data structures, propagation rules, query interface) would let researchers compare harnesses on this dimension directly.
Trust propagation in non-security agents. The cascading exploitation pattern works in pentesting because vulnerabilities compose. Whether the same propagation pattern transfers to non-security domains (debug-then-test, read-then-edit, plan-then-act) is largely unmeasured.
Architecture-vs-model decomposition methodology. Ridge Security's same-model-different-architecture methodology is the right experimental design for harness research; it should become standard. Most agent benchmarks today vary both axes simultaneously.
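To make the first research angle tangible, here is one possible shape for a belief-state interface that harnesses could be checked against. No such standard exists, as noted above, so the `BeliefStore` protocol, its two methods, and the rule format are purely hypothetical; the sketch only shows that "data structures, propagation rules, query interface" can be stated in a few lines and tested mechanically.

```python
from typing import Dict, Protocol, Set, runtime_checkable


@runtime_checkable
class BeliefStore(Protocol):
    """Hypothetical minimal interface a harness would expose."""

    def observe(self, fact: str) -> None: ...

    def query(self, hypothesis: str) -> float: ...


class RuleBasedBeliefs:
    """Toy implementation: each observed fact adds weight to the hypotheses it implies."""

    def __init__(self, rules: Dict[str, Dict[str, float]]) -> None:
        self._rules = rules               # propagation rules: fact -> {hypothesis: weight}
        self._facts: Set[str] = set()     # data structure: facts observed so far
        self._scores: Dict[str, float] = {}

    def observe(self, fact: str) -> None:
        if fact in self._facts:
            return  # observing the same fact twice must not double-count it
        self._facts.add(fact)
        for hypothesis, weight in self._rules.get(fact, {}).items():
            updated = self._scores.get(hypothesis, 0.0) + weight
            self._scores[hypothesis] = min(1.0, updated)  # cap scores at 1.0

    def query(self, hypothesis: str) -> float:
        # Query interface: unknown hypotheses score 0.0.
        return self._scores.get(hypothesis, 0.0)


class StatelessHarness:
    """A harness that treats every tool call as independent exposes neither method."""


store = RuleBasedBeliefs({"jwt_alg_none": {"idor_likely": 0.7}})
score_before = store.query("idor_likely")
store.observe("jwt_alg_none")
store.observe("jwt_alg_none")
score_after = store.query("idor_likely")

has_belief_state = isinstance(store, BeliefStore)
stateless_has_it = isinstance(StatelessHarness(), BeliefStore)
print(score_before, score_after, has_belief_state, stateless_has_it)
```

A structural check like `isinstance(..., BeliefStore)` only verifies that the methods exist; comparing harnesses on this dimension would also need behavioral tests such as the repeated-observation case above.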

Open questions

The benchmark is single-run on a single target (OWASP Juice Shop). Variance and generalization to real-world targets are unaddressed.
Ridge Security, the benchmark's author, sponsored the post. The disclosure is upfront, but the comparative numbers should be replicated by independent evaluators before they become canonical.
Whether RidgeGen's architecture transfers without the benchmark's specific affordances (Docker sandbox, isolated network, stable API surface) is open.
Belief state representation is described conceptually but not in code; the load-bearing technical details (data structure, propagation rules) are not in the public post.