llms-foundation-models · 2026-04-19 · Tier 2

ASGuard: Mechanistic Defense Against Targeted Jailbreaking

ASGuard: Mechanistic Defense Against Targeted Jailbreaking

Date: 2026-04-19
Source: HuggingFace Daily Papers
Paper: arxiv 2509.25843
Raw: raw/huggingface/2026-04-19-asguard-activation-scaling-guard-mitigate-targeted-jailbreaking.md


TL;DR

ASGuard uses circuit analysis to identify specific attention heads responsible for refusal behavior, then trains a channel-wise activation scaling vector to harden those heads against targeted jailbreaks (e.g. tense-changing attacks that rephrase harmful requests in past tense). Preventative fine-tuning with this vector achieves Pareto-optimal safety/utility tradeoff across four LLMs.


Key Findings

  • Tense-changing jailbreak: models that refuse harmful requests in present tense often comply when the same request is rephrased in past tense — a surprisingly simple and effective attack
  • Mechanistic root cause: adversarial suffixes suppress propagation of refusal-mediating activation directions in specific attention heads
  • Circuit analysis: ASGuard identifies the exact attention heads causally responsible for the tense-vulnerable refusal path
  • Activation scaling: a channel-wise scaling vector is trained to recalibrate those specific heads without touching the rest of the model
  • Preventative fine-tuning: the scaling is applied as a fine-tuning signal, forcing the model to learn a more robust refusal mechanism
  • Results: attack success rate drops across four LLMs; general capabilities and "helpfulness" preserved; less over-refusal than prior defenses

Why It Matters

Most safety defenses work at the input level (filtering) or output level (classifier). ASGuard works at the mechanism level — it finds the exact circuit responsible for a vulnerability and repairs it surgically. This is the mechanistic interpretability approach applied to alignment, not just to understanding.

The tense-changing attack is a good stress test because it's a minimal transformation — the semantic content is identical, only the grammatical tense changes. A model that is truly aligned on the concept of harm should be robust to this. Models that aren't are revealing that their refusal is surface-pattern-matching, not semantic.


Related Pages