WriteSAE: sparse autoencoders for the recurrent matrix cache write
Source: HuggingFace Daily Papers · 2026-05-14
Paper: arXiv 2605.12770
Raw: raw
Tier: 2. Interpretability, recurrent/state-space models, mechanistic intervention
TL;DR
Residual-stream SAEs decompose the residual stream of a transformer, but they cannot reach the matrix-recurrent write of state-space and hybrid models such as Gated DeltaNet, Mamba-2, and RWKV-7. Those models write to a d_k × d_v cache through rank-1 updates k_t v_t^T, which no vector atom can replace. WriteSAE factors each decoder atom into the native rank-1 write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of 4,851 firings at Qwen3.5-0.8B L9 H4. On Mamba-2-370M the substitution rate is 88.1%. Sustained installs at three positions lift mid-rank target-in-continuation from 33.3% to 100% under greedy decoding, reported as the first behavioral install at the matrix-recurrent write site.
Why it matters
The interpretability thread in the wiki (Anthropic's natural language autoencoders, First Token Knows hallucination detection, and the sycophancy-lying shared circuit, Kurate cs.LG #11 from 2026-05-12) has been moving from "find features" to "find features that can be installed." WriteSAE extends this thread into the part of the architecture that SAEs built for transformer residual streams structurally cannot reach: the rank-1 cache updates of state-space and linear-attention models. As the wiki has tracked, hybrid Mamba/DeltaNet architectures have become the default for long-context small models (MDN 2026-05-11, Nemotron-3 Super 2026-04-21, Qwen 3.6 35B-A3B practitioner reports). Before WriteSAE, the cache-write site of those models was out of reach for SAE-based mechanistic interpretability.
Mechanism
The structural problem: standard SAE atoms are vectors. They can substitute into a residual stream because the stream is a vector. But state-space and hybrid models write to a matrix cache via rank-1 outer products k_t v_t^T. A vector atom can't replace a rank-1 outer product.
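The shape mismatch can be made concrete with a minimal sketch in plain Python. It assumes a simplified additive write S_t = S_{t-1} + k_t v_t^T with no gating or decay, which is a deliberate simplification of the real Gated DeltaNet, Mamba-2, and RWKV-7 update rules; all names (`outer`, `add`, `vector_atom`) are illustrative and not taken from the paper.

```python
# Minimal sketch of a matrix-recurrent write (simplified: no gating or decay).

def outer(k, v):
    """Rank-1 outer product k v^T as a d_k x d_v nested list."""
    return [[ki * vj for vj in v] for ki in k]

def add(A, B):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

d_k, d_v = 3, 2
S = [[0.0] * d_v for _ in range(d_k)]  # matrix cache, d_k x d_v

k_t = [1.0, 0.0, 2.0]  # key at step t, length d_k
v_t = [0.5, -1.0]      # value at step t, length d_v

write = outer(k_t, v_t)  # the rank-1 write k_t v_t^T
S = add(S, write)        # cache after the write

# A residual-stream SAE atom is a single vector. It has no d_k x d_v shape,
# so it cannot stand in for `write`.
vector_atom = [0.1, 0.2, 0.3]
shape_of_write = (len(write), len(write[0]))
```

Here `write` is a 3 × 2 matrix while `vector_atom` is a length-3 list, which is the mismatch that motivates rank-1 atoms.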
WriteSAE's fix: factor each atom into the native write shape. Each atom is itself a rank-1 outer product, the same shape as the cache write. Three more design choices:
- Closed form for per-token logit shift. Given the cache write, the paper derives an analytical expression for how a substitution changes the token-level logits. This is what makes the install measurable and predictable.
- Matched Frobenius norm training. Atoms are trained at the same Frobenius norm as the cache writes they replace, so substitution is one-for-one in scale, not just shape.
- Sustained installs. A single substitution shifts logits; the paper shows sustained three-position installs at 3x lift bring mid-rank target-in-continuation from 33.3% to 100% under greedy decoding. That is a behavioral edit, not just a representational one.
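The design choices above can be illustrated with a small sketch, under loudly stated assumptions: the cache is read out linearly as q^T S, and the result is mapped to logits by a fixed linear map `W_out`. Under those assumptions the logit shift from swapping a write for an atom is itself linear, which is the flavor of closed form the note describes; the paper's actual derivation is not reproduced here and may differ. The helper names and toy values are hypothetical. The only fact relied on is that a rank-1 matrix k v^T has Frobenius norm ||k|| · ||v||.

```python
import math

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def frobenius(M):
    return math.sqrt(sum(x * x for row in M for x in row))

def matched_atom(a, b, k, v):
    """Rank-1 atom a b^T rescaled to the Frobenius norm of the write k v^T.
    For rank-1 matrices, ||k v^T||_F = ||k|| * ||v||."""
    scale = (norm(k) * norm(v)) / (norm(a) * norm(b))
    return outer([scale * ai for ai in a], b)

def read(q, M):
    """Linear read-out q^T M, a vector of length d_v (assumed read rule)."""
    return [sum(q[i] * M[i][j] for i in range(len(q))) for j in range(len(M[0]))]

def logit_shift(q, write, atom, W_out):
    """Logit change when `atom` replaces `write`, assuming a linear read-out
    and a linear map W_out of shape d_v x vocab (illustrative assumption)."""
    delta = [[x - y for x, y in zip(ra, rw)] for ra, rw in zip(atom, write)]
    r = read(q, delta)
    return [sum(r[j] * W_out[j][t] for j in range(len(r))) for t in range(len(W_out[0]))]

# Toy values, chosen only for illustration.
k, v = [3.0, 4.0], [1.0, 0.0]   # original write factors
a, b = [1.0, 0.0], [0.0, 2.0]   # unscaled atom factors
write = outer(k, v)
atom = matched_atom(a, b, k, v)
q = [1.0, 1.0]                        # query used to read the cache
W_out = [[1.0, 0.0], [0.0, 1.0]]      # identity map to a 2-token "vocab"
shift = logit_shift(q, write, atom, W_out)
```

Because the atom and the write have the same Frobenius norm, any difference in `shift` comes from direction rather than scale, which is the point of matched-norm training as the note describes it.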
The reported validation numbers: substitution beats matched-norm ablation on 92.4% of 4,851 firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at R² = 0.98, and Mamba-2-370M substitutes at 88.1% over 2,500 firings.
Connections
- MDN (Momentum DeltaNet) (2026-05-11) introduced the momentum variant of DeltaNet that the wiki has been tracking as the hybrid-architecture default. WriteSAE is the interpretability complement: the model class that MDN extends becomes mechanistically inspectable.
- First Token Knows (2026-05-08) found that hallucination signal is detectable from a single token's representation. WriteSAE provides the tool to intervene on that representation at the cache-write level in hybrid models, not just read it.
- Hodoscope (Kurate cs.AI #11, 2026-04-13) is the unsupervised monitoring paper listed under Worth Watching in the 05-12 digest. WriteSAE is the supervised intervention complement at the cache-write site. Together they bracket the supervised/unsupervised axis of agent-misbehavior monitoring on the hybrid-model class.
- Sycophancy-lying shared circuit (Kurate cs.LG #11, Pandey 2026-04-21) found a shared circuit between two failure modes. WriteSAE is the kind of tool that, applied to such a circuit on a Mamba-2 backbone, could install a corrected behavior at the cache-write site rather than at the attention-layer site.
Research angle
- Cross-architecture atom transfer. If WriteSAE atoms trained on Mamba-2 transfer to RWKV-7 or Gated DeltaNet, that would be evidence that the recurrent-write features have shared structure across architectures. This is the analogue of Anthropic's "universal features" claim from transformers, but for the state-space family. Untested.
- Atom-as-intervention for safety. The 33.3% → 100% behavioral lift demonstrates that sustained installs change generation. The safety question: can a single WriteSAE atom suppress refusal-neuron-like behavior at the cache-write site? This is the structural analogue to Kazemi's refusal-neurons finding (2026-05-12 retweet), which works in MLPs of dense transformers. WriteSAE makes the same investigation tractable on hybrid models.
- Reverse direction: read WriteSAE features as a monitor. Even without installing, the WriteSAE decomposition is itself a feature dictionary at the cache-write site. Using those features as a real-time monitor (Hodoscope-style) is the cleanest deployment path.
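The monitoring angle in the last item can be sketched without any installation step. This note does not describe the WriteSAE encoder, so the scoring rule below is an assumption: each rank-1 atom (a, b) is scored against a write k v^T by their Frobenius inner product, which for rank-1 matrices factors as (a · k)(b · v). The function names and the threshold are hypothetical.

```python
def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def atom_scores(k, v, atoms):
    """Score each rank-1 atom (a, b) against the write k v^T using the
    Frobenius inner product <a b^T, k v^T> = (a . k) * (b . v)."""
    return [dot(a, k) * dot(b, v) for a, b in atoms]

def flag_firings(k, v, atoms, threshold):
    """Indices of atoms whose score exceeds a hypothetical threshold."""
    return [i for i, s in enumerate(atom_scores(k, v, atoms)) if s > threshold]

# Toy dictionary of two atoms and one observed write, for illustration only.
atoms = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
k_t, v_t = [2.0, 0.0], [3.0, 1.0]
scores = atom_scores(k_t, v_t, atoms)
fired = flag_firings(k_t, v_t, atoms, threshold=1.0)
```

The factored form means a monitor never has to materialize the d_k × d_v write: two dot products per atom are enough, which keeps per-token overhead small.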
Where it lives
Update responsible-ai.md: first SAE in the wiki to reach state-space and hybrid models, and the first behavioral install at the matrix-recurrent write site.