SNLP: Layer-Parallel Inference via Structured Newton Corrections
arXiv: 2605.17842 · HF: paper page · Tier: 1 (layer-parallel inference, parallel decoding, training-aware serving)
TL;DR
SNLP reframes Transformer layer-by-layer execution as a nonlinear residual equation and solves it with parallel Newton-style updates whose Jacobians are replaced by architecture-induced surrogate dynamics. In residual Transformers this yields Identity Newton (IDN), which reduces to a prefix-sum-like update; in mHC-style architectures the residual mixing matrix gives HC Newton (HCN). An SNLP-aware regularizer trains the model so one or a few structured Newton iterations approximate the sequential forward, and that regularizer also improves baseline sequential perplexity by 4.7% to 23.4%. On a 0.5B Nanochat-scale model, SNLP plus layer fusion plus chunkwise decomposition reaches 2.3x wall-clock speedup while still improving PPL by 6.1%. Pretrained off-the-shelf models are less amenable; exact convergence recovers sequential execution, so there is no monotonic inference-time scaling.
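One way to make the residual-equation framing concrete is the following sketch. It assumes a plain residual stack $h_{l+1} = h_l + f_l(h_l)$; the symbols $h_l$, $f_l$, $F_l$, and $\Delta_l$ are notation chosen here for illustration, not taken from the paper.

```latex
\begin{aligned}
F_l(H) &= h_l - h_{l-1} - f_{l-1}(h_{l-1}) = 0, \qquad l = 1,\dots,L \\
\text{exact Newton step:}\quad
\Delta_l - \Bigl(I + \tfrac{\partial f_{l-1}}{\partial h_{l-1}}\Bigr)\Delta_{l-1} &= -F_l\bigl(H^{(k)}\bigr), \qquad \Delta_0 = 0 \\
\text{identity surrogate:}\quad
\Delta_l - \Delta_{l-1} &= -F_l\bigl(H^{(k)}\bigr)
\;\Longrightarrow\;
\Delta_l = -\sum_{j=1}^{l} F_j\bigl(H^{(k)}\bigr) \\
h_l^{(k+1)} = h_l^{(k)} + \Delta_l &= h_0 + \sum_{j=0}^{l-1} f_j\bigl(h_j^{(k)}\bigr)
\end{aligned}
```

Under this reading, every $f_j$ is evaluated on the previous iterate independently, and the only cross-layer coupling is a cumulative sum. Layer $l$ becomes exact after $l$ iterations, so $L$ iterations reproduce the sequential forward, which is consistent with the note that exact convergence recovers sequential execution.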
Key findings
- The layer-by-layer dependency is a latency bottleneck that conventional tensor parallelism and pipeline parallelism cannot remove, because each layer needs the previous layer's hidden state.
- Treating the hidden-state trace across L layers as the fixed point of a nonlinear residual equation lets a Newton-style solver iterate toward the fixed point with corrections that can be computed in parallel across layers.
- Exact Newton requires Jacobian-vector products at each layer, which costs as much as the original sequential forward and offers no win. Naive fixed-point iteration on trained Transformers is unstable.
- SNLP replaces exact layer Jacobians with cheap architecture-induced surrogates. For residual Transformers, the surrogate is the identity, which makes the correction a prefix-sum-like update; SNLP calls this Identity Newton (IDN). For mHC-style architectures (Manifold-Constrained Hyper-Connections, the family the wiki tracked via DeepSeek V4 mHC and Raschka's 2026-05-17 architecture catalog), the surrogate is the residual mixing matrix; SNLP calls this HC Newton (HCN).
- SNLP-aware regularization is a training-time loss that pushes the model to make one or a few structured Newton iterations approximate the full sequential forward. This regularizer also reduces baseline sequential PPL by 4.7% to 23.4%, an unusual case where an inference-acceleration objective is also a quality-improvement objective.
- At inference, combining SNLP with layer fusion and chunkwise decomposition gives 2.3x wall-clock speedup on 0.5B Nanochat-scale models while improving PPL by 6.1%.
- Off-the-shelf pretrained models are less amenable: the structured Newton surrogate works cleanly only when the model was trained with the SNLP-aware regularizer.
- Exact convergence of the Newton iteration recovers sequential execution. SNLP does not provide monotonic inference-time scaling: more iterations do not strictly improve quality the way more rollouts can in RLVR test-time compute.
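The identity-surrogate update described in the findings above can be sketched on a toy residual stack. This is a minimal illustration, not the paper's implementation: the layer function `tanh(W h)`, the helper names (`make_layers`, `layer_fn`, `idn_forward`), and the initial guess (every layer starts at the input) are assumptions made here. With the identity surrogate, one Newton step sets each hidden state to the input plus a running sum of the branch outputs evaluated on the previous iterate.

```python
import numpy as np


def make_layers(num_layers, dim, scale=0.1, seed=0):
    """Hypothetical toy layers: one small random weight matrix per layer."""
    rng = np.random.default_rng(seed)
    return [scale * rng.standard_normal((dim, dim)) for _ in range(num_layers)]


def layer_fn(weight, h):
    """Residual branch f_l(h); a stand-in for an attention/MLP block."""
    return np.tanh(weight @ h)


def sequential_forward(weights, h0):
    """Reference: h_{l+1} = h_l + f_l(h_l), computed one layer at a time."""
    states = [h0]
    for w in weights:
        states.append(states[-1] + layer_fn(w, states[-1]))
    return np.stack(states)  # shape (L + 1, dim)


def idn_forward(weights, h0, num_iters):
    """Identity-surrogate Newton iterations on the whole hidden-state trace."""
    num_layers = len(weights)
    # Initial guess (an assumption): every layer's state equals the input.
    states = np.tile(h0, (num_layers + 1, 1))
    for _ in range(num_iters):
        # Each branch reads only the previous iterate, so these evaluations
        # are independent and could run in parallel across layers.
        branch = np.stack([layer_fn(w, states[l]) for l, w in enumerate(weights)])
        # The correction collapses to a prefix sum over the layer axis.
        states[1:] = h0 + np.cumsum(branch, axis=0)
    return states


if __name__ == "__main__":
    weights = make_layers(num_layers=6, dim=8)
    h0 = np.random.default_rng(1).standard_normal(8)
    reference = sequential_forward(weights, h0)
    for k in range(1, 7):
        err = np.abs(idn_forward(weights, h0, k) - reference).max()
        print(f"iterations={k} max_abs_error={err:.3e}")
```

In this toy setting, layer `l` is exact after `l` iterations, so running as many iterations as there are layers reproduces the sequential result. Any speedup therefore depends on a small number of iterations being accurate enough, which is what the SNLP-aware regularizer is described as training for.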
Relationship to prior wiki entries
SNLP adds a new axis of inference acceleration to the wiki: layer parallelism via solver-induced inference bias. The wiki's prior parallel-inference entries covered pipeline parallelism (which only helps across requests, not within a single forward pass) and speculative decoding (which is token-parallel, not layer-parallel). SNLP is layer-parallel within a single forward pass.
The mHC surrogate (HC Newton) lands in the architecture surface that Raschka's 2026-05-17 architecture catalog was mapping. That LinkedIn-newsletter post catalogued Gemma 4's KV sharing, Laguna XS.2's layer-wise attention budgeting, ZAYA1-8B's compressed convolutional attention, and DeepSeek V4's mHC (Manifold-Constrained Hyper-Connections). DeepSeek V4 mHC was one of the May open-model architectures specifically called out. SNLP's HC Newton uses the residual mixing matrix that mHC uses, which means a frontier open MoE built on mHC has a natural SNLP companion.
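To illustrate how a residual mixing matrix could serve as the Newton surrogate, here is a hedged toy sketch. It assumes an mHC-style layer with several residual streams of the form `x_next = mix @ x + branch(x)`, with a fixed per-layer mixing matrix; real mHC layers may compute their mixing matrices differently, and all names here (`make_hc_layers`, `branch`, `hcn_forward`) are hypothetical. Replacing the exact Jacobian with `mix` turns each Newton step into a linear recurrence driven by branch outputs from the previous iterate.

```python
import numpy as np


def make_hc_layers(num_layers, num_streams, dim, seed=0):
    """Hypothetical multi-stream layers, each with a residual mixing matrix."""
    rng = np.random.default_rng(seed)
    layers = []
    for _ in range(num_layers):
        layers.append({
            # Near-identity mixing across residual streams (an assumption).
            "mix": np.eye(num_streams)
            + 0.05 * rng.standard_normal((num_streams, num_streams)),
            "pre": rng.standard_normal(num_streams) / num_streams,
            "post": rng.standard_normal(num_streams),
            "weight": 0.1 * rng.standard_normal((dim, dim)),
        })
    return layers


def branch(layer, x):
    """Nonlinear branch: pool the streams, transform, write back to streams."""
    pooled = layer["pre"] @ x                         # shape (dim,)
    out = np.tanh(layer["weight"] @ pooled)           # shape (dim,)
    return np.outer(layer["post"], out)               # shape (streams, dim)


def sequential_forward(layers, x0):
    """Reference: x_{l+1} = mix_l @ x_l + branch_l(x_l), layer by layer."""
    states = [x0]
    for layer in layers:
        x = states[-1]
        states.append(layer["mix"] @ x + branch(layer, x))
    return np.stack(states)  # shape (L + 1, streams, dim)


def hcn_forward(layers, x0, num_iters):
    """Newton iterations with the mixing matrix as the Jacobian surrogate."""
    states = np.stack([x0] * (len(layers) + 1))
    for _ in range(num_iters):
        # Branches read only the previous iterate: parallel across layers.
        branches = [branch(layer, states[l]) for l, layer in enumerate(layers)]
        # Linear recurrence x_{l+1} = mix_l @ x_l + b_l. It is written as a
        # loop here; being linear, it could also be done as an associative scan.
        new_states = [x0]
        for layer, b in zip(layers, branches):
            new_states.append(layer["mix"] @ new_states[-1] + b)
        states = np.stack(new_states)
    return states
```

The only sequential work left per iteration is the small stream-mixing recurrence, while the expensive branches are evaluated independently. As with the identity case, running as many iterations as there are layers recovers the sequential forward in this toy model.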
SNLP composes with parallel decoding methods. Orthrus (2026-05-14) runs an AR head and a diffusion head on the same frozen LLM, both attending to the same shared KV cache, with an exact-consensus mechanism that produces bit-identical AR output at up to 7.8x speedup. Orthrus is token-parallel and SNLP is layer-parallel, so the two target different axes and can run on the same forward pass.
Why it matters
Layer-parallel inference within a single forward pass has not been solved at scale. SNLP's framing (a Newton-style solver with architecture-induced surrogates) is the most principled approach the wiki has seen. The surprising part is that SNLP-aware regularization also improves baseline PPL: SNLP is not a quality-cost tradeoff but a quality-improving inference-acceleration method, in the same class as Make Each Token Count's selective KV retention.
The off-the-shelf-pretrained limitation is the deployment caveat. SNLP-aware regularization is a training-time intervention. Frontier labs would need to bake it into pre-training or post-training to get the wall-clock benefit. If a major lab adopts SNLP regularization in the next foundation-model training run, the layer-parallel inference axis is suddenly live.
Research angle
- Does SNLP-aware regularization compose with MoE-muP (2026-05-17 Kurate cs.LG #13, the Vankadara et al. paper deriving closed-form Maximally Scale-Stable Parameterization across the five MoE axes)? The two are training-time interventions on orthogonal substrates: one for parameterization stability across scale, one for layer-parallel inference. Whether they compose into a single pre-training recipe is the open question.
- At what scale does HCN beat IDN? IDN is simpler. HCN exploits the residual mixing matrix. Whether the gap grows with depth or with MoE configuration is the cross-architecture question.
- Inference-time scaling under bounded Newton iterations. SNLP gives no monotonic scaling at exact convergence. Whether a bounded-iteration regime can exchange iterations for quality (similar to test-time scaling in RLVR) is open.
Source
raw/huggingface/2026-05-19-snlp-layer-parallel-inference-via-structured-newton-correcti.md