Switch-KD: Visual-Switch Knowledge Distillation for VLMs
Date: 2026-04-18
Tier: 1 — Distillation (multimodal extension)
arXiv: 2604.14629
Raw: source
TL;DR
Standard VLM distillation underperforms because visual and language supervision are applied in separate modality silos, so the student receives misaligned multimodal knowledge. Switch-KD forces the transfer through a single shared space: the text probability distribution. It "switches" the student's visual outputs into the teacher's language pathway, then applies a bidirectional logit-difference loss. A 0.5B TinyLLaVA student distilled from a 3B teacher gains 3.6 points on average across 10 benchmarks, with no architecture changes.
Mechanism
Standard VLM distillation supervises vision and language outputs separately. But in VLMs, visual representations are fused inside the language space anyway — the late-fusion architecture means visual tokens become language-space vectors before any generation step. Switch-KD exploits this: it routes the student's visual features through the teacher's language pathway to build cross-modal reference distributions, then supervises in that shared language probability space.
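The routing described above can be illustrated with a toy sketch. Everything in it is a hypothetical stand-in rather than the paper's implementation: each "vision" matrix plays the role of an encoder plus projector that emits language-space features, each "lm" matrix plays the role of a language model head, and the student's features are assumed to already match the teacher's hidden width. The point is only to show which components are combined to form the switched reference distribution.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

rng = random.Random(0)
IMAGE_DIM, HIDDEN, VOCAB = 6, 4, 5  # toy sizes, chosen arbitrarily

# Hypothetical stand-ins (assumptions, not the paper's modules).
teacher_vision = random_matrix(HIDDEN, IMAGE_DIM, rng)
teacher_lm = random_matrix(VOCAB, HIDDEN, rng)
student_vision = random_matrix(HIDDEN, IMAGE_DIM, rng)
student_lm = random_matrix(VOCAB, HIDDEN, rng)

image = [rng.uniform(-1.0, 1.0) for _ in range(IMAGE_DIM)]

# Ordinary end-to-end paths.
teacher_probs = softmax(linear(teacher_lm, linear(teacher_vision, image)))
student_probs = softmax(linear(student_lm, linear(student_vision, image)))

# The "switch": student visual features decoded by the teacher's language pathway.
switched_probs = softmax(linear(teacher_lm, linear(student_vision, image)))

# Both gaps are measured in the same space: a distribution over text tokens.
visual_gap = kl_divergence(teacher_probs, switched_probs)
overall_gap = kl_divergence(teacher_probs, student_probs)
print(f"visual gap: {visual_gap:.4f}, overall gap: {overall_gap:.4f}")
```

Because the switched path shares the teacher's language pathway, any difference between `teacher_probs` and `switched_probs` is attributable to the student's visual features alone, which is one way to read why supervising in the text probability space avoids separate per-modality losses.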
The DBiLD (Dynamic Bi-directional Logits Difference) loss adaptively focuses on informative probability regions while preserving distributional structure in both directions (teacher→student and student→teacher alignment). This prevents mode collapse onto the teacher's distribution while still pulling the student's predictions toward the teacher's high-confidence regions.
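A minimal sketch of a bidirectional logit-difference loss may help make this concrete. It is one plausible reading, not the paper's definition: the fixed `k`, the use of KL(teacher || student) in both terms, and the unweighted sum are all assumptions, and the "dynamic" selection rule is not reproduced. The function names are hypothetical.

```python
import math

def softmax(values, temperature=1.0):
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(logits, k):
    """Indices of the k largest logits, largest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def pairwise_differences(logits, indices):
    """logits[i] - logits[j] for every pair (i, j) with i listed before j."""
    picked = [logits[i] for i in indices]
    return [picked[a] - picked[b]
            for a in range(len(picked))
            for b in range(a + 1, len(picked))]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def logit_difference_loss(teacher_logits, student_logits, leader_logits, k, temperature=1.0):
    """Compare teacher and student on the top-k positions chosen by `leader_logits`."""
    if k < 2:
        raise ValueError("k must be at least 2 to form pairwise differences")
    idx = top_k_indices(leader_logits, k)
    t = softmax(pairwise_differences(teacher_logits, idx), temperature)
    s = softmax(pairwise_differences(student_logits, idx), temperature)
    return kl_divergence(t, s)

def bidirectional_loss(teacher_logits, student_logits, k=3, temperature=1.0):
    # Teacher-led term: focus on the tokens the teacher ranks highest.
    teacher_led = logit_difference_loss(
        teacher_logits, student_logits, teacher_logits, k, temperature)
    # Student-led term: focus on the tokens the student ranks highest.
    student_led = logit_difference_loss(
        teacher_logits, student_logits, student_logits, k, temperature)
    return teacher_led + student_led  # equal weighting is an assumption

teacher = [2.0, 1.0, 0.5, -1.0, -3.0]
student = [1.5, 1.2, -0.5, 0.3, -2.0]
print(bidirectional_loss(teacher, student))
```

Working on differences between top-k logits makes the loss insensitive to a constant shift in either model's logits, and restricting to the top-k positions keeps the long tail of near-zero probabilities from dominating the signal.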
Connection to TESSY and BLD
All three papers this week are solving variations of the teacher-student alignment problem:
- TESSY: stylistic divergence — interleave tokens to bridge generation style gap
- BLD: tokenizer divergence — bridge through byte-level interface
- Switch-KD: modality divergence — bridge through shared language probability space
They form a coherent cluster: different kinds of representation gap between teacher and student, each solved by finding a neutral "common ground" representation and distilling through it.
Research Angle
- Does the switch mechanism scale to larger teachers (7B, 13B)? The 3B→0.5B experiment is a 6x compression in parameter count.
- The DBiLD bidirectional loss is interesting — standard KD is one-directional (teacher→student). What does the reverse direction add?
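As background for the second question, the general asymmetry between the two KL directions can be checked on a toy example. This does not claim that DBiLD's reverse term is a reverse KL; the distributions below are invented purely for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# A teacher with two equally likely modes and a thin tail (made-up numbers).
teacher = [0.49, 0.49, 0.01, 0.01]

# Two hypothetical students: one spreads mass everywhere, one commits to a single mode.
student_cover = [0.25, 0.25, 0.25, 0.25]
student_seek = [0.96, 0.02, 0.01, 0.01]

# Forward direction, KL(teacher || student): penalizes missing any teacher mode.
forward_cover = kl_divergence(teacher, student_cover)
forward_seek = kl_divergence(teacher, student_seek)

# Reverse direction, KL(student || teacher): penalizes mass where the teacher has little.
reverse_cover = kl_divergence(student_cover, teacher)
reverse_seek = kl_divergence(student_seek, teacher)

print(f"forward: cover={forward_cover:.3f}  seek={forward_seek:.3f}")
print(f"reverse: cover={reverse_cover:.3f}  seek={reverse_seek:.3f}")
```

In this example the forward direction prefers the spread-out student and the reverse direction prefers the single-mode student, so combining both directions pulls against each failure mode. Whether that is the effect the reverse term has in DBiLD remains the open question.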