Switch-KD: Visual-Switch Knowledge Distillation for VLMs

Date: 2026-04-18
Tier: 1 — Distillation (multimodal extension)
arXiv: 2604.14629
Raw: source

TL;DR

VLM distillation tends to fail when visual and language supervision are applied in separate modality silos, which leaves the student with misaligned multimodal knowledge. Switch-KD forces the transfer through a single shared space: the text probability distribution. It "switches" the student's visual outputs into the teacher's language pathway, then applies a bidirectional logit-difference loss (DBiLD). A 0.5B TinyLLaVA student distilled from a 3B teacher gains 3.6 points on average across 10 benchmarks, with no architecture changes.

Mechanism

Standard VLM distillation supervises vision and language outputs separately. But in VLMs, visual representations are fused inside the language space anyway: with a late-fusion architecture, visual tokens become language-space vectors before any generation step. Switch-KD exploits this. It routes the student's visual features through the teacher's language pathway to build cross-modal reference distributions, then supervises the student in that shared language probability space.
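The routing described above can be sketched with toy linear maps. This is only an illustration under stated assumptions, not the paper's implementation: the four matrices, the sizes, and the helper names (`next_token_distribution`, `visual_gap`, `student_gap`) are all hypothetical, and the teacher and student are assumed to share a hidden width so the student's visual tokens fit the teacher's language pathway without an extra projector.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(weights, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def kl(p, q):
    # KL(p || q) for two discrete distributions with q > 0 everywhere
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def next_token_distribution(vision, language, image):
    # vision: image -> language-space vectors; language: vectors -> token logits
    visual_tokens = matvec(vision, image)
    return softmax(matvec(language, visual_tokens))

rng = random.Random(0)
IMG, HID, VOCAB = 6, 4, 5  # toy sizes (assumption: shared hidden width)

# Hypothetical stand-ins for the real encoders and language models.
teacher_vision = random_matrix(HID, IMG, rng)
student_vision = random_matrix(HID, IMG, rng)
teacher_language = random_matrix(VOCAB, HID, rng)
student_language = random_matrix(VOCAB, HID, rng)
image = [rng.uniform(0.0, 1.0) for _ in range(IMG)]

# Teacher end to end: the usual distillation target.
p_teacher = next_token_distribution(teacher_vision, teacher_language, image)
# The "switch": student visual features pushed through the teacher's language pathway.
p_switch = next_token_distribution(student_vision, teacher_language, image)
# Student end to end.
p_student = next_token_distribution(student_vision, student_language, image)

# Both gaps live in the same text probability space.
visual_gap = kl(p_teacher, p_switch)    # differs from teacher only via the visual features
student_gap = kl(p_switch, p_student)   # differs from the reference only via the language pathway
print(visual_gap, student_gap)
```

Because the switched distribution differs from the teacher's only through the visual features, the first gap isolates the visual mismatch in language-probability terms. How the paper weights or combines such terms is not stated in these notes.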

The DBiLD (Dynamic Bi-directional Logits Difference) loss adaptively focuses on informative probability regions while preserving distributional structure in both directions (teacher→student and student→teacher alignment). This keeps the student from collapsing onto the teacher distribution while still pulling its predictions toward the teacher's high-confidence regions.
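These notes do not give the DBiLD formula, so the following is a sketch of one plausible reading of a bidirectional logit-difference loss, not the paper's definition. Assumptions: each direction picks the top-k tokens according to one "leader" model, both models' logits at those tokens are turned into pairwise differences, and the two difference distributions are compared with KL. The "dynamic" part is not modeled here; a fixed `k` stands in for it. All function names are hypothetical.

```python
import math

def softmax(xs, temperature=1.0):
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def top_k_indices(logits, k):
    # indices of the k largest logits, largest first
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def pairwise_differences(values):
    # all differences values[i] - values[j] for i < j
    n = len(values)
    return [values[i] - values[j] for i in range(n) for j in range(i + 1, n)]

def directed_loss(leader, teacher, student, k, temperature):
    # The leader's top-k tokens define which region of the vocabulary is compared.
    idx = top_k_indices(leader, k)
    t_diff = pairwise_differences([teacher[i] for i in idx])
    s_diff = pairwise_differences([student[i] for i in idx])
    return kl(softmax(t_diff, temperature), softmax(s_diff, temperature))

def bidirectional_logit_difference_loss(teacher, student, k=3, temperature=1.0):
    if k < 2:
        raise ValueError("k must be at least 2 to form a logit difference")
    teacher_led = directed_loss(teacher, teacher, student, k, temperature)
    student_led = directed_loss(student, teacher, student, k, temperature)
    return teacher_led + student_led

teacher_logits = [4.0, 2.0, 1.0, -1.0, -3.0]
student_logits = [1.0, 3.0, 0.5, 2.0, -2.0]
loss = bidirectional_logit_difference_loss(teacher_logits, student_logits)
print(loss)
```

Working on differences rather than raw logits makes the loss insensitive to a constant offset between the two models' logits. The student-led term also penalizes tokens the student ranks highly but the teacher does not, which a teacher-led term alone would never look at.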

Connection to TESSY and BLD

All three papers this week address variations of the teacher-student alignment problem:

TESSY: stylistic divergence, bridged by interleaving tokens to close the generation-style gap
BLD: tokenizer divergence, bridged through a byte-level interface
Switch-KD: modality divergence, bridged through a shared language probability space

They form a coherent cluster: different kinds of representation gap between teacher and student, each solved by finding a neutral "common ground" representation and distilling through it.

Research Angle

Does the switch mechanism scale to larger teachers (7B, 13B)? The 3B→0.5B experiment is a 6x compression.
The DBiLD bidirectional loss is interesting: standard KD is one-directional (teacher→student). What does the reverse direction add?
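One way to start probing the second question is the general asymmetry between forward and reverse KL, shown below on made-up numbers. This does not reproduce DBiLD; the three distributions are hypothetical and chosen only to make the asymmetry visible.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions with q > 0 everywhere
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical bimodal teacher over three tokens.
teacher = [0.49, 0.02, 0.49]
# Two hypothetical students: one spreads mass over everything, one commits to a single mode.
covering_student = [0.34, 0.32, 0.34]
mode_student = [0.96, 0.02, 0.02]

# Teacher→student (forward KL): punishes the student for missing teacher mass.
forward_cover = kl(teacher, covering_student)
forward_mode = kl(teacher, mode_student)

# Student→teacher (reverse KL): punishes the student for mass the teacher does not support.
reverse_cover = kl(covering_student, teacher)
reverse_mode = kl(mode_student, teacher)

print("forward prefers covering:", forward_cover < forward_mode)
print("reverse prefers single mode:", reverse_mode < reverse_cover)
```

The two directions rank the same pair of students in opposite order, so a loss that uses both is constrained from both sides. Whether DBiLD's reverse term behaves this way is an open question for these notes, not something the sketch establishes.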
