MMProLong: training long-context vision-language models with generalization beyond 128K
Source: HuggingFace Daily Papers · 2026-05-14 · Paper: arXiv 2605.13831 · Tier: 2. Long-context VLMs, training recipes, generalization
TL;DR
A systematic study of long-context continued pre-training for vision-language models, extending a 7B model from 32K to 128K context. Three findings: (1) long-document VQA is substantially more effective training data than OCR transcription; (2) a balanced sequence-length distribution beats target-length-focused training and generalizes better at unseen lengths; (3) retrieval remains the primary bottleneck, which favors retrieval-heavy data mixtures. The resulting model, MMProLong, is built on Qwen2.5-VL-7B with only 5B tokens of long-context pre-training. It gains 7.1% on long-document VQA and maintains strong performance at 256K and 512K, well beyond its 128K training window. It also transfers, without task-specific training, to webpage-multimodal needle retrieval, long-context vision-text compression, and long-video understanding.
Why it matters
The wiki has been tracking long-context training and serving on two tracks: pre-training recipes (this paper, Lighthouse Attention) and inference-time selectivity (Make Each Token Count, MISA, UniPrefill). MMProLong is the cleanest long-context-VLM training recipe in the wiki and one of the more useful negative results on data mix: training focused on the target length generalizes worse than training on a balanced length distribution.
Mechanism
Three ablations, each operationalized:
Pre-training data mix             Headline finding
─────────────────────────         ───────────────────────────────────
Long-document VQA vs OCR     ───► VQA wins. The reasoning task pulls more long-context capability than transcription.
128K-focused vs balanced     ───► Balanced wins. Generalization beyond train length requires diverse retrieval positions.
Retrieval-heavy vs reasoning ───► Retrieval-heavy wins. Long-context bottleneck is finding the relevant chunk, not reasoning over it.
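To make the two data-mix axes concrete, here is a minimal Python sketch of how a training-batch specification could be drawn under a target-length-focused versus a balanced length distribution, combined with a retrieval-heavy task mixture. Everything numeric here is an assumption for illustration: the bucket boundaries, the weights, and the names `TARGET_FOCUSED`, `BALANCED`, `TASK_MIX`, and `sample_batch_spec` are hypothetical and do not come from the paper.

```python
import random
from collections import Counter

# Hypothetical length buckets in tokens. The source only states the 32K
# starting context and the 128K target; intermediate buckets are assumed.
LENGTH_BUCKETS = [32 * 1024, 64 * 1024, 96 * 1024, 128 * 1024]

# Two illustrative length distributions (all weights are assumptions).
TARGET_FOCUSED = {
    32 * 1024: 0.05,
    64 * 1024: 0.05,
    96 * 1024: 0.10,
    128 * 1024: 0.80,
}
BALANCED = {length: 1 / len(LENGTH_BUCKETS) for length in LENGTH_BUCKETS}

# Illustrative retrieval-heavy mixture of long-document VQA examples.
TASK_MIX = {"vqa_retrieval": 0.7, "vqa_reasoning": 0.3}


def sample_batch_spec(length_dist, task_mix, n, seed=0):
    """Draw n (sequence_length, task) pairs from the given distributions."""
    rng = random.Random(seed)
    lengths = rng.choices(
        list(length_dist), weights=list(length_dist.values()), k=n
    )
    tasks = rng.choices(list(task_mix), weights=list(task_mix.values()), k=n)
    return list(zip(lengths, tasks))


def length_histogram(spec):
    """Count how many sampled examples fall into each length bucket."""
    return Counter(length for length, _ in spec)


if __name__ == "__main__":
    for name, dist in [("target-focused", TARGET_FOCUSED), ("balanced", BALANCED)]:
        hist = length_histogram(sample_batch_spec(dist, TASK_MIX, n=1000))
        print(name, dict(sorted(hist.items())))
```

Under the balanced setting every bucket receives a comparable share of examples, which is the property the second ablation credits for generalization at unseen lengths.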
The result is a tight recipe: extend with VQA, balance the lengths, weight retrieval. Total cost: 5B tokens on a 7B base. The transfer claims (256K, 512K beyond 128K training; webpage needle retrieval; vision-text compression; long-video understanding without task-specific training) suggest the recipe is buying genuinely transferable long-context capability rather than memorizing target-length artifacts.
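One way to probe the length-generalization claim is a needle-style grid over context lengths and needle depths, including lengths past the training window. The sketch below is a hypothetical harness, not the paper's benchmark: `place_needle`, `eval_grid`, and the chosen depths are assumptions, and the model call itself is left out.

```python
TRAIN_WINDOW = 128 * 1024  # training context length stated in the source

# Lengths named in the source's transfer claims; the depths are assumed.
EVAL_LENGTHS = [128 * 1024, 256 * 1024, 512 * 1024]
DEPTHS = [0.0, 0.25, 0.5, 0.75, 1.0]


def place_needle(haystack, needle, depth):
    """Insert `needle` (a list of items) into `haystack` at fractional depth.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    pos = int(depth * len(haystack))
    return haystack[:pos] + needle + haystack[pos:]


def eval_grid(lengths, depths):
    """Enumerate every (context_length, depth) probe point."""
    return [(length, depth) for length in lengths for depth in depths]


def beyond_training(length):
    """True when a probe length exceeds the training window."""
    return length > TRAIN_WINDOW


if __name__ == "__main__":
    for length, depth in eval_grid(EVAL_LENGTHS, DEPTHS):
        tag = "extrapolation" if beyond_training(length) else "in-window"
        print(f"{length:>7} tokens, depth {depth:.2f}: {tag}")
```

Reporting accuracy separately for in-window and extrapolation cells keeps the 256K and 512K results from being averaged together with the 128K ones.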
Connections
- Make Each Token Count (2026-05-12) said the full cache is dilutive once context is long enough. MMProLong gives the training-side complement: training data should also be diverse-length to teach the model to find signal at any position. The two papers together say "long context needs balance at both training and inference time."
- Lighthouse Attention (NousResearch, @omarsar0 retweet 2026-05-12, paper 2605.06554) speeds long-context training via a removable subquadratic wrapper. MMProLong is orthogonal: it keeps the training procedure and changes only the data mix. The natural composition: train with Lighthouse Attention's wrapper and MMProLong's balanced, retrieval-heavy mix.
- r/LocalLLaMA Qwen 3.6 long-context practitioner reports (1tcc7h5: 24 tok/s on GTX 1080 with KV cache quantization) confirm long-context inference is now a tractable deployment target. MMProLong's training recipe is what gets you a long-context model worth deploying.
Research angle
- Recipe transfer across base models. MMProLong is built on Qwen2.5-VL-7B. The transfer question: does the same recipe (balanced length distribution, retrieval-heavy, VQA-driven) work on Llama-VL, InternVL, or LLaVA-OneVision at the same parameter scale? If yes, this becomes the default long-context VLM recipe.
- Beyond VLMs. Long-document VQA is the VLM analogue of long-document QA in text-only models. Whether the balanced-mix finding transfers to text-only long-context training is a one-experiment check, but with large practical implications for the broader long-context training literature.
- Composability with diffusion-decoder approaches. Orthrus (same day) accelerates inference via parallel decoding. Whether an MMProLong-trained model retains its long-context behavior under Orthrus drafting is an open composition question.
Where it lives
Update kv-cache.md: first long-context VLM training recipe in the wiki. Cross-reference with knowledge-distillation.md: the retrieval-heavy mix finding implies that long-context distillation should weight retrieval-style examples too.