Qwen3.5-Omni Technical Report

TL;DR

Qwen3.5-Omni scales the Qwen-Omni family to hundreds of billions of parameters, with a 256k context window and an MoE architecture for both the Thinker and Talker modules. It reports SOTA results across 215 audio and audio-visual benchmarks and surpasses Gemini 3.1 Pro on key audio tasks. The main technical contribution is ARIA (Adaptive Rate Interleave Alignment), a streaming speech synthesis alignment method that addresses the token-rate mismatch between text and speech tokenizers.

Key Findings

Architecture:

Hybrid Attention MoE for both Thinker (reasoning/text) and Talker (speech synthesis)
256k context window: 10+ hours of audio, or 400 seconds of 720p video at 1 FPS
Trained on heterogeneous text-vision pairs plus more than 100M hours of audio-visual content
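
The context figures above imply a rough per-modality token budget. The sketch below only does that arithmetic. It assumes "256k" means 256 * 1024 tokens and that the whole window is spent on a single modality, neither of which the note confirms; the function names are illustrative.

```python
# Back-of-envelope token budget implied by the context figures quoted above.
# Assumptions (not from the report): "256k" = 256 * 1024 tokens, and the whole
# window is spent on one modality with no prompt or system tokens.
CONTEXT_TOKENS = 256 * 1024


def tokens_per_second_audio(hours: float) -> float:
    """Average audio tokens per second if `hours` of audio fill the window."""
    return CONTEXT_TOKENS / (hours * 3600)


def tokens_per_frame_video(seconds: float, fps: float = 1.0) -> float:
    """Average tokens per frame if `seconds` of video at `fps` fill the window."""
    return CONTEXT_TOKENS / (seconds * fps)


if __name__ == "__main__":
    print(f"audio: at most {tokens_per_second_audio(10):.2f} tokens/s")
    print(f"video: about {tokens_per_frame_video(400):.0f} tokens/frame")
```

Under these assumptions, 10 hours of audio works out to at most about 7.3 tokens per second, and 400 frames of video to about 655 tokens per frame. If the video budget also has to cover an audio track, the per-frame figure would be lower.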

ARIA (Adaptive Rate Interleave Alignment): Streaming speech synthesis is fragile because text tokenizers and speech tokenizers run at different rates. Text tokens arrive faster than the speech stream consumes them, which produces jitter and unnatural prosody. ARIA aligns text and speech units dynamically by interleaving them adaptively rather than at a fixed ratio. The reported result is a significant prosody improvement with minimal added latency.
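
To make the fixed-ratio versus adaptive distinction concrete, here is a toy sketch. It is not the ARIA algorithm from the report, whose details this note does not record. It assumes each text token comes with a known number of speech frames (`frames_per_token`, a hypothetical input that a real system would have to predict), and it counts how many speech tokens a schedule emits before the text token they belong to.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # ("T", text_token) or ("S", speech_token)


def interleave_adaptive(
    text: Sequence[str], speech: Sequence[str], frames_per_token: Sequence[int]
) -> List[Token]:
    """After text token i, emit frames_per_token[i] speech tokens."""
    if len(frames_per_token) != len(text):
        raise ValueError("need one frame count per text token")
    out: List[Token] = []
    pos = 0
    for tok, n in zip(text, frames_per_token):
        out.append(("T", tok))
        out.extend(("S", s) for s in speech[pos:pos + n])
        pos += n
    out.extend(("S", s) for s in speech[pos:])  # flush any leftover speech
    return out


def interleave_fixed(
    text: Sequence[str], speech: Sequence[str], ratio: int
) -> List[Token]:
    """Baseline: after every text token, emit exactly `ratio` speech tokens."""
    return interleave_adaptive(text, speech, [ratio] * len(text))


def speech_before_text(stream: Sequence[Token], frames_per_token: Sequence[int]) -> int:
    """Count speech tokens emitted before the text token they realize."""
    # owner[i] is the index of the text token that speech token i belongs to.
    owner = [ti for ti, n in enumerate(frames_per_token) for _ in range(n)]
    seen_text = 0
    speech_idx = 0
    early = 0
    for kind, _ in stream:
        if kind == "T":
            seen_text += 1
        else:
            if speech_idx < len(owner) and owner[speech_idx] >= seen_text:
                early += 1
            speech_idx += 1
    return early


if __name__ == "__main__":
    text = ["hi", "extraordinary", "!"]
    frames = [2, 6, 1]  # hypothetical per-token speech durations
    speech = [f"s{i}" for i in range(sum(frames))]
    print(speech_before_text(interleave_fixed(text, speech, 3), frames))
    print(speech_before_text(interleave_adaptive(text, speech, frames), frames))
```

In this example the fixed 3:1 schedule emits one speech token ahead of its text token, while the adaptive schedule emits none. The point is only that a fixed ratio drifts when tokens differ in spoken duration.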

New capability: Audio-Visual Vibe Coding. The model can write code from combined audio and visual instructions. The paper calls this an emergent capability of omnimodal training at scale.

Multilingual: 10 languages, human-like emotional nuance, zero-shot voice customization from user audio samples.

Benchmark positioning: SOTA across 215 audio and audio-visual subtasks. Beats Gemini 3.1 Pro on key audio tasks and matches it on comprehensive audio-visual benchmarks. This extends the Qwen series' competitive position into the omnimodal frontier (Qwen3.6-35B already beat Gemma 4 on agentic coding, noted 04-19).

Key Technical Note: MoE + Long Context

The MoE architecture matters for long-context audio-visual inference: instead of running every token through all parameters, as a dense model does, an MoE model activates only a subset of experts per token. At 256k context with video and audio tokens, this keeps per-token compute manageable; a dense model at the same parameter count would be far more expensive to serve.
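
The "subset of experts per token" idea can be sketched with generic top-k routing. This is a minimal illustration of the standard technique, not the router described in the report. The scalar experts, the logits, and all function names are assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int,
) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


if __name__ == "__main__":
    # Four toy experts that each scale the input by a different factor.
    experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
    logits = [0.0, 2.0, 1.0, -1.0]  # hypothetical router scores for one token
    print(route_top_k(logits, k=2))
    print(moe_layer(1.0, logits, experts, k=2))
```

With k = 2 of 4 experts, only half of the expert parameters are touched for this token. The attention layers and their cache are not reduced by routing, so the savings apply to the expert (feed-forward) compute.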

Relations to Prior Wiki Pages

Qwen3.6-35B / efficiency frontier (04-19): Alibaba's aggressive release cadence continues. Qwen3.6 beat Gemma 4 on agentic coding; Qwen3.5-Omni targets the omnimodal benchmark frontier. Different scales, same strategy: push the open/API frontier on multiple dimensions simultaneously.
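KV Cache: 256k context in an MoE model puts heavy pressure on the KV cache. In a standard MoE transformer the cache belongs to the attention layers rather than to the experts, so whether expert-specific caches or eviction policies are needed here is speculative.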
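
The cache pressure from a long context can be estimated with the usual formula for a standard transformer: two tensors (K and V) per attention layer, each of size KV heads × head dimension × sequence length. The configuration below is entirely hypothetical and not taken from the report. It only shows the order of magnitude, and that the estimate depends on attention shape rather than on expert count.

```python
def kv_cache_bytes(
    attn_layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,
) -> int:
    """KV cache size for one sequence: K and V for every caching attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration (NOT from the report): 48 attention layers that
    # keep a full cache, 8 KV heads of dimension 128, 16-bit cache entries.
    size = kv_cache_bytes(attn_layers=48, kv_heads=8, head_dim=128, seq_len=256 * 1024)
    print(f"{size / 2**30:.0f} GiB per 256k-token sequence")
```

For this made-up configuration the estimate is 48 GiB for a single full-length sequence. A hybrid attention design in which only some layers keep a full cache would lower `attn_layers` and shrink the figure proportionally.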

Raw Source

→ raw/huggingface/2026-04-20-qwen35-omni-technical-report.md