A 2.5B omnimodal text+vision+audio encoder built by merging Qwen3 specialists.
An omnimodal embedding model assembled by linearly merging three Qwen3-1.7B specialist text backbones (a Bi+MNTP text adapter, Qwen3-ASR for audio, and Qwen3-VL for vision) and attaching each specialist's modality head. After contrastive training on a corpus of 1.8M multimodal pairs, it sets a new omnimodal Pareto frontier, outperforming Nemotron-Omni-3B by +17 on text and +5 on image at roughly half the parameter count.
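The linear-merge step described above can be sketched as a weighted average over the specialists' shared backbone parameters, with each modality head kept from its original specialist. This is a minimal illustration, not the actual recipe: the function name, the uniform weights, and the toy state dicts are all assumptions.

```python
def linear_merge(state_dicts, weights=None):
    """Linearly combine parameters from specialist checkpoints that
    share one backbone architecture (hypothetical sketch).

    state_dicts: list of {param_name: value} mappings, one per specialist.
    weights: per-specialist mixing coefficients; defaults to a uniform
             average, i.e. plain parameter averaging.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    # All specialists share the same backbone, so parameter names match;
    # modality-specific heads would be copied separately, not merged.
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }


# Toy example with scalar "parameters" standing in for tensors:
text_sd = {"backbone.layer0.weight": 1.0}
audio_sd = {"backbone.layer0.weight": 2.0}
vision_sd = {"backbone.layer0.weight": 3.0}

merged = linear_merge([text_sd, audio_sd, vision_sd])
print(merged["backbone.layer0.weight"])  # uniform average of the three
```

In practice the same expression would run over real tensors (e.g. PyTorch state dicts), with the text, audio, and vision heads bolted onto the merged backbone unchanged.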