
A 2.5B omnimodal text+vision+audio encoder built by merging Qwen3 specialists.
A workable 2.4B-parameter dense embedding model from BidirLM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 3080 · Vast.ai · Spot · 10 GB VRAM | $0.03 |
| NVIDIA GeForce RTX 3080 · Vast.ai · On-Demand · 10 GB VRAM | $0.03 |
BidirLM-Omni-2.5B-Embedding is a 2.4 billion parameter bidirectional encoder that produces fixed-size embeddings from text, images, and audio — and aligns them into a single 2048-dimensional representation space. Developed by BidirLM, this model is the omnimodal member of the BidirLM family, built by adapting and merging specialized Qwen3 causal decoders into a unified encoder with bidirectional attention.
This model targets developers who need cross-modal retrieval, semantic similarity, or clustering without relying on cloud APIs. Unlike standard text-only embedding models, BidirLM-Omni directly encodes images and audio alongside text, making it suitable for applications like multimodal search, content-based recommendation, and zero-shot classification across modalities. Its 2.4B parameter count places it in a sweet spot: large enough to capture nuanced representations, small enough to run on consumer hardware with quantization.
BidirLM-Omni is a dense transformer encoder — not a mixture of experts. All 2.4B parameters are active during inference, which simplifies memory management: there is no routing overhead or uneven expert loading. The architecture is derived from Qwen3 causal decoders that have been converted to bidirectional encoders using a two-phase adaptation: (1) a prior masking phase that unlocks bidirectional attention, followed by (2) contrastive training on a multi-domain data mixture. Model weights from specialized vision and audio decoders are then merged linearly, transferring modality-specific capabilities without retraining from scratch.
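The linear merging step can be illustrated with a toy sketch. This is not BidirLM's actual merging code: the `merge_linear` helper, the coefficient values, and the tiny checkpoints are all hypothetical, and plain Python lists stand in for real tensors. It only shows the general idea of a coefficient-weighted sum over checkpoints that share the same parameter layout.

```python
from typing import Dict, List, Sequence

# Parameter name -> flattened weights (a toy stand-in for real tensors).
StateDict = Dict[str, List[float]]


def merge_linear(state_dicts: Sequence[StateDict], coeffs: Sequence[float]) -> StateDict:
    """Return the coefficient-weighted sum of several checkpoints, parameter by parameter."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    if len(state_dicts) != len(coeffs):
        raise ValueError("need exactly one coefficient per checkpoint")

    merged: StateDict = {}
    for name, ref_values in state_dicts[0].items():
        # Linear merging only makes sense when every checkpoint shares the same layout.
        for sd in state_dicts:
            if name not in sd or len(sd[name]) != len(ref_values):
                raise ValueError(f"checkpoint layout mismatch for parameter {name!r}")
        merged[name] = [
            sum(c * sd[name][i] for c, sd in zip(coeffs, state_dicts))
            for i in range(len(ref_values))
        ]
    return merged


# Hypothetical toy checkpoints: a text encoder plus vision and audio specialists.
text_ckpt = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
vision_ckpt = {"layer.weight": [3.0, 2.0], "layer.bias": [1.0]}
audio_ckpt = {"layer.weight": [2.0, 5.0], "layer.bias": [2.0]}

# Illustrative coefficients only; the real mixing weights are not given here.
merged = merge_linear([text_ckpt, vision_ckpt, audio_ckpt], [0.5, 0.25, 0.25])
print(merged)
```

Because the merge is a per-parameter weighted sum, it requires no gradient steps, which is why it can transfer modality-specific capabilities without retraining from scratch.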
Key specs:
The 32k token context is generous for embedding tasks — long documents, full conversations, or multi-image sequences can be processed in a single forward pass. Images are internally resized to a fixed resolution; audio is resampled to 16 kHz.
BidirLM-Omni-2.5B is a multilingual, multimodal embedding model. Its primary value is cross-modal retrieval and similarity: you can encode a text query and compare it directly to image embeddings, or find audio clips that match a textual description. This is possible because all modalities project into the same 2048-dimensional space.
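Because every modality lands in one space, retrieval reduces to ranking stored vectors by similarity to the query vector. The sketch below assumes cosine similarity as the metric and uses hand-written 3-dimensional vectors in place of real 2048-dimensional model outputs; the file names and the helper functions are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for embeddings of a text query and mixed media items.
text_query = [1.0, 0.0, 0.0]
media_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "meow.wav": [0.6, 0.8, 0.0],
    "car.jpg": [0.0, 0.0, 1.0],
}

ranking = rank_by_similarity(text_query, media_index)
print([name for name, _ in ranking])  # ['cat.jpg', 'meow.wav', 'car.jpg']
```

The same ranking function works whether the candidates came from images, audio, or text, since nothing in it depends on the source modality.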
Specific capabilities:
Concrete use cases:
For text-only downstream tasks (classification, NER, regression), you can fine-tune the encoder via the Transformers library — the bidirectional attention makes it a drop-in replacement for BERT-like models.
This model is designed for local inference. The 2.4B parameter count means it fits comfortably on consumer GPUs, especially with 4-bit quantization.
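A rough way to check the memory claim is to multiply the parameter count by the storage width per parameter. The sketch below is a weights-only estimate under that assumption; the `weight_memory_gb` helper is hypothetical, and real usage will be higher once activations, framework overhead, and quantization metadata are included.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Weights-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    if num_params < 0 or bits_per_param <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_param must be > 0")
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 2_400_000_000  # 2.4B parameters, as stated above

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS, bits):.1f} GB")
# FP16: 4.8 GB
# INT8: 2.4 GB
# 4-bit: 1.2 GB
```

The FP16 figure explains why an 8 GB card is a reasonable floor for unquantized inference, while the 4-bit figure leaves headroom even on small GPUs.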
Minimum hardware requirements (FP16):
Recommended setup for most users (Q4_K_M quantization):
Quantization notes:
float16 and int8 (via bitsandbytes or AutoGPTQ). Q4_K_M (4-bit) offers the best trade-off for consumer hardware with minimal accuracy loss on embedding tasks (MTEB scores drop <1% compared to FP16).

Fastest way to get started:
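Install sentence-transformers and torch with CUDA support.
from PIL import Image
from sentence_transformers import SentenceTransformer
# trust_remote_code=True lets the model's custom code from the Hub run locally
model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True)
emb = model.encode("a photo of a cat")
emb_img = model.encode(Image.open("cat.jpg"))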
For quantized inference, use bitsandbytes to load in 8-bit or 4-bit. The model is also integrated with Hugging Face transformers for fine-tuning.
cuDNN warning: If you see very slow inference (seconds per image), upgrade cuDNN to version 9.20.1 or later. Older versions trigger a known NVIDIA bug that makes Conv3D operations 10–100x slower.
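When checking an installed cuDNN against that minimum, compare version numbers component by component rather than as strings, since a plain string comparison would rank "9.3.0" above "9.20.1". The helper below is a small sketch of that comparison; how you obtain the installed version string depends on your environment and is not shown, and the function names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '9.20.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed: str, minimum: str = "9.20.1") -> bool:
    """True if `installed` is at least `minimum`, comparing numeric components."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so '9.20' is treated as '9.20.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(meets_minimum("9.3.0"))   # False: 3 < 20 in the minor component
print(meets_minimum("9.20.1"))  # True
print(meets_minimum("9.21"))    # True
```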
BidirLM-Omni-2.5B competes with other embedding models that are practical to run locally. Here’s how it stacks up against two realistic alternatives, one smaller and one larger.
| Model | BidirLM-Omni-2.5B | BGE-M3 (568M) | E5-Mistral-7B |
|---|---|---|---|
| Modality | Text + Image + Audio | Text (dense + sparse) | Text |
| Languages | 119 | 100+ | 10 (English primarily) |
| Embedding dim | 2048 | 1024 (dense) | 4096 |
| Context length | 32k tokens | 8192 tokens | 32768 tokens |
| License | Apache 2.0 | MIT | MIT |
| VRAM (FP16) | ~5 GB | ~1.1 GB | ~14 GB |
| VRAM (Q4) | ~3 GB | <1 GB | ~7 GB |
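The embedding dimension in the table also determines how much storage a vector index needs. As a back-of-the-envelope sketch, assuming uncompressed float32 vectors (4 bytes per value) and a hypothetical corpus of one million items, the raw index size scales linearly with dimension:

```python
def index_size_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of an uncompressed vector index in decimal gigabytes."""
    return num_vectors * dim * bytes_per_value / 1e9


NUM_VECTORS = 1_000_000  # hypothetical corpus size

for name, dim in [
    ("BidirLM-Omni-2.5B", 2048),
    ("BGE-M3 (dense)", 1024),
    ("E5-Mistral-7B", 4096),
]:
    print(f"{name}: {index_size_gb(NUM_VECTORS, dim):.3f} GB")
# BidirLM-Omni-2.5B: 8.192 GB
# BGE-M3 (dense): 4.096 GB
# E5-Mistral-7B: 16.384 GB
```

These figures exclude any index structure overhead and would shrink with float16 storage or vector quantization.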
When to choose BidirLM-Omni:
When to choose an alternative:
BidirLM-Omni’s defining advantage is its natively multimodal design. Few alternatives at this size encode images and audio without separate encoders, and none achieve competitive performance on all three modalities simultaneously. If your workload mixes media types, this model is the pragmatic choice for local deployment.
| $0.07 |
| NVIDIA GeForce RTX 5060 Ti · Vast.ai · On-Demand · 16 GB VRAM | $0.08 |
| NVIDIA GeForce RTX 5070 Ti · Vast.ai · Spot · 16 GB VRAM | $0.09 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
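One way to account for restarts is to inflate the spot cost by the fraction of work you expect to repeat after interruptions. The sketch below uses the $0.03 RTX 3080 rate listed above for both tiers; the job length and the 20% rerun fraction are hypothetical, and the `effective_cost` helper is illustrative only.

```python
def effective_cost(rate_per_hour: float, job_hours: float, rerun_fraction: float = 0.0) -> float:
    """Total cost when a fraction of the work must be repeated after interruptions."""
    if rerun_fraction < 0:
        raise ValueError("rerun_fraction must be >= 0")
    return rate_per_hour * job_hours * (1.0 + rerun_fraction)


JOB_HOURS = 10          # hypothetical embedding job length
RERUN_FRACTION = 0.2    # hypothetical: 20% of the work is redone after interruptions

on_demand = effective_cost(0.03, JOB_HOURS)
spot = effective_cost(0.03, JOB_HOURS, rerun_fraction=RERUN_FRACTION)

print(f"on-demand: ${on_demand:.2f}")  # on-demand: $0.30
print(f"spot:      ${spot:.2f}")       # spot:      $0.36
```

When the two tiers list the same hourly rate, any rerun overhead makes spot the more expensive option, so spot only pays off when its rate is lower by more than the expected overhead.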