BidirLM

BidirLM-Omni-2.5B-Embedding

A 2.5B omnimodal text+vision+audio encoder built by merging Qwen3 specialists.

2.4B params · Dense

Our Take

A workable 2.4B-parameter dense embedding model from BidirLM. Treat the benchmarks listed on this page as the leading indicator of fit, since composite scoring across modalities is still maturing.

Model Specifications

Parameters: 2.4B

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

Apache 2.0

Performance & Scoring

Benchmarks

Retrieval	61.0
Classification	65.9
Clustering	51.5
STS	75.7

MBA Open Score

50.1 CC

Component	Weight	Score
Benchmark	60%	63.5
Popularity	25%	12.2
Efficiency	15%	59.3
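
The percentages next to the three components suggest that the overall score is a weighted sum. The sketch below reproduces that arithmetic under that assumption; the page does not state the exact formula, and the function and variable names are illustrative only.

```python
# Component scores and weights as listed on this page.
# Assumption: the overall score is a plain weighted sum of these three values.
components = {
    "benchmark": (63.5, 0.60),
    "popularity": (12.2, 0.25),
    "efficiency": (59.3, 0.15),
}


def composite_score(parts):
    """Weighted sum of (score, weight) pairs; the weights must sum to 1."""
    total_weight = sum(weight for _, weight in parts.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(score * weight for score, weight in parts.values())


score = composite_score(components)
print(f"{score:.3f}")
```

The result lands very close to the 50.1 shown above. The small gap is presumably due to the displayed component scores being rounded, which is an assumption rather than something the page confirms.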


Hardware Compatibility

ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.8 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.8 GB
AMD Instinct MI300X	AMD	SS	1.8 GB
AMD Instinct MI325X	AMD	SS	1.8 GB
AMD Instinct MI355X	AMD	SS	1.8 GB
AMD Radeon RX 7600 8GB	AMD	SS	1.8 GB
AMD Radeon RX 7700 XT	AMD	SS	1.8 GB
AMD Radeon RX 7800 XT	AMD	SS	1.8 GB
AMD Radeon RX 7900 XT	AMD	SS	1.8 GB
AMD Radeon RX 7900 XTX	AMD	SS	1.8 GB
AMD Radeon RX 9070	AMD	SS	1.8 GB
AMD Radeon RX 9070 XT	AMD	SS	1.8 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	1.8 GB
Apple M4	Apple	SS	1.8 GB
Apple M4 Max (40-core GPU)	Apple	SS	1.8 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	1.8 GB
Apple M5	Apple	SS	1.8 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	1.8 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	1.8 GB
Apple Mac Mini (M1, 2020)	Apple	SS	1.8 GB
Apple Mac Mini (M2, 2023)	Apple	SS	1.8 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	1.8 GB
Apple Mac Mini (M4, 2024)	Apple	SS	1.8 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	1.8 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	1.8 GB

Rent in the Cloud

Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3080 (Vast.ai · Spot · 10 GB VRAM)	$0.03
NVIDIA GeForce RTX 3080 (Vast.ai · On-Demand · 10 GB VRAM)	$0.03
NVIDIA GeForce RTX 5060 Ti

About This Model

Overview

BidirLM-Omni-2.5B-Embedding is a 2.4 billion parameter bidirectional encoder that produces fixed-size embeddings from text, images, and audio — and aligns them into a single 2048-dimensional representation space. Developed by BidirLM, this model is the omnimodal member of the BidirLM family, built by adapting and merging specialized Qwen3 causal decoders into a unified encoder with bidirectional attention.

This model targets developers who need cross-modal retrieval, semantic similarity, or clustering without relying on cloud APIs. Unlike standard text-only embedding models, BidirLM-Omni directly encodes images and audio alongside text, making it suitable for applications like multimodal search, content-based recommendation, and zero-shot classification across modalities. Its 2.4B parameter count places it in a sweet spot: large enough to capture nuanced representations, small enough to run on consumer hardware with quantization.

Architecture & Technical Details

BidirLM-Omni is a dense transformer encoder — not a mixture of experts. All 2.4B parameters are active during inference, which simplifies memory management: there is no routing overhead or uneven expert loading. The architecture is derived from Qwen3 causal decoders that have been converted to bidirectional encoders using a two-phase adaptation: (1) a prior masking phase that unlocks bidirectional attention, followed by (2) contrastive training on a multi-domain data mixture. Model weights from specialized vision and audio decoders are then merged linearly, transferring modality-specific capabilities without retraining from scratch.
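
To make the linear merging step concrete, here is a minimal sketch of a weighted average over parameters that share names and shapes. It is an illustration only: the page does not give the merge coefficients or say whether raw weights or weight deltas are combined, so the function `merge_linear`, the toy parameter name `"w"`, and the coefficients are all hypothetical.

```python
def merge_linear(state_dicts, coeffs):
    """Combine parameters element-wise as sum_i coeffs[i] * state_dicts[i][name].

    Each state dict maps a parameter name to a flat list of floats.
    All dicts are assumed to share the same names and lengths.
    """
    if len(state_dicts) != len(coeffs):
        raise ValueError("need exactly one coefficient per state dict")
    merged = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(c * sd[name][i] for sd, c in zip(state_dicts, coeffs))
            for i in range(len(reference))
        ]
    return merged


# Toy stand-ins for the text, vision, and audio specialists (hypothetical values).
text_sd = {"w": [1.0, 2.0]}
vision_sd = {"w": [3.0, 4.0]}
audio_sd = {"w": [5.0, 6.0]}

merged = merge_linear([text_sd, vision_sd, audio_sd], [0.5, 0.25, 0.25])
print(merged)
```

Because the operation is a plain weighted sum, it needs no gradient computation, which is consistent with the claim that modality-specific capabilities are transferred without retraining from scratch.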

Parameters: 2.4B (dense, all active)
Embedding dimension: 2048
Context window: 32,768 tokens for text (based on the underlying Qwen3 model)
Input modalities: text (string), images (PIL image), audio (numpy array + sample rate)
Supported languages: 119 languages inherited from Qwen3, reinforced with contrastive data for 87 languages
License: Apache 2.0
Hardware requirement note: Requires cuDNN > 9.20.0 to avoid a Conv3D performance regression on NVIDIA GPUs.

The 32k token context is generous for embedding tasks — long documents, full conversations, or multi-image sequences can be processed in a single forward pass. Images are internally resized to a fixed resolution; audio is resampled to 16 kHz.

Capabilities & Use Cases

BidirLM-Omni-2.5B is a multilingual, multimodal embedding model. Its primary value is cross-modal retrieval and similarity: you can encode a text query and compare it directly to image embeddings, or find audio clips that match a textual description. This is possible because all modalities project into the same 2048-dimensional space.

Text-text similarity: Standard sentence-level semantic similarity across 119 languages. Outperforms many text-only encoders of similar size on MTEB multilingual benchmarks.
Text-image retrieval: Given a text description, retrieve relevant images from a database (e.g., product search, stock photo matching). Images are encoded directly without a separate vision tower — the model fuses vision knowledge via weight merging.
Text-audio retrieval: Encode audio clips (speech, music, environmental sounds) and compare with text queries. Works with any sample rate; the model resamples to 16 kHz internally.
Mixed-modality prompts: Interleave text with images or audio in a single conversational prompt using the chat template. Useful for multi-turn retrieval or captioning evaluation.
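
Because every modality lands in the same vector space, the retrieval tasks above reduce to a nearest-neighbour search by cosine similarity. The sketch below shows that step with toy 3-dimensional vectors standing in for real 2048-dimensional embeddings; the file names and vector values are made up, and `cosine` and `top_k` are illustrative helpers, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Return the k (item_id, similarity) pairs most similar to the query."""
    scored = [(item_id, cosine(query, vec)) for item_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings: one text query and a mixed image/audio corpus.
query = [1.0, 0.0, 0.0]
corpus = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.wav": [0.0, 1.0, 0.0],
    "car.jpg": [0.5, 0.5, 0.0],
}

results = top_k(query, corpus, k=2)
for item_id, similarity in results:
    print(item_id, round(similarity, 3))
```

For a large corpus you would normally normalize the embeddings once and use a vector index instead of a Python loop; the ranking logic stays the same.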

Multimodal semantic search for e-commerce (search by text, find images and audio descriptions)
Content moderation across media types
Large-scale clustering of mixed media datasets
Cross-lingual retrieval with visual context (e.g., find a product shown in an image across 50 languages)
Audio-based retrieval in podcasts or music libraries

For text-only downstream tasks (classification, NER, regression), you can fine-tune the encoder via the Transformers library — the bidirectional attention makes it a drop-in replacement for BERT-like models.

Running BidirLM-Omni-2.5B-Embedding Locally

This model is designed for local inference. The 2.4B parameter count means it fits comfortably on consumer GPUs, especially with 4-bit quantization.

FP16 (unquantized):
VRAM: ~5 GB for the model weights (2.4B × 2 bytes = 4.8 GB) plus ~1 GB for activations and overhead → about 6 GB total
GPU: NVIDIA RTX 3060 (12 GB), RTX 4060 Ti (16 GB), or any card with at least 8 GB. Also works on Apple Silicon with 16 GB unified memory (M1 Pro/Max or newer) using the PyTorch MPS backend.
RAM: 16 GB (32 GB recommended for processing long text or large images)

4-bit quantized (Q4):
VRAM: ~3 GB for weights + overhead → about 4 GB total
GPU: RTX 3050 (8 GB), RTX 2060, GTX 1660 Super, or any card with 6 GB VRAM. On Apple Silicon, 8 GB unified memory suffices.
Peak performance: RTX 4090 can encode ~800–1,200 tokens per second (text) in FP16; with Q4_K_M, throughput increases to ~1,500 t/s.
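
The weight-memory figures above follow from parameter count times bytes per parameter. The sketch below reproduces only that part of the arithmetic; it ignores activations, framework overhead, and the extra metadata that real quantization formats store, so treat the quantized numbers as lower bounds. The helper name `weight_memory_gb` is illustrative.

```python
N_PARAMS = 2.4e9  # parameter count quoted on this page


def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
int8_gb = weight_memory_gb(N_PARAMS, 8)
int4_gb = weight_memory_gb(N_PARAMS, 4)

for label, gb in [("FP16", fp16_gb), ("INT8", int8_gb), ("4-bit", int4_gb)]:
    print(f"{label}: {gb:.1f} GB for weights only")
```

The FP16 result matches the 4.8 GB quoted above. The totals on this page are higher because they also budget for activations and runtime overhead.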

The model is available in float16 and int8 (via bitsandbytes or AutoGPTQ). Q4_K_M (4-bit) offers the best trade-off for consumer hardware with minimal accuracy loss on embedding tasks (MTEB scores drop <1% compared to FP16).
Avoid Q2_K or Q3_K for retrieval tasks — you may see significant recall degradation.

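Install sentence-transformers and torch with CUDA support.
Load the model, then encode text or an image:

from PIL import Image
from sentence_transformers import SentenceTransformer

# trust_remote_code=True lets the checkpoint's custom model code run
model = SentenceTransformer("BidirLM/BidirLM-Omni-2.5B-Embedding", trust_remote_code=True)

emb = model.encode("a photo of a cat")  # text input
emb_img = model.encode(Image.open("cat.jpg"))  # image input as a PIL image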

For quantized inference, use bitsandbytes to load in 8-bit or 4-bit. The model is also integrated with Hugging Face transformers for fine-tuning.

cuDNN warning: If you see very slow inference (seconds per image), upgrade cuDNN to version 9.20.1 or later. Older versions trigger a known NVIDIA bug that makes Conv3D operations 10–100x slower.
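
One way to check for the affected cuDNN versions before running image workloads is sketched below. It assumes the integer encoding used by cuDNN 9 (major × 10000 + minor × 100 + patch) and reads the value through PyTorch's `torch.backends.cudnn.version()`; the helper names are illustrative, and the minimum version is the one named in the warning above.

```python
MIN_CUDNN = (9, 20, 1)  # minimum version named in the warning above


def decode_cudnn_version(raw):
    """Split cuDNN's integer version into (major, minor, patch).

    Assumes the cuDNN 9 encoding: major * 10000 + minor * 100 + patch.
    """
    return raw // 10000, (raw % 10000) // 100, raw % 100


def cudnn_is_recent_enough(raw, minimum=MIN_CUDNN):
    """True if the decoded version is at least the given minimum."""
    return decode_cudnn_version(raw) >= minimum


def installed_cudnn_raw():
    """Return the integer version PyTorch reports, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.backends.cudnn.version()


if __name__ == "__main__":
    raw = installed_cudnn_raw()
    if raw is None:
        print("PyTorch or cuDNN not available; nothing to check")
    elif cudnn_is_recent_enough(raw):
        print("cuDNN", decode_cudnn_version(raw), "meets the minimum")
    else:
        print("cuDNN", decode_cudnn_version(raw), "is older than", MIN_CUDNN)
```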

How It Compares

BidirLM-Omni-2.5B competes with other multilingual embedding models. Here’s how it stacks up against two realistic alternatives, one smaller and one larger.

Model	BidirLM-Omni-2.5B	BGE-M3 (~0.6B)	E5-Mistral-7B
Modality	Text + Image + Audio	Text (dense + sparse)	Text
Languages	119	100+	10 (English primarily)
Embedding dim	2048	1024 (dense)	4096
Context length	32,768 tokens	8,192 tokens	32,768 tokens
License	Apache 2.0	MIT	MIT
VRAM (FP16)	~5 GB	~1 GB	~14 GB
VRAM (Q4)	~3 GB	<1 GB	~7 GB

Choose BidirLM-Omni-2.5B if:
You need cross-modal retrieval (text ↔ image ↔ audio) in a single model. BGE-M3 and E5 are text-only.
Your data spans many languages (80+) and you want a single encoder without per-language models.
You want to run on a mid-range GPU with 8 GB VRAM; E5-Mistral-7B requires more memory unless heavily quantized.

Consider an alternative if:
You want hybrid search. BGE-M3 offers both dense and sparse embeddings (lexical + semantic), and it also supports late interaction for fine-grained scoring.
You care most about monolingual English retrieval. E5-Mistral-7B may yield slightly higher accuracy there due to larger model capacity and more training data.
You only need text and want full support from established tools (Elasticsearch, Milvus). BGE-M3 has wider ecosystem integration today.

BidirLM-Omni’s defining advantage is its natively multimodal design. Few alternatives at this size encode images and audio without separate encoders while remaining competitive on all three modalities. If your workload mixes media types, this model is the pragmatic choice for local deployment.

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

Related Models

BidirLM-1.7B-Embedding

BidirLM-1B-Embedding
