
Causal Gemma3-1B turned into a strong bidirectional embedder via masking-then-contrastive adaptation.
A workable 1B-parameter dense embedding model from BidirLM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti · Vast.ai · Spot · 16 GB VRAM | $0.10 |
| NVIDIA GeForce RTX 3070 · RunPod · Community · 8 GB VRAM | $0.13 |
| NVIDIA GeForce RTX 3070 · RunPod · Spot · 8 GB VRAM | $0.13 |
| NVIDIA GeForce RTX 5070 Ti · Vast.ai · On-Demand · 16 GB VRAM | $0.13 |
| NVIDIA GeForce RTX 3080 · RunPod · Community · 10 GB VRAM | $0.17 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
BidirLM-1B-Embedding is a dense 1‑billion‑parameter bidirectional text embedding model, adapted from the causal decoder Gemma3‑1B. Developed by BidirLM and released under the Gemma License, it converts a standard generative LLM into a strong encoder using a two‑stage recipe: masked next‑token prediction (MNTP) to unlock bidirectional attention, followed by contrastive fine‑tuning on multilingual data. The result is a compact embedder that scores 62.1 on MTEB Multilingual V2 (mean task) — competitive with much larger alternatives — while staying small enough to run on consumer hardware.
This model fills the gap between BERT‑style encoders (which cap out around 300M params) and large embedding models (which require datacenter GPUs). It inherits the rich representation knowledge of Gemma3, then specializes it for dense retrieval, semantic similarity, classification, and downstream fine‑tuning — all without needing cloud APIs.
The default context is 512 tokens and can be raised toward the 32K ceiling of the Gemma3 backbone (set model.max_seq_length or max_length accordingly).
The adaptation process is critical: unlike contrastive-only models, BidirLM first trains a Fill-Mask checkpoint via MNTP, which teaches the causal LLM to attend bidirectionally. This step guards against catastrophic forgetting and makes the encoder effective for token-level tasks (NER, classification) as well as generic embedding benchmarks. The final embedding model (BidirLM-1B-Embedding) is the MNTP checkpoint further tuned with contrastive losses.
Because it's a dense model, every parameter is active during inference, so weight memory scales linearly with parameter count: a 1B model at FP16 (2 bytes per parameter) needs roughly 2 GB of GPU memory for its weights, plus some overhead for activations. A mixture-of-experts model with the same number of active parameters would need more, because its inactive experts must also stay in memory.
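The 2 GB figure follows from simple arithmetic, sketched below. The function name and the precision table are illustrative, not part of any library, and the estimate covers weights only (activations and framework overhead come on top).

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# A 1B-parameter dense model at each precision.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(1e9, dtype):.1f} GB")
```

Treat the result as a lower bound when picking a GPU: batch size and sequence length add activation memory on top of the weights.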
BidirLM-1B-Embedding excels at two broad categories of work: generic text embeddings (via Sentence Transformers) and downstream fine‑tuning (via HuggingFace Transformers).
Because the model was pre-trained with MNTP, it retains strong token-level understanding. You can fine-tune it directly for token-level tasks such as NER and classification.
Concrete use cases: building a multilingual semantic search engine for 100+ languages, fine‑tuning a custom NER pipeline on legal documents, or replacing a heavier embedding model in a retrieval‑augmented pipeline on a single RTX 4090.
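The semantic search use case comes down to ranking documents by the cosine similarity between their embeddings and the query embedding. Here is a minimal sketch of that ranking step in plain Python. The function names are hypothetical, and the vectors are assumed to come from a call such as `model.encode(...)`.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

For large corpora, a vectorized or approximate nearest-neighbour index would replace the Python loop; the scoring logic stays the same.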
This is where the model shines: you can run it on a single consumer GPU without compromises.
Tokens‑per‑second depends on hardware and quantization. Rough estimates for a single sequence (512 tokens):
Processing documents in batches raises throughput, roughly linearly until GPU compute or VRAM is saturated.
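One way to apply batching to a large corpus is to feed the encoder fixed-size chunks, so that only one batch of inputs is in flight at a time. The helpers below are a sketch with hypothetical names. `encode_fn` stands in for any callable that maps a list of strings to a sequence of vectors, for example the `encode` method of a loaded Sentence Transformers model.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def encode_corpus(
    encode_fn: Callable[[List[str]], Sequence],
    texts: Sequence[str],
    batch_size: int = 32,
) -> list:
    """Encode `texts` batch by batch and collect one vector per text, in order."""
    vectors: list = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(encode_fn(batch))
    return vectors
```

The right `batch_size` depends on sequence length and available VRAM; a common approach is to increase it until memory runs out, then step back.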
The easiest way to run it locally is via [Ollama](https://ollama.com). While an official model may not be available on day one, you can pull the GGUF conversion from HuggingFace community repos or convert the model yourself with llama.cpp. Alternatively, use Sentence Transformers directly:
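```python
from sentence_transformers import SentenceTransformer

# trust_remote_code=True allows the repo's custom bidirectional-attention code to load
model = SentenceTransformer("BidirLM/BidirLM-1B", trust_remote_code=True)
# encode() returns one embedding vector per input string
embeddings = model.encode(["Your text here"])
```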
Note the trust_remote_code=True requirement – this is mandatory because BidirLM uses custom modeling code to enable bidirectional attention.
| Model | Parameters | MTEB Multi. V2 | Context | Multilingual | License |
|---|---|---|---|---|---|
| BidirLM-1B-Embedding | 1.0B | 62.1 | 512 (max 32K) | 140+ | Gemma License |
| BGE‑M3 (BAAI) | 567M | ~59.5 | 8192 | 100+ | MIT |
| Stella‑400M | 400M | ~57.0 | 512 | English‑focused | MIT |
When to choose BidirLM‑1B: If you need the highest multilingual embedding quality at the 1B scale, especially for languages beyond English. The MNTP pre‑training also makes it uniquely suited for fine‑tuning on token‑level tasks – BGE‑M3 and Stella are contrastive‑only and lack this capability. The 32K token ceiling (via Gemma3 backbone) also allows longer documents when you increase max_seq_length, though MTEB scores are only validated at 512.
When to choose an alternative: If you’re constrained to a sub‑500M model (e.g., edge devices), Stella‑400M or BGE‑Small offer lower VRAM. For English‑only retrieval at higher throughput, BGE‑M3 is slightly faster and permissively licensed. BidirLM’s Gemma License imposes usage restrictions (see the license for details) – if Apache‑2.0 or MIT is mandatory, the alternatives are safer.
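The longer-context option mentioned in the comparison can be sketched as a small helper that raises `max_seq_length` on a loaded Sentence Transformers model while refusing to exceed the backbone ceiling. The helper name is hypothetical, and the cap of 32,768 tokens is an assumption that reads "32K" literally. Benchmark scores reported at 512 tokens do not carry over to longer inputs.

```python
# Assumption: "32K" is taken to mean 32,768 tokens.
BACKBONE_MAX_TOKENS = 32_768


def set_context_length(model, max_tokens: int) -> int:
    """Set model.max_seq_length, capped at the backbone ceiling.

    `model` is any object exposing a writable `max_seq_length` attribute,
    such as a loaded SentenceTransformer. Returns the value actually set.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    capped = min(max_tokens, BACKBONE_MAX_TOKENS)
    model.max_seq_length = capped
    return capped
```

Longer sequences increase activation memory and latency, so it is worth re-checking VRAM headroom after raising the limit.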