
A 1.7B bidirectional encoder distilled from causal Qwen3-1.7B via masking + contrastive adaptation.
A workable 1.7B-parameter dense embedding model from BidirLM. Treat the modality benchmarks above as the leading indicator of fit; composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 3080 | Vast.ai | Spot | 10 GB | $0.03 |
| NVIDIA GeForce RTX 3080 | Vast.ai | On-Demand | 10 GB | $0.03 |
| NVIDIA GeForce RTX 5060 Ti | Vast.ai | Spot | 16 GB | $0.07 |
| NVIDIA GeForce RTX 5060 Ti | Vast.ai | On-Demand | 16 GB | $0.08 |
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.09 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
BidirLM-1.7B-Embedding is a 1.7 billion parameter bidirectional encoder designed for text representation. It is the result of a focused distillation process: the team at BidirLM took a causal decoder (Qwen3-1.7B) and transformed it into an efficient encoder using a two-stage pipeline – masked next-token prediction (MNTP) followed by contrastive adaptation. The outcome is a dense, text-only model that achieves a mean MTEB Multilingual V2 score of 62.9, competitive with open-source embedding models at twice its size.
This model fits into the growing category of “encoder-only” systems derived from large language models, offering a practical alternative to classic BERT-style architectures. Unlike many embedding models that rely solely on contrastive learning, BidirLM-1.7B-Embedding first undergoes a masking phase that preserves the model’s ability to be fine-tuned on downstream tasks such as NER, classification, and NLI. For developers who need a single, multilingual embedding model that also works as a backbone for task-specific fine-tuning, this is a strong candidate.
BidirLM-1.7B-Embedding is a dense model with 1.7 billion parameters – all weights are active during every forward pass. This contrasts with mixture-of-experts (MoE) models, where only a subset of parameters is used per token, which lowers compute per token but still requires the full set of expert weights to be held in memory. A dense 1.7B model strikes a balance: it is small enough to run on consumer hardware yet large enough to capture nuanced multilingual representations.
The architecture is built on the Qwen3 transformer with an embedding dimension of 2048 and a maximum token limit of 512 for MTEB evaluation. However, the underlying Qwen3 backbone supports up to 32,768 tokens in principle. You can increase model.max_seq_length in Sentence Transformers or adjust max_length in the tokenizer to handle longer documents – though doing so will increase VRAM usage and may degrade retrieval performance if the model was not trained on longer sequences.
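A minimal sketch of raising the sequence limit in Sentence Transformers, assuming the 512-token evaluation limit and the 32,768-token backbone limit stated above. The helper names `choose_max_seq_length` and `load_encoder` are hypothetical, and calling `load_encoder` downloads the weights on first use.

```python
EVAL_MAX_TOKENS = 512          # limit used for MTEB evaluation (from the text)
BACKBONE_MAX_TOKENS = 32_768   # Qwen3 backbone limit "in principle" (from the text)


def choose_max_seq_length(requested: int) -> int:
    """Clamp a requested sequence length to what the backbone can accept."""
    if requested < 1:
        raise ValueError("requested length must be positive")
    return min(requested, BACKBONE_MAX_TOKENS)


def load_encoder(max_tokens: int = EVAL_MAX_TOKENS):
    """Load the model and set its truncation length (hypothetical helper)."""
    # Imported lazily so the length helper above works without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "BidirLM/BidirLM-1.7B-Embedding", trust_remote_code=True
    )
    # Inputs longer than this many tokens are truncated before encoding.
    model.max_seq_length = choose_max_seq_length(max_tokens)
    return model
```

Lengths well above 512 are untested territory according to the text, so it is worth checking retrieval quality on your own data before relying on them.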
Because the model ships custom modeling code, loading it through Hugging Face’s transformers or sentence-transformers libraries requires trust_remote_code=True. It outputs 2048-dimensional vectors, twice the 1024 dimensions of other modern embedding models such as BGE-M3 and multilingual-e5-large. The larger dimension can improve retrieval fidelity, but at the same numeric precision it also doubles the storage and memory needed for index vectors.
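To make the storage trade-off concrete, here is a rough calculation of raw index size, assuming a flat, uncompressed index of float32 values with no metadata or graph overhead. The function name `index_size_gib` and the one-million-vector corpus are illustrative assumptions.

```python
def index_size_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of a flat vector index in GiB (values only, no index overhead)."""
    return num_vectors * dim * bytes_per_value / 2**30


# Compare a 2048-dimensional index with a 1024-dimensional one.
for label, dim in [("2048-dim (this model)", 2048), ("1024-dim", 1024)]:
    size = index_size_gib(1_000_000, dim)
    print(f"{label}: {size:.2f} GiB per million float32 vectors")
```

Storing vectors in float16 or applying vector quantization in the index would shrink both figures proportionally.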
BidirLM-1.7B-Embedding is primarily a sentence and document embedding model – it converts text into dense vectors that can be compared for similarity, clustering, classification, and retrieval. It supports the full suite of MTEB tasks: semantic textual similarity (STS), clustering, pair classification, reranking, bitext mining, and multi-label classification. Additionally, it can be fine-tuned for sequence classification (e.g., NLI, sentiment) and token classification (e.g., NER) via the standard transformers API.
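Comparing embeddings for similarity usually means cosine similarity. The sketch below ranks documents against a query using toy 3-dimensional vectors in place of real 2048-dimensional model output; the vectors, document ids, and function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, docs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model embeddings.
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [0.9, 0.1, 0.8],    # points nearly the same way as the query
    "doc_b": [0.0, 1.0, 0.0],    # orthogonal to the query
    "doc_c": [-1.0, 0.0, -1.0],  # points the opposite way
}
ranking = rank_by_similarity(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In practice the vectors would come from the model’s encode call, and a vector index would replace the linear scan once the corpus grows.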
Key strengths:
Concrete use cases:
This model is well-suited to local deployment. At 1.7B parameters and full (float32) precision, it requires roughly 6.8 GB of GPU memory for the weights alone. Most practitioners will use a quantized version to reduce memory and improve throughput.
VRAM requirements by quantization:
| Precision | VRAM (approx) | Use case |
|---|---|---|
| FP16 | ~3.4 GB | Recommended for maximum accuracy on RTX 4090, A-series, or M-series with >8 GB |
| Q8_0 (8-bit) | ~1.7 GB | Good balance for lower-end GPUs |
| Q4_K_M (4-bit) | ~1.1 GB | Works on 6–8 GB GPUs; minimal accuracy loss for embeddings |
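The weight-memory figures follow from parameter count times bits per weight. The sketch below reproduces them under the simplifying assumption of nominal bit widths (32, 16, 8, 4); it ignores activations, tokenizer, and runtime overhead, and the names used are illustrative.

```python
PARAMS = 1.7e9  # parameter count stated for the model

# Nominal bits per weight; real quantized formats add per-block metadata.
NOMINAL_BITS = {"FP32": 32, "FP16": 16, "8-bit": 8, "4-bit": 4}


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


estimates = {name: weight_memory_gb(PARAMS, bits) for name, bits in NOMINAL_BITS.items()}
for name, gb in estimates.items():
    print(f"{name}: {gb:.2f} GB")
```

The nominal 4-bit result (0.85 GB) is lower than the ~1.1 GB listed for Q4_K_M, most likely because that format stores per-block scales and keeps some tensors at higher precision.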
Consumer hardware that can run it:
Recommended quantization: For most users, Q4_K_M offers the best trade-off of memory, speed, and quality. The 4-bit quantized model loads in about 1.1 GB, leaving room for tokenizer, cache, and running multiple instances.
Expected tokens per second: On a single RTX 4090 with sequence length 512 and batch size 1, you can expect roughly 120–180 tokens/sec in FP16. With Q4_K_M, this improves to 200–250 tokens/sec. For longer sequences (e.g., 2048 tokens), throughput drops to 40–60 tokens/sec.
Quickstart with Ollama: Although not officially packaged, you can convert the model to GGUF format and run it via ollama. Use a tool like llama.cpp to quantize and create a Modelfile. Alternatively, the fastest path is with Sentence Transformers:
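
    from sentence_transformers import SentenceTransformer

    # The model ships custom code, so trust_remote_code=True is needed when loading.
    model = SentenceTransformer("BidirLM/BidirLM-1.7B-Embedding", trust_remote_code=True)
    embeddings = model.encode(["Hello, world"])  # one embedding vector per input string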
BidirLM-1.7B-Embedding vs. BGE-M3 (BAAI/bge-m3)
BidirLM-1.7B-Embedding vs. multilingual-e5-large (intfloat/multilingual-e5-large)
BidirLM-1.7B-Embedding vs. BidirLM-0.6B
Choose BidirLM-1.7B-Embedding when you need the best available open-source multilingual embedding from a 1.7B class model, with the ability to fine-tune for downstream tasks without switching frameworks.