
Compact 1.7B multilingual embedder for resource-constrained deployments.
A workable 1.7B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3070 | RunPod | Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod | Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
F2LLM-v2-1.7B is a general-purpose, multilingual embedding model developed by CodeFuse-AI (Ant Group). It’s part of the F2LLM-v2 family, which spans eight sizes from 80M to 14B parameters, all trained on a curated composite of 60 million publicly available high-quality text samples. The 1.7B variant is the sweet spot for practitioners who need strong multilingual retrieval and semantic understanding without the VRAM footprint of larger models.
This is not a chatbot or generative model—it’s a dense encoder optimized for feature extraction and sentence embeddings. Its primary value is producing high-quality vector representations for search, clustering, classification, and retrieval-augmented generation (RAG) pipelines, especially in environments where you cannot offload to cloud APIs. The model supports over 200 languages, with particular attention to mid- and low-resource languages that are often underserved by mainstream embedders.
F2LLM-v2-1.7B competes with smaller multilingual embedders such as multilingual-e5-small (118M parameters) and bge-m3 (567M parameters). What sets it apart is its open-source transparency: Ant Group released the full training recipe, intermediate checkpoints, and data, making it a reproducible, auditable choice for production systems.
F2LLM-v2-1.7B is a dense transformer with 1.7 billion parameters. It is built on a decoder-only backbone (Qwen3) and fine-tuned exclusively for embedding tasks through a two-stage pipeline.
The model is designed for the transformers library and integrates directly with Sentence Transformers. It is a dense architecture with no mixture-of-experts (MoE) routing, so memory use and latency at inference are predictable.
Context length is not officially specified, but based on the Qwen3 backbone and typical embedding model defaults, you can expect at least 8192 tokens. For most embedding use cases (sentences, paragraphs, documents under 8K tokens), this is sufficient.
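Since the context length is not officially specified, it is worth checking the downloaded model files rather than relying on the 8192 estimate. The sketch below assumes the model ships a `config.json` in the usual Hugging Face layout with a `max_position_embeddings` field; the helper names and the sample values are illustrative, not taken from the real model files.

```python
import json
from pathlib import Path
from typing import Optional


def context_length_from_config(config: dict) -> Optional[int]:
    """Return the positional limit declared in a parsed config.json, if any."""
    # `max_position_embeddings` is the conventional Hugging Face key; whether
    # this model's config sets it is an assumption to verify locally.
    value = config.get("max_position_embeddings")
    return int(value) if value is not None else None


def load_config(path: str) -> dict:
    """Read a local config.json (the path is supplied by the caller)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Illustrative values only, not read from the actual model repository.
sample = {"model_type": "qwen3", "max_position_embeddings": 8192}
print(context_length_from_config(sample))
```

If the field is missing, the function returns `None`, which is a signal to fall back to the model documentation or to truncate inputs conservatively.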
F2LLM-v2-1.7B excels at multilingual semantic search and cross-lingual retrieval. It achieves state-of-the-art results on 11 language-specific MTEB leaderboards, including European, Scandinavian, Indic, and East Asian languages.
This model is designed for resource-constrained deployments. Here’s what you need to run it on your own hardware.
| Quantization | VRAM (approx.) | Notes |
|---|---|---|
| FP16 (unquantized) | ~3.5 GB | Best accuracy, but overkill for most retrieval tasks |
| Q8_0 | ~2.0 GB | Near-lossless compression, recommended for high-precision use |
| Q4_K_M | ~1.2 GB | Sweet spot for most users—good accuracy, minimal memory |
| Q4_0 | ~1.0 GB | Slightly lower quality, fits on GPUs with 1 GB VRAM |
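The VRAM figures in the table can be sanity-checked with a back-of-the-envelope calculation: parameters times bits per weight, divided by eight. This is a rough sketch, not a measurement. The FP16, Q8_0, and Q4_0 bit widths follow from their storage layouts, while the Q4_K_M value is an approximate average and should be treated as an assumption.

```python
PARAMS = 1.7e9  # parameter count implied by the model name

# Bits per weight for each format.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # 32 int8 weights plus one fp16 scale per block
    "Q4_0": 4.5,     # 32 4-bit weights plus one fp16 scale per block
    "Q4_K_M": 4.85,  # mixed-precision scheme; approximate, an assumption
}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gb(PARAMS, bits):.2f} GB")
```

These numbers cover the weights only. Activations and runtime buffers add overhead on top, which is consistent with the table values being somewhat higher than the raw weight sizes.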
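On an RTX 4090 with Q4_K_M, you can process 1000+ tokens/second at batch size 1. At that floor, a ~32-token sentence takes no more than about 32 ms; measure single-sentence latency on your own setup, since it depends on the runtime and batch configuration. Throughput grows roughly linearly with batch size until the GPU’s compute or memory is saturated.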
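To turn the throughput figure into a budget, a quick estimate of wall-clock time and rental cost for a corpus can help. The sketch below uses the 1000 tokens/second floor and the $0.13 per GPU-hour rate quoted on this page; the function name and the `restart_overhead` parameter (extra time lost to spot interruptions) are hypothetical and only there for illustration.

```python
from typing import Tuple


def embedding_job_estimate(
    total_tokens: int,
    tokens_per_second: float,
    usd_per_gpu_hour: float,
    restart_overhead: float = 0.0,
) -> Tuple[float, float]:
    """Return (hours, cost in USD) for embedding `total_tokens` on one GPU.

    `restart_overhead` is a fraction of extra time, e.g. 0.1 for 10 percent
    lost to spot-instance restarts. It is a guess you supply, not a measurement.
    """
    seconds = total_tokens / tokens_per_second * (1.0 + restart_overhead)
    hours = seconds / 3600.0
    return hours, hours * usd_per_gpu_hour


# 100 million tokens at 1000 tokens/s on a $0.13/hour spot GPU.
hours, cost = embedding_job_estimate(100_000_000, 1000.0, 0.13, restart_overhead=0.1)
print(f"{hours:.1f} GPU-hours, ${cost:.2f}")
```

Batching usually raises throughput well above the batch-size-1 figure, so treat the result as a conservative upper bound on time and cost.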
The fastest way to get started:
```shell
ollama run codefuse-ai/f2llm-v2-1.7b
```
This pulls the Q4_K_M quantized model and provides a simple API for embedding. For more control, use the transformers library with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

# Load the model, then encode a list of texts into embedding vectors.
model = SentenceTransformer('codefuse-ai/F2LLM-v2-1.7B')
embeddings = model.encode(["Your text here"])
```
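Once you have embeddings, semantic search comes down to ranking documents by cosine similarity to the query vector. The following is a minimal dependency-free sketch of that step; the function names are illustrative, and the toy 3-dimensional vectors stand in for real model output.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float],
         doc_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings of three documents and one query.
docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [0.8, 0.6, 0.0]
print([index for index, _ in rank(query, docs)])
```

The same functions accept rows of the array returned by `model.encode`, although for large collections a vectorized library or a vector index will be much faster than this loop.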
| Model | Parameters | Languages | Strengths | Tradeoffs |
|---|---|---|---|---|
| F2LLM-v2-1.7B | 1.7B | 200+ | Strong low-resource language support, open-source training data, Matryoshka representation learning (MRL) | Roughly 3x larger than 0.6B alternatives, no generative capability |
| multilingual-e5-small | 118M | 100+ | Very small, fast | Weaker on low-resource languages, lower accuracy overall |
| bge-m3 | 567M | 100+ | Good general multilingual performance | Larger than e5-small, less transparent training data |
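Assuming "MRL" in the table refers to Matryoshka representation learning, the practical benefit is that you can keep only the leading dimensions of each embedding and renormalize, trading some accuracy for smaller indexes. The sketch below shows that truncation step in plain Python; the function name is illustrative, and which output dimensions this model was trained to support is not stated here, so check the model documentation before picking a size.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and the embedding size")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a real embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)
print(len(short), round(math.sqrt(sum(x * x for x in short)), 6))
```

Sentence Transformers exposes a `truncate_dim` option that serves the same purpose at load time; renormalizing afterwards keeps cosine and dot-product scores comparable.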
Choose F2LLM-v2-1.7B when you need strong multilingual retrieval accuracy in a compact model, especially if your use case includes languages like Hindi, Arabic, or Vietnamese. If you’re strictly English-only or need a smaller footprint, the 0.6B variant of F2LLM-v2 may be a better fit. For pure speed and minimal VRAM, multilingual-e5-small is still a solid option, but you’ll sacrifice accuracy on mid- and low-resource languages.