
Sub-1B multilingual embedder; teacher for the F2LLM-v2 80M/160M/330M distilled siblings.
A solid 0.596B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
F2LLM-v2-0.6B is a general-purpose, multilingual text embedding model developed by CodeFuse-AI, a team within Ant Group and Shanghai Jiao Tong University. At 0.596 billion parameters, it occupies a specific niche: a compact but high-performing embedder that serves both as a standalone feature extractor and as the teacher model for a family of smaller distilled siblings (80M, 160M, 330M). Licensed under Apache 2.0, it is fully open — weights, training data, and intermediate checkpoints are all released.
The model is part of the F2LLM-v2 series, which spans eight sizes from 80M to 14B parameters and was trained on a curated composite of 60 million publicly available, high-quality text pairs. The 0.6B version is the smallest of the “base” models (the others being 1.7B, 4B, 8B, and 14B) and is the one you’d reach for when you need strong multilingual embedding quality but have a tight memory or inference budget.
Practitioners should care because this model breaks the English-centric barrier: it supports more than 200 languages, with explicit emphasis on mid- and low-resource languages that are often poorly served by other embedding models. For retrieval-augmented generation (RAG), semantic search, or clustering pipelines that need to handle multilingual content on local hardware, F2LLM-v2-0.6B is a strong candidate.
---
F2LLM-v2-0.6B is a dense transformer model with 0.596 billion parameters. It is built on a Qwen3 backbone (per the training code); Qwen3 is a decoder-only architecture, used here as a text encoder whose final hidden states are reduced to a single embedding vector. The model is designed for feature extraction (pipeline_tag: feature-extraction on Hugging Face) and is optimized for use with sentence-transformers.
Key architectural points:
Dense architecture means all 0.6B parameters are active during every forward pass. This is straightforward to run — no expert routing overhead, no variable memory depending on input. VRAM usage is predictable: roughly 1.2 GB at FP16 for the model weights, plus a small overhead for activations (typically <0.5 GB for batch size 1). That puts it well within reach of any consumer GPU with 4 GB VRAM or more.
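The weight figure above follows from simple arithmetic: parameter count times bytes per weight. The sketch below reproduces it for a few nominal precisions. The bytes-per-weight values are idealized assumptions; real quantized files store extra scale metadata, so the 8-bit and 4-bit results are lower bounds rather than exact file sizes.

```python
# Rough weight-memory estimate: parameters x bytes per weight.
PARAMS = 0.596e9  # parameter count quoted in the text

# Nominal bytes per weight (assumption: no quantization metadata overhead).
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(params: float, bytes_per_weight: float) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_weight / 1e9


estimates = {
    name: round(weight_gb(PARAMS, nbytes), 2)
    for name, nbytes in BYTES_PER_WEIGHT.items()
}

for name, gb in estimates.items():
    print(f"{name:>10}: ~{gb} GB")
```

Activations and framework overhead come on top of these numbers, so leave headroom when matching the result against a GPU's VRAM.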
---
F2LLM-v2-0.6B is a text-only embedding model. It does not generate text; it maps input text to a dense vector that captures semantic meaning. Its primary capabilities are:
---
This is where the model shines: you can run it on consumer hardware without breaking a sweat.
| Precision | Model Weights | Example GPU |
|---|---|---|
| FP32 | ~2.4 GB | RTX 3060 (12 GB), GTX 1080 Ti |
| FP16 / BF16 | ~1.2 GB | RTX 2060 (6 GB), M1 Macs |
| Q8_0 (int8) | ~0.6 GB | Any GPU with 4 GB VRAM, or CPU with 8 GB RAM |
| Q4_K_M (int4) | ~0.35 GB | Integrated GPUs, low-power devices |
Recommended setup: For most users, run at FP16 on a GTX 1660 Super (6 GB) or better. That leaves ample VRAM for batch processing or running a small LLM alongside. If you're on a laptop with an RTX 3050 (4 GB), use Q8_0 quantization — the performance hit on embeddings is usually negligible (<1% on MTEB tasks).
On Apple Silicon, the model can run through PyTorch's mps backend. Expect comparable throughput to an RTX 3060.

Testing on a single RTX 3090 (FP16, batch size 1):
| Input length | Throughput (tokens/sec, approx.) |
|---|---|
| 128 tokens | 2000+ |
| 512 tokens | 1500+ |
| 1024 tokens | 900+ |
These figures are high because an embedding model needs only a single forward pass per input, with no token-by-token decoding. Even on a laptop RTX 3050 at Q8, expect at least 300-500 tokens/sec for typical sentence-length inputs.
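Throughput numbers like these can be turned into a rough time and cost estimate for embedding a whole corpus. The sketch below is a back-of-the-envelope calculation, not a benchmark: the corpus size and the hourly rate are hypothetical, and the 1500 tokens/sec figure is the batch-size-1 value for 512-token inputs quoted above, so batched inference would likely finish sooner.

```python
def embedding_job_estimate(total_tokens: int, tokens_per_sec: float,
                           usd_per_gpu_hour: float) -> tuple[float, float]:
    """Return (gpu_hours, cost_usd) for embedding total_tokens on one GPU."""
    hours = total_tokens / tokens_per_sec / 3600
    return hours, hours * usd_per_gpu_hour


# Hypothetical corpus: 1,000,000 chunks of 512 tokens each.
corpus_tokens = 1_000_000 * 512

# 1500 tokens/sec: the 512-token figure quoted in the text.
# $0.13 per GPU-hour: an illustrative rental rate, not a quote for an RTX 3090.
hours, cost = embedding_job_estimate(
    corpus_tokens, tokens_per_sec=1500, usd_per_gpu_hour=0.13
)
print(f"{hours:.1f} GPU-hours, about ${cost:.2f}")
```

On spot instances, add a margin for interruptions and restarts, since the estimate assumes the GPU runs continuously.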
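As of early 2025, Ollama supports embedding models natively. Check whether codefuse-ai/f2llm-v2-0.6b is available in the Ollama library; if it is not, import the model from Hugging Face.

    # Pull from Hugging Face or local file (assumes this tag is available)
    ollama pull codefuse-ai/f2llm-v2-0.6b
    # Generate an embedding through the local Ollama REST API
    curl http://localhost:11434/api/embed -d '{"model": "codefuse-ai/f2llm-v2-0.6b", "input": "Your text here"}'
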
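For pipelines written in Python, the same Ollama server can be called over its REST API. The following is a minimal sketch using only the standard library. It assumes a local server on the default port and that the model tag mentioned above has actually been pulled; the helper names `build_embed_request` and `embed` are illustrative, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # default local Ollama endpoint


def build_embed_request(model: str, texts: list[str]) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(model: str, texts: list[str]) -> list[list[float]]:
    """Send the request and return one embedding per input text."""
    with urllib.request.urlopen(build_embed_request(model, texts)) as resp:
        return json.load(resp)["embeddings"]


# Usage (requires a running Ollama server with the model pulled;
# the tag's availability is an assumption):
# vectors = embed("codefuse-ai/f2llm-v2-0.6b", ["Your text here"])
```

Splitting request construction from the network call keeps the payload easy to inspect and test without a running server.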
Alternatively, use sentence-transformers directly:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('codefuse-ai/F2LLM-v2-0.6B')
    embeddings = model.encode(["Your text"], normalize_embeddings=True)

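Once embeddings are available, semantic search reduces to ranking documents by cosine similarity to the query vector. The sketch below shows that ranking step in plain Python. The three-dimensional vectors are toy stand-ins for real model output, and `rank_documents` is a hypothetical helper name; in practice the vectors would come from `model.encode(...)`.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_documents(query_vec: list[float],
                   doc_vecs: list[list[float]]) -> list[tuple[int, float]]:
    """Return (doc_index, cosine_similarity) pairs, best match first."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(d))) for d in doc_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

ranking = rank_documents(query, docs)
for index, score in ranking:
    print(index, round(score, 3))
```

When embeddings are already normalized, as with `normalize_embeddings=True`, the cosine similarity is just the dot product, so the normalization step can be skipped.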
---
F2LLM-v2-0.6B fills a specific gap: a small, locally runnable embedding model with world-class support for underrepresented languages. If you’re building a multilingual RAG system that must handle, say, Bengali, Ukrainian, and Yoruba alongside English, this is the model to test first. Its open license and MRL flexibility make it a practical choice for real-world deployments where storage and compute are constrained but quality cannot be sacrificed.
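The MRL (Matryoshka Representation Learning) flexibility mentioned above usually means an embedding can be shortened by keeping only its leading dimensions and renormalizing, trading some quality for storage. The sketch below illustrates that operation on a toy vector. Which truncated sizes this model supports is not stated here, so the target dimension is purely illustrative and `truncate_embedding` is a hypothetical helper.

```python
import math


def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 4-dimensional unit vector standing in for a full-size embedding.
full = [0.6, 0.0, 0.8, 0.0]
small = truncate_embedding(full, 2)
print(small)
```

Queries and documents must be truncated to the same dimension before comparison, and any quality loss should be checked on your own data before the smaller size is adopted.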