
Large-scale 8B multilingual embedder delivering near-flagship quality at lower inference cost.
A workable 7.6B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Cheapest current cloud rentals with at least 5 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
F2LLM-v2-8B is a general-purpose, multilingual embedding model developed by CodeFuse-AI (Ant Group). At 7.6B parameters (dense architecture), it is one of the largest publicly available text embedding models designed for local inference. Its primary use case is generating high-quality dense vector representations for retrieval-augmented generation (RAG), semantic search, clustering, and classification across more than 200 languages.
The model is the second generation of the F2LLM family, trained on a curated composite of 60 million publicly available high-quality examples. Unlike many embedding models that cap out at a few hundred million parameters, F2LLM-v2-8B reaches a scale that previously tended to require proprietary APIs. It delivers near-flagship embedding quality and, once quantized, fits on a single consumer GPU, which makes it a practical choice for developers who need state-of-the-art multilingual retrieval without paying per-query API bills.
F2LLM-v2-8B is released under the Apache 2.0 license. The family includes base (Preview) and instruct variants; for production use, the instruct version (codefuse-ai/F2LLM-v2-8B) is recommended.
F2LLM-v2-8B is a dense transformer that serves as a text encoder rather than a generative chat or completion model. It uses a bidirectional attention mechanism optimized for producing fixed-size embeddings from variable-length text inputs. The architecture is built on a Qwen3 backbone (as indicated in the training code), a model family that is decoder-only in its original form. V2 adds Matryoshka Representation Learning (MRL) and knowledge distillation as new features.
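Matryoshka Representation Learning trains a model so that a leading prefix of each embedding remains usable on its own, which lets you trade accuracy for storage. The sketch below shows the usual client-side step: keep the first `dim` components and rescale to unit length. The function name `truncate_embedding` and the toy vector are illustrative assumptions, and this page does not state which prefix sizes F2LLM-v2-8B was trained for, so check the model card before choosing one.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a real model output (assumption).
full = [3.0, 4.0, 12.0, 84.0]
short = truncate_embedding(full, 2)
print(short)  # [0.6, 0.8]
```

Renormalizing after truncation keeps cosine and dot-product scores comparable across vectors of the same reduced size.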
feature-extraction – designed for use with libraries like HuggingFace Transformers and Sentence Transformers
Because it is a dense model at 7.6B parameters, the weights alone occupy approximately 15.2 GB at FP16 precision (2 bytes per parameter). This is a key consideration for local deployment. However, quantization reduces the footprint significantly; see the Running Locally section for concrete numbers.
F2LLM-v2-8B is an embedding model, not a chat or completion model. Its strength lies in converting text into high-quality vectors that capture semantic, multilingual, and cross-lingual relationships.
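Primary use cases: retrieval-augmented generation (RAG), semantic search, clustering, and classification across languages.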
Concrete example: A developer building a support bot that serves users in Spanish, Arabic, and Vietnamese can use F2LLM-v2-8B to index their FAQ database once and serve queries in any language. The same pipeline works for documentation retrieval, product catalogs, or internal knowledge bases.
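The "index once, query in any language" pattern comes down to a nearest-neighbour search over stored vectors. The following is a minimal sketch using cosine similarity; the document IDs and 3-dimensional vectors are made up for illustration, and in a real pipeline both the FAQ vectors and the query vector would come from the embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query_vec, index, k=1):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy vectors; real ones would come from the embedding model.
faq_index = {
    "reset-password": [0.9, 0.1, 0.0],
    "refund-policy": [0.1, 0.9, 0.1],
    "shipping-times": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded Spanish or Arabic query

best = top_k(query_vec, faq_index, k=1)
print(best[0][0])  # reset-password
```

For more than a few thousand documents, a vector database or approximate nearest-neighbour index would normally replace the linear scan shown here.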
This model is sized to run on consumer hardware with careful quantization choices. Here are the real-world specs.
| Precision | Approximate VRAM | Notes |
|---|---|---|
| FP16 | ~15.2 GB | Full quality; requires a 16GB+ GPU (RTX 4080/4090, A4000, A5000, M4 Max with 24GB+) |
| Q4_K_M | ~4.8 GB | Recommended default – minimal quality loss, fits 8GB cards |
| Q5_K_M | ~5.6 GB | Slightly higher quality, still fits 8GB cards |
| Q8_0 | ~8.0 GB | Good trade-off if you have 8–12GB VRAM |
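The FP16 figure in the table follows directly from the parameter count, and the quantized rows can be sanity-checked the same way. This sketch computes weight memory from bits per weight and backs out the average bits per weight implied by each stated size. It covers weights only; activations, the tokenizer, and runtime overhead come on top, and the helper names are illustrative assumptions.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(n_params, size_gb):
    """Back out the average bits per weight implied by a stated size."""
    return size_gb * 1e9 * 8 / n_params


N_PARAMS = 7.6e9  # parameter count stated for F2LLM-v2-8B

print(round(weight_memory_gb(N_PARAMS, 16), 1))  # 15.2

# Sizes taken from the table above.
for name, gb in [("Q4_K_M", 4.8), ("Q5_K_M", 5.6), ("Q8_0", 8.0)]:
    print(name, round(implied_bits_per_weight(N_PARAMS, gb), 2))
```

The implied averages land a little above the nominal 4, 5, and 8 bits because these quantization formats store scaling metadata alongside the weights.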
Performance depends heavily on sequence length and batch size. For a typical batch of 1 query (512 tokens) with a single embedding:
For indexing large corpora, batching (e.g., batch size 32) raises throughput, roughly in proportion to batch size at first, until the GPU's compute is saturated or memory is exhausted.
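One way to index a large corpus is to feed the encoder fixed-size batches and collect the results. The sketch below takes the encoder as a callable so it can be run without downloading the model. `iter_batches`, `embed_corpus`, and the stand-in `fake_encode` are hypothetical helpers for illustration, not part of any library.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(texts, encode_fn, batch_size=32):
    """Embed `texts` batch by batch; `encode_fn` maps a list of strings to vectors."""
    vectors = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(encode_fn(batch))
    return vectors


def fake_encode(batch):
    """Stand-in encoder for illustration: one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]


corpus = [f"document {i}" for i in range(70)]
vectors = embed_corpus(corpus, fake_encode, batch_size=32)
print(len(vectors))  # 70
print([len(b) for b in iter_batches(corpus, 32)])  # [32, 32, 6]
```

With Sentence Transformers, `model.encode` could be passed as `encode_fn`. That method also accepts its own `batch_size` argument, so the outer loop is mainly useful for streaming results to disk or reporting progress on very large corpora.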
The fastest way to run F2LLM-v2-8B locally is via Ollama (if a GGUF conversion is available) or directly with the HuggingFace transformers library using the Sentence Transformers integration. The model card on HuggingFace provides a ready-to-run snippet:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("codefuse-ai/F2LLM-v2-8B")
    embeddings = model.encode(["Your text here"])

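For quantized GGUF versions, convert the model with llama.cpp's conversion script (convert_hf_to_gguf.py in current releases; older releases shipped convert.py), then run it with llama.cpp or Ollama.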
F2LLM-v2-8B competes in the league of large, open-source multilingual embedding models. The most direct alternatives are:
When to choose F2LLM-v2-8B: You need high multilingual quality, especially for low-resource languages, and have a GPU with at least 8GB VRAM. You want full control over the embedding pipeline and no API dependency.
When to choose a smaller model: If you are constrained to a CPU or an 8GB GPU and must prioritize throughput over maximum accuracy, models like BGE-M3 or multilingual-e5-large are more practical.
Trade-off summary: F2LLM-v2-8B delivers state-of-the-art multilingual embeddings at a size that is demanding but feasible for a single consumer GPU. It is a strong open choice for developers who need an on-premises alternative to large API-based embedders and can allocate 5–8 GB of VRAM to a quantized build.