
Flagship 14B multilingual embedding model from CodeFuse-AI; SOTA on 11/17 MTEB benchmarks.
A workable 14B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 9 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | On-Demand | 24 GB | $0.13 |
| NVIDIA GeForce RTX 3090 | Vast.ai | On-Demand | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
F2LLM-v2-14B is the flagship model in CodeFuse-AI’s F2LLM-v2 family, built by Ant Group. It is a dense 14-billion-parameter text embedding model, not a generative language model. Its purpose is to convert any piece of text—short queries, documents, or code—into a dense vector that captures semantic meaning. This makes it the backbone for retrieval-augmented generation (RAG), semantic search, clustering, and classification workflows that run entirely on your own hardware.
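As a rough illustration of how dense vectors drive semantic search, the sketch below ranks documents by cosine similarity to a query. The three-dimensional vectors are toy stand-ins for real model output, and the helper names `cosine` and `top_k` are hypothetical, not part of any F2LLM-v2 tooling.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]

# Toy vectors standing in for embeddings produced by the model.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [1.0, 0.1, 0.0]

ranking = top_k(query, docs, k=2)
for idx, score in ranking:
    print(f"doc {idx}: similarity {score:.3f}")
```

A production system would normally store the vectors in a vector index and use an array library rather than plain lists, but the ranking logic is the same.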
Where this model matters most is multilingual embedding. F2LLM-v2-14B achieves state-of-the-art results on 11 out of 17 MTEB benchmarks, covering English, European, Scandinavian, Indic, East Asian, and many mid- to low-resource languages. It competes directly with models like gte-Qwen2-7B-instruct, multilingual-e5-large-instruct, and bge-multilingual-gemma2, but at a larger parameter count and with broader language coverage (200+ languages). If your pipeline needs to embed text in Arabic, Vietnamese, Persian, or dozens of other languages without sacrificing English performance, this model is the current leader.
F2LLM-v2-14B is a dense transformer with no mixture-of-experts routing and no conditional computation, so every forward pass activates all 14B parameters. VRAM consumption is therefore predictable: at half precision (FP16 or BF16, 2 bytes per parameter), the weights alone require ~28 GB of memory. With typical overhead for activations and batch processing, expect ~32 GB for single-sequence inference.
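The weight-memory figure is simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces it and extends it to lower bit widths. The 8-bit and 4-bit rows assume plain fixed-width weights; real quantization formats carry extra metadata, so treat those numbers as lower bounds. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS = 14e9  # 14B parameters, as stated in the text

fp16_gb = weight_memory_gb(PARAMS, 16)  # 2 bytes per parameter
int8_gb = weight_memory_gb(PARAMS, 8)   # assumption: plain 8-bit weights
int4_gb = weight_memory_gb(PARAMS, 4)   # assumption: plain 4-bit weights

for label, gb in [("FP16", fp16_gb), ("8-bit", int8_gb), ("4-bit", int4_gb)]:
    print(f"{label}: ~{gb:.0f} GB for weights, before activation overhead")
```

Activation memory grows with batch size and sequence length, so the headroom on top of these figures depends on your workload.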
The model uses a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation (as described in the F2LLM-v2 technical report). Matryoshka learning allows the model to output variable-length embeddings (e.g., 256, 512, 1024 dimensions) while maintaining strong performance at shorter lengths—useful when you need to reduce storage or search latency. The architecture is derived from the F2LLM-v2-Preview base model and fine-tuned for instruction-following embedding tasks.
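With matryoshka-trained models, shortening an embedding usually means keeping the leading dimensions and re-normalizing. The sketch below shows that step on a toy vector. It assumes F2LLM-v2 follows this standard prefix-truncation convention, which the model card should confirm; `truncate_and_normalize` is a hypothetical helper.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if dim > len(embedding):
        raise ValueError("dim exceeds the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

# Toy 4-dimensional unit vector standing in for a full-size embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_normalize(full, 2)
print(short)
```

Re-normalizing matters because cosine and dot-product search assume comparable vector lengths. Sentence Transformers also exposes a `truncate_dim` argument on `SentenceTransformer` that performs the truncation at encode time.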
Context length is not specified by the provider; embedding models in this class typically support 512 to 8192 tokens. The size of the multilingual training set (60 million curated samples) does not by itself indicate how well the model handles long documents, so check the model configuration and benchmark on your specific retrieval corpus.
F2LLM-v2-14B is an embedding model, not a chatbot. Its primary capability is dense vector representation, optimized for semantic similarity and retrieval. Concrete use cases include retrieval-augmented generation (RAG), semantic search, clustering, and classification.
The model supports instruction prefixes (e.g., "Represent this sentence for retrieval: {text}") via the Sentence Transformers library, which can improve task-specific accuracy.
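A minimal sketch of instruction-prefixed encoding follows. It relies on the `prompt` argument of `SentenceTransformer.encode`, which prepends the given string to each input. The instruction text is the example from the paragraph above, not necessarily the wording the model was trained with, and the repository id in the commented usage is an assumption to verify against the model card.

```python
INSTRUCTION = "Represent this sentence for retrieval: "

def preview_inputs(texts, prompt=INSTRUCTION):
    """Show the strings the model sees once the prompt is prepended."""
    return [prompt + text for text in texts]

def embed_queries(model, texts, prompt=INSTRUCTION):
    """Encode texts with an instruction prefix.

    `model` is expected to be a sentence_transformers.SentenceTransformer;
    its encode() method accepts `prompt` and `normalize_embeddings`.
    """
    return model.encode(texts, prompt=prompt, normalize_embeddings=True)

print(preview_inputs(["how do I rotate an API key?"]))

# Real usage (downloads the weights; the repository id is an assumption):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("codefuse-ai/F2LLM-v2-14B")
# vectors = embed_queries(model, ["how do I rotate an API key?"])
```

Documents are often encoded without a prefix while queries get one; which side needs the instruction is model-specific, so follow the model card.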
Hardware is the main constraint. At 14B parameters, F2LLM-v2-14B is on the heavy side for a local embedding server. Here’s what you need:
VRAM requirements by quantization:
Consumer GPUs and expected throughput:
Fastest way to get started: Use Ollama with the F2LLM-v2-14B tag (if available) or load the model via Sentence Transformers directly from Hugging Face. The official model card provides a code snippet for sentence-transformers usage. For production, consider vLLM with embedding endpoints or a custom ONNX export.
Hardware requirements summary:
vs. gte-Qwen2-7B-instruct (7B dense, multilingual)
F2LLM-v2-14B outperforms it on all language-specific MTEB leaderboards, especially for mid-/low-resource languages. But it requires roughly twice the VRAM at equivalent quantization. If you’re deploying on consumer hardware with <16 GB VRAM, gte-Qwen2-7B-instruct (Q4_K_M ~5 GB) is more practical. Choose F2LLM-v2-14B when quality on languages like Persian, Vietnamese, or Indonesian is critical.
vs. multilingual-e5-large-instruct (~0.6B parameters, high quality)
multilingual-e5-large-instruct is far smaller and fits on almost any modern GPU, but F2LLM-v2-14B is stronger across its 200+ language coverage. The tradeoff is clear: if your application is English-centric or covers only a few European languages, e5-large is sufficient and cheaper to run. For true global coverage, F2LLM-v2-14B is the current SOTA.
Vector dimension flexibility: matryoshka learning is part of the F2LLM-v2 training pipeline, so 256-dimensional vectors are intended to work with limited quality loss, which reduces storage and search costs. Check the gte-Qwen2 and e5 model cards before truncating their embeddings the same way.
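To make the storage saving concrete, the sketch below computes raw index size for a hypothetical corpus at several embedding widths. The corpus size, the float32 storage format, and the function name `index_size_gb` are all assumptions for illustration; the model's full output dimension is not stated here, so only example widths are used.

```python
def index_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage in decimal gigabytes, excluding index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9

NUM_VECTORS = 10_000_000  # hypothetical corpus of 10 million chunks

sizes = {dim: index_size_gb(NUM_VECTORS, dim) for dim in (256, 512, 1024)}
for dim, gb in sizes.items():
    print(f"{dim} dims: ~{gb:.2f} GB as float32")
```

Storage scales linearly with dimension, so 256-dimensional vectors take a quarter of the space of 1024-dimensional ones, and brute-force search cost drops by the same factor.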