CodeFuse-AI (Ant Group)

F2LLM-v2-14B

Flagship 14B multilingual embedding model from CodeFuse-AI; SOTA on 11/17 MTEB benchmarks.

14B params · Dense


Our Take

Best for: Open-source text embedding workloads

A workable 14B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the per-task benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 14B
Active Params: 13.2B
Architecture: Dense
Provider: CodeFuse-AI (Ant Group)
Download Size: 279.8 GB

Community

Monthly Downloads: 85.4K
Likes: 12
Last Updated: 1 month ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

Retrieval: 66.5
Classification: 73.0
Clustering: 60.9
STS: 77.0

MBA Open Score

Overall: 53.6 CC
Benchmark (60% weight): 69.3
Popularity (25% weight): 43.3
Efficiency (15% weight): 7.4
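The percentages next to the three components suggest that the overall score is a weighted average. The page does not state the formula, so the sketch below is only a check under that assumption; `weighted_score` is a hypothetical helper name.

```python
# Assumption: the composite is a plain weighted sum of the three components.
# Values are (displayed score, displayed weight) as listed on this page.
COMPONENTS = {
    "benchmark": (69.3, 0.60),
    "popularity": (43.3, 0.25),
    "efficiency": (7.4, 0.15),
}


def weighted_score(components):
    """Return the weighted sum of (score, weight) pairs; weights must sum to 1."""
    total_weight = sum(weight for _, weight in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(score * weight for score, weight in components.values())


composite = weighted_score(COMPONENTS)
print(f"weighted sum of displayed components: {composite:.3f}")
```

Under this assumption the sum comes to about 53.5, slightly below the displayed 53.6. The gap is presumably due to rounding in the displayed component scores, but the actual formula may differ.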

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	8.6 GB
Acer Veriton GN100 AI Mini	Acer	SS	8.6 GB
AMD Instinct MI300X	AMD	SS	8.6 GB
AMD Instinct MI325X	AMD	SS	8.6 GB
AMD Instinct MI355X	AMD	SS	8.6 GB
AMD Radeon RX 7800 XT	AMD	SS	8.6 GB
AMD Radeon RX 7900 XT	AMD	SS	8.6 GB
AMD Radeon RX 7900 XTX	AMD	SS	8.6 GB
AMD Radeon RX 9070	AMD	SS	8.6 GB
AMD Radeon RX 9070 XT	AMD	SS	8.6 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	8.6 GB
Apple M4	Apple	SS	8.6 GB
Apple M4 Max (40-core GPU)	Apple	SS	8.6 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	8.6 GB
Apple M5	Apple	SS	8.6 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	8.6 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	8.6 GB
Apple Mac Mini (M1, 2020)	Apple	SS	8.6 GB
Apple Mac Mini (M2, 2023)	Apple	SS	8.6 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	8.6 GB
Apple Mac Mini (M4, 2024)	Apple	SS	8.6 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	8.6 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	8.6 GB
Apple Mac Studio (M1 Ultra, 2022)	Apple	SS	8.6 GB
Apple Mac Studio (M2 Max, 2023)	Apple	SS	8.6 GB


Rent in the Cloud

Cheapest current cloud rentals with at least 9 GB VRAM, refreshed hourly.

GPU	Provider	Tier	VRAM	Cost / GPU-hour
NVIDIA L4	Vast.ai	Spot	24 GB	$0.03
NVIDIA L4	Vast.ai	On-Demand	24 GB	$0.04
NVIDIA GeForce RTX 5060 Ti	Vast.ai	Spot	16 GB	$0.09
NVIDIA GeForce RTX 5060 Ti	Vast.ai	On-Demand	16 GB	$0.10
NVIDIA GeForce RTX 5070 Ti	Vast.ai	Spot	16 GB	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
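One way to account for restarts when comparing the two tiers is to divide the hourly rate by the share of paid time that produces useful work. The sketch below assumes that simple model; the 20% rework figure is a made-up illustration, and both function names are hypothetical.

```python
def effective_hourly_cost(rate_per_hour, rework_fraction=0.0):
    """Cost per hour of useful work when rework_fraction of paid time is lost to restarts."""
    if not 0.0 <= rework_fraction < 1.0:
        raise ValueError("rework_fraction must be in [0, 1)")
    return rate_per_hour / (1.0 - rework_fraction)


def break_even_rework_fraction(spot_rate, on_demand_rate):
    """Rework fraction at which spot stops being cheaper than on-demand."""
    return 1.0 - spot_rate / on_demand_rate


# NVIDIA L4 rates from the table above; 20% lost time is a hypothetical value.
spot = effective_hourly_cost(0.03, rework_fraction=0.20)
on_demand = effective_hourly_cost(0.04)
threshold = break_even_rework_fraction(0.03, 0.04)
print(f"spot: ${spot:.4f}/h, on-demand: ${on_demand:.4f}/h, break-even: {threshold:.0%}")
```

With these L4 prices, spot stays cheaper as long as less than about a quarter of the paid time is lost to interruptions.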


About This Model

Overview

F2LLM-v2-14B is the flagship model in CodeFuse-AI’s F2LLM-v2 family, built by Ant Group. It is a dense 14-billion-parameter text embedding model, not a generative language model. Its purpose is to convert any piece of text—short queries, documents, or code—into a dense vector that captures semantic meaning. This makes it the backbone for retrieval-augmented generation (RAG), semantic search, clustering, and classification workflows that run entirely on your own hardware.

Where this model matters most is multilingual embedding. F2LLM-v2-14B achieves state-of-the-art results on 11 out of 17 MTEB benchmarks, covering English, European, Scandinavian, Indic, East Asian, and many mid- to low-resource languages. It competes directly with models like gte-Qwen2-7B-instruct, multilingual-e5-large-instruct, and bge-multilingual-gemma2, but at a larger parameter count and with broader language coverage (200+ languages). If your pipeline needs to embed text in Arabic, Vietnamese, Persian, or dozens of other languages without sacrificing English performance, this model is the current leader.

Architecture & Technical Details

F2LLM-v2-14B is a dense transformer with no mixture-of-experts routing and no conditional computation. Every forward pass activates all 14B parameters, so VRAM consumption is predictable: at half precision (FP16, 2 bytes per parameter), the weights alone require ~28 GB of memory. With typical overhead for activations and batch processing, expect ~32 GB for single-sequence inference.
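The ~28 GB figure follows from multiplying the parameter count by the bytes per parameter. A minimal sketch of that arithmetic, using 1 GB = 10^9 bytes and covering weights only (no activations or runtime overhead); `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), excluding runtime overhead."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


fp16_gb = weight_memory_gb(14, 16)  # 14B parameters at 16 bits each
print(f"FP16 weights: ~{fp16_gb:.0f} GB")
```

The same function gives a first estimate for other precisions, though real quantized files mix bit widths, so their size will differ somewhat from a flat bits-per-weight calculation.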

The model uses a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation (as described in the F2LLM-v2 technical report). Matryoshka learning allows the model to output variable-length embeddings (e.g., 256, 512, 1024 dimensions) while maintaining strong performance at shorter lengths—useful when you need to reduce storage or search latency. The architecture is derived from the F2LLM-v2-Preview base model and fine-tuned for instruction-following embedding tasks.
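With matryoshka-trained embeddings, a shorter vector is usually obtained by keeping the leading components and re-normalizing. The sketch below shows that generic technique on a toy vector; it is not taken from the F2LLM-v2 report, and `truncate_embedding` is a hypothetical helper.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if dim <= 0 or dim > len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a real embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)
print(short)
```

Re-normalizing matters when the downstream index uses dot product as a stand-in for cosine similarity, since the truncated vector no longer has unit length on its own.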

Context length is not specified by the provider; embedding models in this class typically support 512–8192 tokens. The model was trained on multilingual data (60 million curated samples), but training-set size says nothing about the maximum input length, so check the model configuration for the actual limit and benchmark long documents on your own retrieval corpus.

Capabilities & Use Cases

F2LLM-v2-14B is an embedding model, not a chatbot. Its primary capability is dense vector representation, optimized for semantic similarity and retrieval. Concrete use cases:

Multilingual RAG pipelines – index documents in Arabic, Chinese, Russian, French, Hindi, Korean, etc., and retrieve them with queries in any of 200+ languages. The model is strongest relative to alternatives on mid- and low-resource languages, which many embedding models cover poorly.
Cross-lingual search – query in English to find relevant content in German, Japanese, or Swahili. The model’s training data emphasized such pairs.
Text clustering and classification – group customer support tickets, research papers, or product descriptions across languages.
Code retrieval – although not a code-specific model, its training includes code-related data from CodeFuse’s broader research (Code-LLM family). Useful for embedding code comments, API docs, or issue descriptions.

The model supports instruction prefixes (e.g., "Represent this sentence for retrieval: {text}") via the Sentence Transformers library, which can improve task-specific accuracy.
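To make the retrieval flow concrete, the sketch below prepends the example instruction to a query and ranks documents by cosine similarity. The vectors are toy values rather than real model output, and `with_instruction`, `cosine`, and `rank` are hypothetical helpers; the exact prompt format the model expects should be taken from its model card.

```python
import math


def with_instruction(text, instruction="Represent this sentence for retrieval: "):
    """Prepend an instruction prefix to the text before embedding it."""
    return f"{instruction}{text}"


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, doc) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


prompt = with_instruction("Where is the invoice stored?")
# Toy 2-dimensional vectors standing in for real embeddings.
order = rank([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(prompt)
print(order)
```

In a cross-lingual setup the query and the documents can be in different languages; the ranking step is unchanged because it only sees vectors.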

Running F2LLM-v2-14B Locally

This is where hardware considerations matter. At 14B parameters, F2LLM-v2-14B is a mid-range heavyweight for local embedding servers. Here’s what you need:

VRAM requirements by quantization:

FP16 (half precision) – ~28 GB model weights + ~4 GB overhead. Minimum 32 GB VRAM. Feasible on NVIDIA A6000 (48 GB), A100 (40/80 GB), dual RTX 3090/4090 setups via model parallelism, Apple silicon with enough unified memory, or cloud GPUs.
Q4_K_M (4-bit) – ~8 GB model weights + ~2 GB overhead. Minimum 10–12 GB VRAM. This makes it runnable on a single RTX 4090 (24 GB), RTX 4080 (16 GB), or M4 Max (for example, the 48 GB unified-memory configuration). Q4_K_M is the recommended quantization for most practitioners: a good tradeoff between memory footprint, speed, and quality loss (~1–2% retrieval score drop vs FP16).
Q3_K_S / Q2_K – ~6–7 GB weights. Fits on RTX 3080 (10/12 GB) or M3 Max. Quality degradation becomes noticeable for low-resource languages; avoid for production in multilingual settings.
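The list above can be turned into a simple lookup that returns the highest-quality format fitting a given VRAM budget. The sketch uses the weight and overhead figures from the list; the ~2 GB overhead for Q3_K_S is an assumption, since the list gives only its weight size, and `pick_format` is a hypothetical helper.

```python
# (name, weights_gb, overhead_gb), ordered from highest to lowest quality.
# FP16 and Q4_K_M figures are from the list above; the Q3_K_S overhead is assumed.
FORMATS = [
    ("FP16", 28.0, 4.0),
    ("Q4_K_M", 8.0, 2.0),
    ("Q3_K_S", 7.0, 2.0),
]


def pick_format(vram_gb):
    """Return the first (highest-quality) format whose weights plus overhead fit, else None."""
    for name, weights_gb, overhead_gb in FORMATS:
        if weights_gb + overhead_gb <= vram_gb:
            return name
    return None


for vram in (48, 24, 16, 8):
    print(f"{vram} GB VRAM -> {pick_format(vram)}")
```

This ignores batch size and sequence length, both of which raise the overhead, so treat the result as a starting point rather than a guarantee.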

Consumer GPUs and expected throughput:

RTX 4090 (24 GB) – Q4_K_M: 100–150 tokens/second (single sequence). With batch size 4–8, you can serve low-latency embeddings for moderate traffic.
RTX 4080 (16 GB) – Q4_K_M or Q3_K_S: 70–100 tokens/s. Sufficient for development and small-scale production.
M4 Max (48 GB unified) – FP16 possible, but memory bandwidth may cap at 60–80 tokens/s. Q4_K_M runs at similar speed.
Quadro RTX 8000 / A6000 – FP16 with headroom for large batches.
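Throughput in tokens per second translates into indexing time once you know roughly how long your documents are. A minimal sketch of that conversion; the corpus size and average document length are hypothetical inputs, and `hours_to_embed` is a hypothetical helper.

```python
def hours_to_embed(num_docs, avg_tokens_per_doc, tokens_per_second):
    """Estimate wall-clock hours to embed a corpus at a steady token throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / tokens_per_second / 3600


# Hypothetical corpus: 1,000 documents of ~360 tokens each,
# at the lower RTX 4090 figure quoted above (100 tokens/s).
estimate = hours_to_embed(1_000, 360, 100)
print(f"~{estimate:.1f} hours")
```

Batching usually raises aggregate throughput well above the single-sequence figure, so measure on your own hardware before sizing a deployment from this estimate.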

Fastest way to get started: Use Ollama with the F2LLM-v2-14B tag (if available) or load the model via Sentence Transformers directly from Hugging Face. The official model card provides a code snippet for sentence-transformers usage. For production, consider vLLM with embedding endpoints or a custom ONNX export.
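A rough sketch of the Sentence Transformers route follows. The repository id `codefuse-ai/F2LLM-v2-14B` is an assumption based on the model name, and the official model card snippet takes precedence if it differs. `load_embedder` and `embed` are hypothetical wrappers around the library's `SentenceTransformer` class and its `encode` method.

```python
def load_embedder(model_id="codefuse-ai/F2LLM-v2-14B", truncate_dim=None):
    """Load the model in half precision. The default model_id is an assumed repository name."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(
        model_id,
        model_kwargs={"torch_dtype": torch.float16},
        truncate_dim=truncate_dim,  # e.g. 256 for shorter matryoshka vectors
    )


def embed(model, texts, prompt=None, batch_size=8):
    """Encode texts into L2-normalized embeddings, optionally with an instruction prompt."""
    return model.encode(
        texts,
        prompt=prompt,
        batch_size=batch_size,
        normalize_embeddings=True,
    )


# Usage (downloads the full weights, so it is left commented out):
# model = load_embedder(truncate_dim=256)
# vectors = embed(model, ["semantic search"], prompt="Represent this sentence for retrieval: ")
```

Loading in FP16 needs roughly the 32 GB of memory discussed above; on smaller GPUs a quantized build through another runtime is the more realistic option.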

Hardware requirements summary:

Minimum: RTX 3080 with 10 GB VRAM + Q3_K quantization
Recommended: RTX 4090 or A6000 with Q4_K_M
Ideal: A100 80 GB for FP16 with large batch inference

How It Compares

vs. gte-Qwen2-7B-instruct (7B dense, multilingual)

F2LLM-v2-14B outperforms it on all language-specific MTEB leaderboards, especially for mid-/low-resource languages. But it requires roughly twice the VRAM at equivalent quantization. If you’re deploying on consumer hardware with <16 GB VRAM, gte-Qwen2-7B-instruct (Q4_K_M ~5 GB) is more practical. Choose F2LLM-v2-14B when quality on languages like Persian, Vietnamese, or Indonesian is critical.

vs. multilingual-e5-large-instruct (~0.6B parameters, high quality)

e5-large is far smaller (fits easily on any GPU), but F2LLM-v2-14B dominates on 200-language benchmarks. The tradeoff is clear: if your application is English-centric or covers only a few European languages, e5-large is sufficient and cheaper to run. For true global coverage, F2LLM-v2-14B is the current SOTA.

Vector dimension flexibility – F2LLM-v2-14B is trained with matryoshka learning, so you can truncate its output to 256-dim vectors with limited quality loss, reducing storage and search costs. Check the model cards for gte-Qwen2 and e5 before assuming their embeddings can be truncated the same way.
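The storage saving from shorter vectors is easy to estimate for a flat float32 index. In the sketch below the corpus size and the 1024-dim baseline are hypothetical inputs (1024 is one of the example lengths mentioned earlier, not the model's confirmed native dimension), and `index_size_gb` is a hypothetical helper.

```python
def index_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw size in GB (1 GB = 1e9 bytes) of a flat index with no compression or metadata."""
    return num_vectors * dim * bytes_per_value / 1e9


# Hypothetical corpus of 10 million vectors stored as float32.
size_1024 = index_size_gb(10_000_000, 1024)
size_256 = index_size_gb(10_000_000, 256)
print(f"1024-dim: {size_1024:.2f} GB, 256-dim: {size_256:.2f} GB")
```

Going from 1024 to 256 dimensions cuts raw storage by a factor of four, and brute-force search cost drops by about the same factor.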

