CodeFuse-AI (Ant Group)

F2LLM-v2-8B

Large-scale 8B multilingual embedder delivering near-flagship quality at lower inference cost.

7.6B params · Dense


Our Take

Best for: Open-source text embedding workloads

A workable 7.6B-parameter dense embedding model from CodeFuse-AI (Ant Group). Treat the modality benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 7.6B
Active Params: 6.9B
Architecture: Dense
Provider: CodeFuse-AI (Ant Group)
Download Size: 15.1 GB

Community

Monthly Downloads: 828
Likes: 7
Last Updated: 1 month ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

Task	Score
Retrieval	66.2
Classification	71.9
Clustering	60.6
STS	76.5

MBA Open Score: 50.7 (CC)

Component	Weight	Score
Benchmark	60%	68.8
Popularity	25%	15.6
Efficiency	15%	37.0
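
The component weights and scores listed above are consistent with a plain weighted sum, which reproduces the published 50.7. The sketch below assumes that is how the composite is formed; the function name `composite_score` is illustrative and not part of any published scoring code.

```python
def composite_score(components):
    """Weighted sum of (weight, score) pairs; weights are fractions summing to 1."""
    total_weight = sum(weight for weight, _ in components)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weight * score for weight, score in components)


# Weights and scores copied from the listing above.
f2llm_components = [
    (0.60, 68.8),  # Benchmark
    (0.25, 15.6),  # Popularity
    (0.15, 37.0),  # Efficiency
]

print(round(composite_score(f2llm_components), 1))  # 50.7
```

Because popularity carries a quarter of the weight, a model with strong benchmarks but few downloads still lands at a modest composite.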

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	4.7 GB
Acer Veriton GN100 AI Mini	Acer	SS	4.7 GB
AMD Instinct MI300X	AMD	SS	4.7 GB
AMD Instinct MI325X	AMD	SS	4.7 GB
AMD Instinct MI355X	AMD	SS	4.7 GB
AMD Radeon RX 7600 8GB	AMD	SS	4.7 GB
AMD Radeon RX 7700 XT	AMD	SS	4.7 GB
AMD Radeon RX 7800 XT	AMD	SS	4.7 GB
AMD Radeon RX 7900 XT	AMD	SS	4.7 GB
AMD Radeon RX 7900 XTX	AMD	SS	4.7 GB
AMD Radeon RX 9070	AMD	SS	4.7 GB
AMD Radeon RX 9070 XT	AMD	SS	4.7 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	4.7 GB
Apple M4	Apple	SS	4.7 GB
Apple M4 Max (40-core GPU)	Apple	SS	4.7 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	4.7 GB
Apple M5	Apple	SS	4.7 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	4.7 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	4.7 GB
Apple Mac Mini (M1, 2020)	Apple	SS	4.7 GB
Apple Mac Mini (M2, 2023)	Apple	SS	4.7 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	4.7 GB
Apple Mac Mini (M4, 2024)	Apple	SS	4.7 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	4.7 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	4.7 GB


Rent in the Cloud

Cheapest current cloud rentals with at least 5 GB VRAM, refreshed hourly.

GPU	Provider	Tier	VRAM	Cost / GPU-hour
NVIDIA GeForce RTX 5070 Ti	Vast.ai	Spot	16 GB	$0.11
NVIDIA GeForce RTX 3070	RunPod	Community	8 GB	$0.13
NVIDIA GeForce RTX 3070	RunPod	Spot	8 GB	$0.13
NVIDIA GeForce RTX 5090	Vast.ai	Spot	32 GB	$0.13
NVIDIA GeForce RTX 4090	Vast.ai	Spot	24 GB	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
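
To turn an hourly rate into a budget for a one-off indexing job, divide the corpus size by the throughput you measure and multiply by the price. The sketch below is a back-of-envelope helper: the function name, the 100 embeddings per second figure, and the `restart_overhead` allowance for interrupted spot instances are all hypothetical placeholders, not measured values.

```python
def indexing_cost_usd(num_texts, embeddings_per_second, price_per_gpu_hour,
                      restart_overhead=0.0):
    """Estimated rental cost for embedding `num_texts` items on one GPU.

    restart_overhead is the extra fraction of runtime lost to spot
    interruptions (0.2 means 20% more GPU time).
    """
    if embeddings_per_second <= 0:
        raise ValueError("embeddings_per_second must be positive")
    hours = num_texts / embeddings_per_second / 3600
    return hours * (1.0 + restart_overhead) * price_per_gpu_hour


# One million texts at an assumed 100 embeddings/s on a $0.13/hour GPU.
print(round(indexing_cost_usd(1_000_000, 100, 0.13), 2))  # 0.36
```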


About This Model

Overview

F2LLM-v2-8B is a general-purpose, multilingual embedding model developed by CodeFuse-AI (Ant Group). At 7.6B parameters (dense architecture), it is one of the largest publicly available text embedding models designed for local inference. Its primary use case is generating high-quality dense vector representations for retrieval-augmented generation (RAG), semantic search, clustering, and classification across more than 200 languages.

The model is the second generation of the F2LLM family, trained on a curated composite of 60 million publicly available high-quality examples. Unlike many embedding models that top out at a few hundred million parameters, F2LLM-v2-8B operates at a scale that previously required proprietary APIs. It delivers near-flagship embedding quality at an inference cost that fits on a single consumer GPU, which makes it a practical choice for developers who need strong multilingual retrieval without per-query API bills.

F2LLM-v2-8B is released under the Apache 2.0 license. The family includes base (Preview) and instruct variants; for production use, the instruct version (codefuse-ai/F2LLM-v2-8B) is recommended.

Architecture & Technical Details

F2LLM-v2-8B is a dense transformer that outputs embeddings rather than generated text. It is built on a Qwen3 backbone (as indicated in the training code), a decoder-only architecture adapted here to produce fixed-size embeddings from variable-length text inputs. V2 adds support for Matryoshka Representation Learning (MRL) and knowledge distillation.

Parameters: 7.6B (dense, all active at inference)
Context length: Not officially specified, but practical testing suggests at least 8,192 tokens (consistent with typical Qwen3 limits)
Embedding dimension: 4096 (default), though MRL enables truncation to lower dimensions (e.g., 1024, 2048) for storage or speed trade-offs without retraining
Supported languages: 200+, with particular emphasis on mid- and low-resource languages (e.g., Swahili, Catalan, Burmese, Tajik)
Pipeline tag: feature-extraction – designed for use with libraries like Hugging Face Transformers and Sentence Transformers
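
The MRL truncation mentioned above amounts to keeping the leading components of a vector and rescaling it to unit length so cosine similarities stay comparable. The sketch below shows that step on a toy vector in plain Python; `truncate_embedding` is an illustrative helper, not a function from the model's own code.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


# Toy 4-dimensional stand-in for a real 4096-dimensional embedding.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_embedding(full, 2)
print(short)  # [0.6, 0.8]
```

With real embeddings the same call would be `truncate_embedding(vec, 1024)`, which cuts index storage to a quarter of the 4096-dimensional default.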

Because it is a dense model with 7.6B parameters, the weights alone occupy approximately 15.2 GB at FP16 precision (7.6B parameters × 2 bytes each), and activations add to that at inference time. This is a key consideration for local deployment. Quantization reduces the footprint significantly; see the Running Locally section for concrete numbers.
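
The FP16 figure follows from simple arithmetic: parameter count times bits per weight, divided by eight. The helper below is a rough sketch of that calculation for the weights only. It ignores activations and runtime buffers, and real quantized formats store per-block scale factors, so their effective bits per weight sit somewhat above the nominal value.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


print(round(weight_memory_gb(7.6e9, 16), 1))  # 15.2  (FP16)
print(round(weight_memory_gb(7.6e9, 8), 1))   # 7.6   (nominal 8-bit)
print(round(weight_memory_gb(7.6e9, 4), 1))   # 3.8   (nominal 4-bit)
```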

Capabilities & Use Cases

F2LLM-v2-8B is an embedding model, not a chat or completion model. Its strength lies in converting text into high-quality vectors that capture semantic, multilingual, and cross-lingual relationships.

Primary use cases:

Multilingual semantic search – Index documents in 200+ languages and query in any of them. The model’s training data deliberately included low-resource languages, making it a strong choice for global products.
Cross-lingual retrieval (CLIR) – Retrieve English documents from a Spanish query, or French documents from a Chinese query. The model learns language-agnostic representations.
Dense retrieval for RAG – Generate embeddings for a local knowledge base and retrieve relevant chunks with high recall. Running locally, it can compete with API-based embedders like OpenAI text-embedding-3-large while avoiding the network round trip of an API call.
Clustering and classification – Use the embeddings as features for downstream tasks (e.g., topic modeling, intent classification) without fine-tuning.

Concrete example: A developer building a support bot that serves users in Spanish, Arabic, and Vietnamese can use F2LLM-v2-8B to index their FAQ database once and serve queries in any language. The same pipeline works for documentation retrieval, product catalogs, or internal knowledge bases.
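
The retrieval step in a pipeline like this is a nearest-neighbor lookup by cosine similarity. The sketch below uses hand-written 3-dimensional toy vectors in place of real model output, and the FAQ identifiers are hypothetical. In practice both the index and the query vector would come from the embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for FAQ embeddings computed once at index time.
faq_index = {
    "reset-password": [0.9, 0.1, 0.0],
    "refund-policy": [0.1, 0.9, 0.1],
    "shipping-times": [0.0, 0.2, 0.9],
}
# Stand-in for the embedding of a user question in any supported language.
query = [0.8, 0.2, 0.1]

print([doc_id for doc_id, _ in top_k(query, faq_index)])
# ['reset-password', 'refund-policy']
```

For more than a few thousand documents, a vector index would replace the linear scan, but the similarity measure stays the same.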

Running F2LLM-v2-8B Locally

This model is sized to run on consumer hardware with careful quantization choices. The figures below are approximate.

VRAM Requirements

Precision	Approximate VRAM	Notes
FP16	~15.2 GB	Full quality; requires a 16GB+ GPU (RTX 4080/4090, A4000, A5000, M4 Max with 24GB+)
Q4_K_M	~4.8 GB	Recommended default – minimal quality loss, fits 8GB cards
Q5_K_M	~5.6 GB	Slightly higher quality, still fits 8GB cards
Q8_0	~8.0 GB	Good trade-off if you have 8–12GB VRAM

Consumer Hardware Recommendations

RTX 4090 (24GB) – Run FP16 or Q8_0 comfortably; multiple parallel encoding streams possible.
RTX 4080 / 4070 Ti Super (16GB) – FP16 is tight but works; Q8_0 is safe.
RTX 4060 / 3060 (8–12GB) – Use Q4_K_M or Q5_K_M. Expect good throughput for batch encoding.
Apple Silicon (M4 Max 24–48GB) – FP16 runs in unified memory; performance is limited by memory bandwidth (around 200–400 embeddings per second depending on sequence length).
MacBook Pro M1/M2 (8–16GB) – Only feasible with Q4_K_M. Not ideal for high-throughput production, but fine for prototyping.

Expected Performance

Throughput depends heavily on sequence length and batch size. The figures below are rough estimates for 512-token inputs rather than measured benchmarks:

RTX 4090 (FP16): ~500–800 embeddings per second
RTX 4090 (Q4_K_M): ~800–1200 embeddings per second
M4 Max (FP16): ~200–400 embeddings per second
RTX 3060 12GB (Q4_K_M): ~200–300 embeddings per second

For indexing large corpora, batching (e.g., batch size 32) raises throughput until the GPU becomes compute-bound or memory is exhausted.
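
Since throughput varies with hardware, sequence length, and batch size, it is worth measuring on your own setup. The sketch below is a minimal timing harness; `measure_throughput` and `dummy_encode` are illustrative names, and the dummy encoder is a placeholder to be swapped for a real encode call.

```python
import time


def measure_throughput(encode_fn, texts, batch_size=32):
    """Encode `texts` in batches and return embeddings per second."""
    start = time.perf_counter()
    count = 0
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        count += len(encode_fn(batch))
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def dummy_encode(batch):
    # Placeholder encoder: returns one fixed-size vector per input text.
    return [[0.0] * 8 for _ in batch]


texts = ["sample text"] * 100
for bs in (1, 8, 32):
    rate = measure_throughput(dummy_encode, texts, batch_size=bs)
    print(f"batch_size={bs}: {rate:.0f} embeddings/s")
```

Running the loop across several batch sizes shows where throughput stops improving, which is usually the batch size to use for indexing.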

Getting Started

The fastest way to run F2LLM-v2-8B locally is via Ollama (if a GGUF conversion is available) or directly through the Sentence Transformers library, which builds on Hugging Face Transformers. The model card on Hugging Face provides a ready-to-run snippet:

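from sentence_transformers import SentenceTransformer

# Downloads the weights from Hugging Face on first use
model = SentenceTransformer("codefuse-ai/F2LLM-v2-8B")
# encode() takes a list of strings and returns one embedding per string
embeddings = model.encode(["Your text here"])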

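For quantized GGUF versions, convert the model with llama.cpp's conversion script (convert_hf_to_gguf.py in current releases; older releases shipped convert.py), then run it with llama.cpp or Ollama.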

How It Compares

F2LLM-v2-8B competes in the league of large, open-source multilingual embedding models. The most direct alternatives are:

intfloat/multilingual-e5-large (560M parameters) – Much smaller, with far lower VRAM needs (about 1.1GB of weights at FP16). Good for resource-constrained setups, but quality on mid/low-resource languages is noticeably lower. F2LLM-v2-8B outperforms it on cross-lingual retrieval benchmarks (per the F2LLM paper).
BAAI/bge-m3 (568M parameters) – Another strong multilingual embedder with dense retrieval and ColBERT support. BGE-M3 is more memory-efficient (2–3GB FP16) and supports dense+sparse hybrid search. F2LLM-v2-8B wins on raw representation quality and language coverage.
Cohere Embed V3 (proprietary API) – F2LLM-v2-8B offers comparable quality offline without per-query costs. The trade-off is self-hosting overhead.

When to choose F2LLM-v2-8B: You need high multilingual quality, especially for low-resource languages, and have a GPU with at least 8GB VRAM. You want full control over the embedding pipeline and no API dependency.

When to choose a smaller model: If you are constrained to a CPU or an 8GB GPU and must prioritize throughput over maximum accuracy, models like BGE-M3 or multilingual-e5-large are more practical.

Trade-off summary: F2LLM-v2-8B delivers state-of-the-art multilingual embeddings at a size that is demanding but feasible for a single consumer GPU. It is a leading open choice for developers who need an on-premises alternative to large API-based embedders and can allocate 5–8GB of VRAM to a quantized build.

Related Models

Other models from CodeFuse-AI (Ant Group):

F2LLM-v2-14B – 14B, Dense
F2LLM-v2-4B – 4B, Dense
F2LLM-v2-1.7B – 1.7B, Dense
F2LLM-v2-0.6B – 0.596B, Dense
