
Tencent's SOTA Gemma3-12B-based multilingual embedder, #1 on MMTEB as of November 2025.
A solid 11.8B-parameter dense embedding model from Tencent. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 7 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
KaLM-Embedding-Gemma3-12B-2511 is a dense, 11.8-billion-parameter text embedding model developed by Tencent. As of November 2025, it holds the #1 position on the Massive Multilingual Text Embedding Benchmark (MMTEB), outperforming every other model on aggregate Borda rank. This isn’t a general-purpose LLM—it’s a dedicated embedder built for producing high-quality vector representations of text. You use it for retrieval, semantic similarity, clustering, classification, and reranking. It competes directly with models like Qwen3-Embedding-8B and llama-embed-nemotron-8b, and it beats them on the vast majority of MTEB task categories.
Tencent released this model under the Tencent KaLM Embedding Community License. The base architecture is a fine-tuned google/gemma-3-12b-pt, trained on the proprietary KaLM-Embedding/KaLM-embedding-finetuning-data dataset. It is text-only and designed for local inference on engineering and production workloads.
The model is a dense transformer with 11.8 billion parameters—a single, monolithic set of weights. Unlike Mixture-of-Experts (MoE) architectures that route tokens through different subsets of parameters, every token in a dense model uses the full parameter set. This means VRAM consumption is flat: you need enough memory to load the entire 11.8B weight matrix plus activations. The tradeoff is that dense models tend to have simpler, more predictable inference behavior, and they don’t suffer from routing latency or load-balancing issues.
Key specs:
The model loads as a standard sentence-transformers model. It uses safetensors sharded across five files, making it straightforward to run with Transformers, vLLM, or Ollama.
Tencent designed KaLM-Embedding-Gemma3-12B-2511 for multilingual text embedding. The MMTEB results show top scores in:
Concrete business use cases:
gte-Qwen2-7B-instruct) to improve retrieval recall for complex, multilingual knowledge bases.The model is not a chatbot or text generator. It outputs vectors, not tokens. It is also not instruction-tuned for general reasoning—its “task prompts” are for embedding-specific instructions (e.g., “Retrieve the relevant documents” vs “Classify the text”). You feed the prompt + text, get a single embedding.
This is an 11.8B dense model. Running it on consumer hardware requires careful attention to quantization and GPU memory.
| Precision | VRAM (approx) | Notes |
|---|---|---|
| FP16 / BF16 | ~24 GB | Full precision, maximum embedding quality |
| Q4_K_M (4-bit) | ~7–8 GB | Best tradeoff for most users |
| Q3_K_M (3-bit) | ~6 GB | Lower quality, acceptable for retrieval tasks |
| Q2_K (2-bit) | ~4.5 GB | Degraded, only for memory-constrained scenarios |
Minimum hardware: NVIDIA RTX 3090 / RTX 4090 (24 GB) with Q4_K_M quantization. You can also run on an RTX 4070 Ti (16 GB) by using Q4_K_S or Q3_K_M. For Apple Silicon, an M4 Max or M4 Pro with 36 GB or more will handle FP16 sharding via Metal.
Recommended GPU: RTX 4090 or RTX 6000 Ada. For server deployments, A10G (24 GB) or A100 (40/80 GB) let you run FP16 without quantization.
Performance: With Q4_K_M on an RTX 4090, expect ~50–80 tokens per second for encoding single sentences. For batch inference (e.g., 32 inputs of length 512), throughput scales to 150–200 tokens/s. These numbers depend on context length and batch size. The model uses last-token pooling, so inference is fast—no need to decode autoregressively.
Quickest start: Install ollama and import the model:
1ollama pull tencent/kalm-embedding-gemma3:12b
How to run 11.8B on a consumer GPU: If you only have 16 GB, use 4‑bit quantization (llama.cpp with -b 512, or exllamav2 with --quant 4bit). The sentence-transformers library loads the full model; you must convert to GGUF or AWQ. The community has pre-quantized GGUF versions available on Hugging Face.
Running with custom code:
1from sentence_transformers import SentenceTransformer23model = SentenceTransformer("tencent/KaLM-Embedding-Gemma3-12B-2511", device="cuda")4embeddings = model.encode(["Example sentence"], normalize_embeddings=True) # for cosine similarity
vs Qwen3-Embedding-8B: Qwen3-8B is slightly smaller (8B) and cheaper to run. KaLM-Gemma3-12B beats it on mean task score (72.32 vs 70.58) and on key retrieval tasks (75.66 vs 70.88). Qwen3-8B wins on clustering and STS. If your workload is heavy on clustering, Qwen3 is a strong alternative. For retrieval and classification, KaLM is the clear choice.
vs llama-embed-nemotron-8b: Nemotron-8B is competitive on reranking and STS, but KaLM dominates bitext mining, classification, and retrieval. Both are dense 8–12B models. KaLM’s multilingual strength gives it an edge if you need cross-lingual support.
Hardware: Both competitors are smaller (8B), so they run easier on 16 GB GPUs. KaLM at 11.8B demands 4‑bit quantization for the same hardware. If you have 24 GB or more, KaLM provides the best retrieval quality among open‑source embedders at this scale.
License: Tencent’s community license is more permissive than pure research‑only, but check restrictions if you plan to commercialize. Nemotron is under NVIDIA’s Open Model License; Qwen3 is under Apache 2.0.
Choose KaLM-Embedding-Gemma3-12B-2511 when you need multilingual, top‑tier retrieval and classification and have adequate GPU memory. Accept the quantization overhead on consumer cards, or invest in a 24 GB+ GPU. For resource‑constrained setups or clustering‑first pipelines, the smaller alternatives are practical compromises.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Tencent model we track.