
SOTA sub-1B multilingual embedding model, distilled from a Qwen3-Embedding-4B teacher.
A strong 0.596B-parameter dense embedding model from Jina AI. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3070 | RunPod | Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod | Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
jina-embeddings-v5-text-small is a dense, multilingual text embedding model with 0.596B parameters, developed by Jina AI and released on February 18, 2026. It is the mid-size entry in the jina-embeddings-v5 family, sitting between the nano (239M) and any future full-size variant. The model is built on a Qwen3-0.6B-Base backbone and uses a two-stage training pipeline: knowledge distillation from a much larger Qwen3-Embedding-4B teacher, followed by task-specific contrastive fine-tuning with four dedicated LoRA adapters.
What sets this model apart is its ability to deliver state-of-the-art performance on MTEB English v2 (71.7 average) and MMTEB (67.7) for a model under 1B parameters. It supports 119+ languages natively, produces 1024‑dimensional embeddings via last-token pooling, and handles context lengths up to 32,768 tokens — a significant jump from earlier sub‑1B embedding models.
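Last-token pooling can be illustrated with a minimal sketch. This is not Jina's implementation: it assumes right-padded batches, the toy 3-dimensional vectors stand in for the model's 1024-dimensional hidden states, and the function names `last_token_pool` and `l2_normalize` are hypothetical. Normalizing the pooled vector is shown as common practice, not as something this page confirms.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last real (non-padding) token per sequence.

    Assumes right padding: real tokens come first, padding after them.
    """
    pooled = []
    for tokens, mask in zip(hidden_states, attention_mask):
        real_tokens = sum(mask)
        if real_tokens == 0:
            raise ValueError("sequence has no real tokens")
        pooled.append(tokens[real_tokens - 1])
    return pooled


def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


# Toy batch: two sequences, three token positions, 3-dim hidden states.
hidden_states = [
    [[1.0, 0.0, 0.0], [3.0, 4.0, 0.0], [9.0, 9.0, 9.0]],  # third token is padding
    [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 5.0]],  # no padding
]
attention_mask = [[1, 1, 0], [1, 1, 1]]

pooled = last_token_pool(hidden_states, attention_mask)
embeddings = [l2_normalize(v) for v in pooled]
```

In the first sequence the padded position is skipped, so the second token's vector becomes the sentence embedding. With left-padded batches the last position would always be the right one, so the index arithmetic here would need to change.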
This is a practical, lightweight choice for developers who need strong multilingual retrieval, clustering, or classification on local hardware. It competes directly with models like jina-embeddings-v3 (570M) and bge-m3 (570M), but offers a better performance-per-parameter ratio thanks to its distillation‑based training.
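The model uses a standard dense transformer architecture. Its key specifications are 0.596B parameters, a Qwen3-0.6B-Base backbone, 1024-dimensional embeddings from last-token pooling, a 32,768-token context window, and four task-specific LoRA adapters.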
The model is designed for local deployment. Jina provides GGUF quantized versions (Q2_K through Q8_0) and MLX weights for Apple Silicon. Because the backbone is dense and relatively small (0.596B parameters), even the unquantized BF16 weights occupy only about 1.2 GB (two bytes per parameter). With 4-bit quantization (Q4_K_M) the model size drops to ~380 MB, making it feasible on low-VRAM GPUs and even on modern laptops.
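The size figures can be sanity-checked with simple arithmetic. The sketch below covers weights only, using the 0.596B parameter count and the ~380 MB Q4_K_M figure quoted on this page; the helper names are hypothetical, and activations and runtime buffers are deliberately left out.

```python
PARAMS = 0.596e9  # parameter count stated on this page


def weight_size_mb(num_params, bits_per_weight):
    """Size of the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


def implied_bits_per_weight(num_params, size_mb):
    """Average bits per weight implied by a quoted model size."""
    return size_mb * 1e6 * 8 / num_params


bf16_mb = weight_size_mb(PARAMS, 16)            # BF16 stores 16 bits per weight
q4_bits = implied_bits_per_weight(PARAMS, 380)  # from the ~380 MB Q4_K_M figure

print(f"BF16 weights: {bf16_mb:.0f} MB")
print(f"Q4_K_M implied bits per weight: {q4_bits:.2f}")
```

The BF16 result is about 1,192 MB, which matches the ~1.2 GB figure. The Q4_K_M figure implies roughly 5.1 bits per weight, which is plausible for a nominally 4-bit scheme that keeps some tensors at higher precision. Actual VRAM use will be higher than the weight size, especially with long inputs.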
The model is trained for four distinct tasks, each with its own LoRA adapter.
Concrete use cases where this model excels include multilingual retrieval, clustering, and classification on local hardware.
The model is not suited for generative tasks or code generation; it is a text-to-vector model only.
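Because the model only maps text to vectors, a task such as retrieval reduces to comparing those vectors. The sketch below assumes the embeddings have already been computed elsewhere. The 2-dimensional toy vectors stand in for real 1024-dimensional outputs, and `cosine_similarity` and `rank_documents` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dim vectors standing in for real 1024-dim embeddings.
query = [1.0, 0.0]
documents = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.5, 0.5]}
ranking = rank_documents(query, documents)
```

For more than a few thousand documents, a vector index would normally replace this linear scan. The scoring idea stays the same.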
The model’s small size makes it one of the most accessible sub‑1B embedding models for local inference:
| Quantization | VRAM (approx.) | Example Hardware |
|---|---|---|
| BF16 (unquantized) | ~1.2 GB | Any GPU with 2+ GB VRAM |
| Q8_0 | ~700 MB | RTX 3060, RX 6600, M1 Pro |
| Q4_K_M | ~380 MB | GTX 1650, M1, Intel Arc |
| Q2_K | ~250 MB | Integrated GPUs, 8 GB RAM CPU |
Minimum: A GPU with at least 2 GB of VRAM (e.g., GTX 1060) can run the Q4_K_M variant comfortably. For CPU-only inference, 4-bit quantized models run well on any modern x86 processor with AVX2 support.
Recommended: An RTX 4090 or M4 Max will run the BF16 model with throughput exceeding 500 tokens per second on batch sizes of 32. For high‑throughput production on a single GPU, Q8_0 offers the best speed‑quality trade‑off.
ollama run jina-embeddings-v5-text-small:q4_k_m
This downloads the pre‑quantized GGUF model and exposes a /api/embed endpoint compatible with standard libraries. On a desktop with an RTX 4090, expect 1500–2500 tokens/sec for single‑sequence encoding. On an M4 MacBook Pro, 800–1200 tokens/sec (Q4_K_M with Metal acceleration).
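A minimal client for the `/api/embed` endpoint might look like the sketch below. It assumes Ollama is listening on its default local address and that the model tag from the command above is available there; the tag is copied from this page and has not been verified. The helper names `build_embed_request` and `embed` are hypothetical.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # Ollama's default local address
MODEL_TAG = "jina-embeddings-v5-text-small:q4_k_m"  # tag from this page, not verified


def build_embed_request(texts, model=MODEL_TAG, host=OLLAMA_HOST):
    """Build a POST request for Ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, timeout=60):
    """Send the request and return one embedding (list of floats) per input text."""
    request = build_embed_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["embeddings"]


request = build_embed_request(["hello world", "hallo Welt"])
# vectors = embed(["hello world", "hallo Welt"])  # needs a running Ollama server
```

Sending several texts in one `input` list usually costs less than one request per text, since it avoids repeated HTTP round trips.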
vs jina-embeddings-v3 (570M parameters, same provider):
v5-small outperforms v3 on MTEB English v2 by ~3 points and MMTEB by ~2 points, thanks to the distillation‑based training from a 4B teacher. Both models support 119+ languages, but v5 has a longer context window (32K vs 8K) and Matryoshka flexibility. Choose v5-small if you need better retrieval accuracy on long documents; v3 remains a solid option if you rely on its older, well‑tested pipeline.
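Matryoshka-style embeddings are typically used by keeping the leading dimensions of a vector and renormalizing. The sketch below shows that general technique, not a documented API of this model. The truncation sizes the model was trained for are not stated here, so any target dimension is an assumption, and `truncate_embedding` is a hypothetical helper.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and the embedding length")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]  # toy stand-in for a 1024-dim embedding
short = truncate_embedding(full, 2)
```

Shorter vectors reduce index size and speed up similarity search, usually at some cost in accuracy. Queries and documents must be truncated to the same length before they are compared.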
vs bge-m3 (570M parameters, BAAI):
bge-m3 is a strong multilingual embedding model trained on a massive corpus, with similar performance on MTEB (English) but lower scores on MMTEB. v5-small edges ahead on multilingual tasks and offers adapter-based task specialization: the model can be specialized for a specific task without retraining the whole backbone. Both models produce 1024-dimensional embeddings, but bge-m3 lacks Matryoshka support. For a single-model solution covering many tasks, v5-small is more flexible; bge-m3 is better if you need a no-frills, well-documented drop-in replacement for existing systems.
In summary, jina-embeddings-v5-text-small delivers best‑in‑class efficiency for local multilingual embedding workloads. Its 0.596B size, 32K context, and task adapters make it a practical choice for developers who need high accuracy without cloud dependencies.
