
IBM's compact 2B-parameter speech-language model for multilingual ASR and bidirectional speech translation, ranked #1 on the OpenASR multilingual leaderboard (5.52 average WER) while running efficiently on edge devices.
A solid 2B-parameter dense audio model from IBM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
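One rough way to account for restarts is to inflate the spot rate by the fraction of work you expect to redo after interruptions. The sketch below is illustrative only: `effective_cost` and the 25% overhead figure are assumptions, not numbers from RunPod or Vast.ai.

```python
def effective_cost(rate_per_hour: float, job_hours: float,
                   restart_overhead: float = 0.0) -> float:
    """Total cost of a job, padded by the share of work redone after interruptions.

    restart_overhead is a hypothetical fraction (0.25 means 25% extra runtime).
    """
    if restart_overhead < 0:
        raise ValueError("restart_overhead must be non-negative")
    return rate_per_hour * job_hours * (1.0 + restart_overhead)


# A 10-hour job on a $0.13/hour spot GPU, assuming 25% of the work is repeated.
spot_total = effective_cost(0.13, 10, restart_overhead=0.25)

# The on-demand rate at which the same job would cost the same amount.
break_even_on_demand_rate = spot_total / 10
```

If the on-demand price you can get is below the break-even rate, the interruptible tier is not actually cheaper under that overhead assumption.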
Granite 4.0 1B Speech is IBM’s latest compact speech-language model for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST). Despite the “1B” in its name, this is a 2B-parameter dense model — half the size of its predecessor, granite-speech-3.3-2b, while delivering higher English transcription accuracy and faster inference. It recently claimed the #1 spot on the Hugging Face OpenASR leaderboard with an average Word Error Rate (WER) of 5.52, beating larger open models and even proprietary systems.
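For readers new to the metric, WER counts the word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. The sketch below is a minimal illustration: the function name `word_error_rate` and the example sentences are made up here, and real evaluations typically normalize casing and punctuation first. Leaderboards generally report the value as a percentage, so a score of 5.52 would presumably correspond to 0.0552 on this scale.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One of six reference words is dropped, so the WER is 1/6.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```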
This model is built for practitioners who need on‑device speech processing — no cloud dependency, no data leaving the machine. It occupies a sweet spot between dedicated ASR engines (like Whisper variants) and larger speech‑language models that require datacenter GPUs. With its Apache 2.0 license and native support in transformers and vLLM, Granite 4.0 1B Speech is immediately usable in local pipelines for transcription, translation, and keyword‑driven recognition.
Granite 4.0 1B Speech uses a dense encoder‑decoder design with 2 billion total parameters. Unlike mixture‑of‑experts (MoE) models where only a subset of parameters activates per token, this dense architecture keeps the full parameter set active at all times. For a 2B model, that means predictable memory usage and consistent throughput — no token‑dependent VRAM spikes.
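Because every parameter is always resident, the weight footprint can be approximated with simple arithmetic: parameter count times bits per weight. The sketch below assumes the 2 billion parameter figure quoted above and covers weights only, so activations, audio buffers, and the key-value cache come on top. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    return num_params * bits_per_weight / 8 / 1e9


fp16_weights_gb = weight_memory_gb(2e9, 16)  # 16 bits per weight
int4_weights_gb = weight_memory_gb(2e9, 4)   # idealized 4-bit storage, no metadata
```

Real quantization formats store scales and other metadata alongside the weights, so actual files are somewhat larger than the idealized 4-bit figure.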
The model is built by modality‑aligning the granite-4.0-1b-base language model to speech using public and synthetic audio‑text corpora. A two‑pass pipeline is used: first, audio is processed by the acoustic encoder to produce a transcription (or translation), then the underlying Granite language model can be invoked separately for downstream text tasks. This separation gives developers explicit control over the workflow.
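The two-pass flow described above can be sketched as a small orchestration function. This is a shape-only illustration: `run_two_pass`, `fake_transcribe`, and `fake_text_task` are hypothetical stand-ins, not part of any IBM or transformers API, and the second pass is optional to reflect that the caller decides whether to invoke the language model.

```python
from typing import Callable, Dict, Optional


def run_two_pass(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    text_task: Optional[Callable[[str], str]] = None,
) -> Dict[str, Optional[str]]:
    """Pass 1 turns audio into text; pass 2 (optional) runs a text-only task on it."""
    transcript = transcribe(audio)
    result: Dict[str, Optional[str]] = {"transcript": transcript, "text_output": None}
    if text_task is not None:
        # The second pass sees only text, so it can be swapped or skipped freely.
        result["text_output"] = text_task(transcript)
    return result


# Stand-ins for the speech pass and the downstream language-model pass.
def fake_transcribe(audio: bytes) -> str:
    return "meeting moved to friday"


def fake_text_task(text: str) -> str:
    return "Summary: " + text


only_asr = run_two_pass(b"\x00\x01", fake_transcribe)
asr_then_text = run_two_pass(b"\x00\x01", fake_transcribe, fake_text_task)
```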
Key architectural details:
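llama.cpp and Ollama via conversion.

Granite 4.0 1B Speech is optimized for two core tasks: multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST).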
Concrete use cases:
This model is designed for consumer and edge hardware. Here’s what you need to know to run it on‑prem.
| Quantization | Minimum VRAM | Recommended VRAM | Typical RAM (system) |
|---|---|---|---|
| FP16 (full) | 4.2 GB | 6 GB | 8 GB |
| Q8_0 | 2.8 GB | 4 GB | 6 GB |
| Q4_K_M | 1.6 GB | 2 GB | 4 GB |
| Q3_K_S | 1.2 GB | 2 GB | 4 GB |
Because the architecture is dense, weight memory scales roughly linearly with the number of bits stored per weight. For most users, Q4_K_M is the sweet spot: it preserves >98% of FP16 accuracy while cutting minimum VRAM from 4.2 GB to 1.6 GB. The VRAM figures include the model weights, the audio input buffer, and the key‑value cache for the language decoder.
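The table can be turned into a simple lookup that picks the highest-precision quantization fitting a given GPU. The sketch below copies the VRAM numbers from the table above. The function name `pick_quantization` and the choice to check against recommended rather than minimum VRAM by default are assumptions made for illustration.

```python
from typing import Optional

# (quantization, minimum VRAM in GB, recommended VRAM in GB), highest precision first.
VRAM_TABLE = [
    ("FP16", 4.2, 6.0),
    ("Q8_0", 2.8, 4.0),
    ("Q4_K_M", 1.6, 2.0),
    ("Q3_K_S", 1.2, 2.0),
]


def pick_quantization(available_vram_gb: float,
                      use_recommended: bool = True) -> Optional[str]:
    """Return the highest-precision level that fits, or None if nothing fits."""
    for name, minimum_gb, recommended_gb in VRAM_TABLE:
        needed_gb = recommended_gb if use_recommended else minimum_gb
        if available_vram_gb >= needed_gb:
            return name
    return None


laptop_choice = pick_quantization(8.0)                            # 8 GB laptop GPU
tight_choice = pick_quantization(1.5, use_recommended=False)      # minimum-VRAM check
```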
With speculative decoding enabled and Q4_K_M quantization on an RTX 4090, you can expect:
On a laptop RTX 4060 with 8 GB of VRAM, throughput drops to 15–25 tokens per second, which is still comfortably above real time for most applications.
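To see why 15–25 tokens per second can still count as faster than real time, compare the decode rate with the rate at which speech produces tokens. The sketch below rests on two loudly labeled assumptions that are not from the source: conversational speech at roughly 150 words per minute, and about 1.3 tokens per word. It also ignores audio encoding time, so treat the result as a rough upper bound.

```python
def realtime_factor(decode_tokens_per_s: float,
                    words_per_minute: float = 150.0,   # assumed speaking rate
                    tokens_per_word: float = 1.3) -> float:  # assumed tokenizer ratio
    """How many seconds of speech are transcribed per second of decoding."""
    speech_tokens_per_s = words_per_minute / 60.0 * tokens_per_word
    if speech_tokens_per_s <= 0:
        raise ValueError("speaking rate and tokens_per_word must be positive")
    return decode_tokens_per_s / speech_tokens_per_s


slow_end = realtime_factor(15.0)  # lower end of the laptop range quoted above
fast_end = realtime_factor(25.0)  # upper end of the laptop range quoted above
```

A factor above 1.0 means decoding keeps up with live speech under these assumptions.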
The quickest way to run Granite 4.0 1B Speech locally is via Ollama. Once the model is converted to GGUF format (IBM provides a conversion script), you can pull it with:
```
ollama run granite-4.0-1b-speech:q4_K_M
```
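Alternatively, use the transformers library directly with the speculative decoding flag:

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "ibm-granite/granite-4.0-1b-speech"
# Earlier Granite Speech checkpoints load through the speech seq2seq auto class
# rather than AutoModelForCTC; this is assumed to hold for this checkpoint too.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
# Enable speculative decoding (attribute name as given on this page; it is not a
# documented transformers option, so verify it against the model card)
model.generation_config.speculative_decoding = True
```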
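The page does not describe how Granite implements speculative decoding, so the following is only a toy illustration of the general greedy draft-and-verify idea: a cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is kept. `target_next` and `draft_next` are made-up stand-ins for real models. In a real system the verification is a single batched forward pass, which is where the speed-up comes from; here the target is called token by token for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: the target model generates every token itself."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Draft proposes up to k tokens; target keeps the agreeing prefix."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = list(prompt)
    produced = 0
    while produced < max_new:
        # 1) Draft model proposes a short continuation.
        proposal: List[int] = []
        ctx = list(seq)
        for _ in range(min(k, max_new - produced)):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2) Target verifies; the first mismatch is replaced by the target's token.
        accepted: List[int] = []
        ctx = list(seq)
        for token in proposal:
            expected = target(ctx)
            accepted.append(expected)
            ctx.append(expected)
            if expected != token:
                break
        seq.extend(accepted)
        produced += len(accepted)
    return seq[len(prompt):]


def target_next(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_next(seq: List[int]) -> int:
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


baseline = greedy_decode(target_next, [0], 12)
speculative = speculative_decode(target_next, draft_next, [0], 12, k=4)
```

The key property is that the speculative output is identical to the baseline: the draft model only affects speed, never the result.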
At the 2B parameter scale, Granite 4.0 1B Speech competes primarily with OpenAI's Whisper Large (1.5B parameters) and Whisper Medium (0.8B). Here’s how they stack up:
| Model | Parameters | OpenASR Average WER | Multilingual | Keyword Biasing | Speculative Decoding | License |
|---|---|---|---|---|---|---|
| Granite 4.0 1B Speech | 2B | 5.52 | 6 languages | Yes | Yes | Apache 2.0 |
| Whisper Large | 1.5B | ~7.8 (est.) | 96 languages | No | No | MIT |
| Whisper Medium | 0.8B | ~9.1 (est.) | 96 languages | No | No | MIT |
When to choose Granite 4.0 1B Speech: You need the best possible WER in the 2B class, especially for English and the supported European languages plus Japanese. Keyword biasing is a genuine differentiator for enterprise use cases, and the table above lists no equivalent for Whisper. You also benefit from speculative decoding for faster inference.
When to choose Whisper Large/Medium: Your workflow requires support for languages outside Granite’s six (e.g., Arabic, Mandarin, Hindi), or you need a model that is already packaged in many open‑source tools with minimal conversion. Whisper also has more established community documentation for edge deployment.
Tradeoffs: Granite’s dense 2B architecture uses slightly more VRAM than Whisper Large (which is also dense), about 0.5 GB extra at the same quantization. The real penalty is language coverage: if you need 96 languages, Whisper is the only viable choice in this weight class. For the languages it does support, Granite’s reported WER is clearly lower, bearing in mind that the Whisper figures here are estimates.
