
IBM's flagship 8B-parameter speech-language model for high-accuracy ASR and speech translation, modality-aligning Granite 3.3 8B Instruct with a conformer encoder for state-of-the-art English transcription among open models.
A workable 9B-parameter dense audio model from IBM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 6 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM | $0.03 |
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM | $0.03 |
IBM’s Granite Speech 3.3 8B is a speech-language model that performs automatic speech recognition (ASR) and automatic speech translation (AST) in a compact 9B-parameter dense architecture. It is built by modality-aligning the existing Granite 3.3 8B Instruct text model with a conformer acoustic encoder, trained on publicly available datasets. The result is a state‑of‑the‑art open‑source English ASR model that also handles multilingual speech input (English, French, German, Spanish, Portuguese) and translates between those languages and English.
Granite Speech operates as a two‑pass system: the first pass transcribes audio to text; the second pass invokes the underlying Granite language model for tasks like translation, summarization, or question‑answering on the transcribed text. This explicit decoupling makes it straightforward to debug and integrate, and it preserves Granite’s original text capabilities (including safety alignments) when used in text-only mode.
Positioned against models like Whisper large‑v3 and SeamlessM4T‑v2, Granite Speech 3.3 8B matches or exceeds them on English ASR despite training on orders of magnitude less proprietary data. Its Apache 2.0 license and open release make it a strong candidate for local, privacy‑sensitive deployments.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every IBM model we track.

Explore the Family
The full Granite Speech family leaderboard with sizes, benchmark scores, and a release timeline.
Granite Speech 3.3 8B is a dense model with 9B total parameters—there are no mixture‑of‑experts gimmicks. For local inference, this means VRAM consumption scales linearly with the full parameter count rather than with a smaller active subset, but the architecture is straightforward to quantize and run on consumer hardware.
The speech‑specific components are:
In speech mode, the encoder, projector, and LoRA adapters are active. In text mode, those components are bypassed and the core Granite 3.3 8B Instruct runs directly (without LoRA), preserving its original text‑based reasoning and safety.
Context length is not officially specified, but the underlying Granite 3.3 8B supports at least 8K tokens. For ASR, output length is roughly proportional to audio duration; for AST, the translated text typically adds minimal overhead. Practitioners planning long‑form transcription should test with their own audio lengths using available quantized versions.
Granite Speech 3.3 8B excels at two tasks:
Concrete use cases:
The two‑pass design means you must explicitly call the language model for translation after the ASR call. This is not a limitation for most batch workflows but is something to account for in streaming or real‑time applications.
At full FP16 precision, a 9B parameter model requires roughly 18 GB of VRAM (9B × 2 bytes). Add overhead for the encoder and adapter weights, and you should plan for 20–24 GB of GPU memory for unaudited inference.
Quantization brings this into reach of consumer hardware:
| Quantization | VRAM (approx.) | Quality tradeoff |
|---|---|---|
| Q4_K_M | 6–7 GB | Negligible ASR accuracy loss |
| Q5_K_M | 8–9 GB | Near‑FP16 quality |
| Q8_0 | 11–12 GB | Minimal loss |
| FP16 | 20–24 GB | Reference quality |
Recommended quantization for most users: Q4_K_M. It fits comfortably on an 8 GB RTX 3070/4060 Ti or an Apple M1/M2 with unified memory above 16 GB. For optimal speed on an RTX 4090 (24 GB), try Q8_0 or Q5_K_M to maximize token generation rate while staying under VRAM limits.
Consumer GPUs that can run it:
Speed depends on audio length and quantization. ASR token output is typically short (a few hundred tokens per minute of speech), so throughput is less critical than in text generation. On an RTX 4090 with Q5_K_M, expect 80–120 tokens/second for the language model pass. The acoustic encoder (conformer) adds about 0.1–0.3 seconds per file for typical lengths (30 seconds of speech). Total end‑to‑end time for a 5‑minute audio file is usually under 2 seconds on fast GPUs.
The easiest way to run locally is via Ollama (if the model community ports it) or directly using llama.cpp with quantization support. Example command using llama-server:
1./llama-server -m granite-speech-3.3-8b-Q4_K_M.gguf --no-mmap --ngl 35
Then send audio as base64 or via a file endpoint. See the model’s GitHub repository for inference scripts in Jupyter notebooks.
| Model | Parameters | Modality | Strengths |
|---|---|---|---|
| Granite Speech 3.3 8B | 9B | ASR + AST | Top‑tier English ASR, Apache‑2.0, open data |
| Whisper large‑v3 | 1.5B | ASR + AST | Multilingual (100+ languages), larger model |
| SeamlessM4T‑v2 | 2.3B | ASR + AST + TTS | Many‑to‑many translation, but larger and heavier |
Choose Granite Speech 3.3 8B when:
Choose Whisper large‑v3 when:
Choose SeamlessM4T‑v2 when:
Granite Speech’s two‑pass design means you cannot run it as a single end‑to‑end model for translated speech output in one call; you must chain the ASR and then the LLM call. This is a minor concession for the benefit of a clearly separated, debuggable pipeline.
| $0.03 |
NVIDIA GeForce RTX 3080 TiVast.ai · On-Demand · 12 GB VRAM | $0.04 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.