
IBM's compact 2B-parameter speech-language model for English/multilingual automatic speech recognition (ASR) and speech translation (AST), built by modality-aligning Granite 3.3 2B Instruct with a conformer acoustic encoder.
A solid 3B-parameter dense audio model from IBM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3070 | RunPod | Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod | Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
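One rough way to factor restarts into a spot-versus-on-demand comparison is to inflate the billed hours by the share of work you expect to redo after interruptions. The sketch below assumes that simple model; `redo_fraction` is a hypothetical planning parameter, not a figure from the listing.

```python
def effective_cost(rate_per_hour: float, job_hours: float, redo_fraction: float = 0.0) -> float:
    """Estimated job cost when a fraction of the work must be repeated.

    redo_fraction is a hypothetical planning input: 0.2 means 20% of the
    job's compute is expected to be lost to interruptions and run again.
    """
    if rate_per_hour < 0 or job_hours < 0 or redo_fraction < 0:
        raise ValueError("inputs must be non-negative")
    return rate_per_hour * job_hours * (1.0 + redo_fraction)


# Spot RTX 4090 rate from the table above, 10-hour job, 20% assumed rework.
spot_estimate = effective_cost(0.13, 10, redo_fraction=0.2)
print(f"estimated spot cost: ${spot_estimate:.2f}")
```

Compare the result against the same job priced at an on-demand rate with `redo_fraction=0` to see whether the spot discount survives the expected rework.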
Granite Speech 3.3 2B is IBM’s compact speech-language model for automatic speech recognition (ASR) and automatic speech translation (AST). It packs 3 billion parameters into a dense architecture that runs on consumer hardware, making local speech processing practical for developers who need to keep audio data off cloud APIs.
This is a two-pass design. The model first transcribes audio to text, then users explicitly call the underlying Granite language model for downstream tasks. It is not an end-to-end speech assistant — it is a transcription and translation engine with a language model attached. That distinction matters for deployment planning.
IBM built this by modality-aligning their Granite 3.3 2B Instruct model with a conformer acoustic encoder. The result is a speech model that competes with dedicated ASR systems while adding multilingual support and speech translation capabilities. It targets enterprise applications where data sovereignty and latency matter more than cloud convenience.
Licensed under Apache 2.0 with a training cutoff of April 2024, Granite Speech 3.3 2B supports English, French, German, Spanish, and Portuguese speech inputs. It outputs text only — there is no speech synthesis or multimodal capability.
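Granite Speech 3.3 2B uses a three-component architecture: a conformer acoustic encoder, a projector, and LoRA adapters applied to the Granite 3.3 2B Instruct language model.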
The two-pass design means the model operates in distinct modes. In speech mode, the encoder, projector, and LoRA adapters handle ASR and AST. Users must make a second call to process the transcribed text through the full Granite language model for tasks like summarization, question answering, or entity extraction.
This separation has practical implications for local deployment. The first pass (transcription) is relatively lightweight. The second pass (language model inference) requires the full 3B parameter model to be loaded. If you only need ASR, you can run the speech components alone and save VRAM.
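The two-pass flow can be sketched as a thin orchestration layer. This is a minimal illustration, not IBM's API: `transcribe` and `generate` are hypothetical callables that you would implement around whichever runtime loads the model, and the default instruction string is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TwoPassResult:
    transcript: str
    answer: Optional[str]  # None when only the speech pass was requested


def run_two_pass(
    audio_path: str,
    transcribe: Callable[[str], str],
    generate: Optional[Callable[[str], str]] = None,
    instruction: str = "Summarize the following transcript:",
) -> TwoPassResult:
    """Pass 1 turns audio into text; optional pass 2 sends that text to the LM.

    `transcribe` and `generate` are hypothetical caller-supplied wrappers,
    not functions provided by any Granite library.
    """
    transcript = transcribe(audio_path)
    if generate is None:
        # Speech-only use: the second call is skipped entirely.
        return TwoPassResult(transcript=transcript, answer=None)
    prompt = f"{instruction}\n\n{transcript}"
    return TwoPassResult(transcript=transcript, answer=generate(prompt))


# Usage with stand-in functions in place of real model calls.
result = run_two_pass(
    "meeting.wav",
    transcribe=lambda path: "hello world",
    generate=lambda prompt: prompt.upper(),
)
```

Keeping the two calls separate in code mirrors the model's design: the caller decides explicitly whether the transcript is the final output or an input to a downstream language task.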
Revision 3.3.2 introduced a deeper acoustic encoder and additional training data, improving English ASR accuracy over the initial release. The model processes audio inputs at variable lengths, with no specified context window — practical limits depend on your hardware and the audio chunking strategy you implement.
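Because no context window is specified, long recordings typically need to be split before transcription. Below is a minimal sketch of one possible chunking strategy: fixed-length windows with a small overlap so words at a boundary appear in both chunks. The 16 kHz sample rate, 30-second window, and 1-second overlap are illustrative assumptions, not values from the model documentation.

```python
from typing import Iterator, Tuple


def chunk_spans(
    num_samples: int,
    sample_rate: int = 16_000,     # assumed sample rate
    chunk_seconds: float = 30.0,   # assumed window length
    overlap_seconds: float = 1.0,  # assumed overlap between windows
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices for overlapping fixed-length chunks."""
    chunk = int(chunk_seconds * sample_rate)
    overlap = int(overlap_seconds * sample_rate)
    if chunk <= 0 or not 0 <= overlap < chunk:
        raise ValueError("need chunk > 0 and 0 <= overlap < chunk")
    step = chunk - overlap
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        yield start, end
        if end == num_samples:
            break  # the final chunk reached the end of the audio
        start += step


# Spans for a 70-second recording at the assumed defaults.
spans = list(chunk_spans(70 * 16_000))
```

Each span can be sliced out of the waveform and transcribed independently; merging the overlapping text at chunk boundaries is left to the caller.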
Granite Speech 3.3 2B handles two primary tasks:
Automatic Speech Recognition (ASR): Transcribes English, French, German, Spanish, and Portuguese audio to text. English ASR is the strongest capability and is benchmarked against dedicated ASR systems. On English benchmarks, the model outperforms several competitors that were trained on orders of magnitude more proprietary data.
Automatic Speech Translation (AST): Translates speech between English and the supported languages (X-En and En-X). This covers French, German, Spanish, and Portuguese, plus additional target languages for translation.
Concrete use cases:
The model is not designed for real-time streaming without additional engineering work. The two-pass design adds latency compared to integrated speech models. For batch processing or near-real-time use with appropriate chunking, it performs well on consumer hardware.
This is where Granite Speech 3.3 2B differentiates itself. At 3B parameters in a dense architecture, it fits on hardware that would struggle with 7B+ models.
| Quantization | Minimum VRAM | Recommended VRAM |
|---|---|---|
| FP16 | 6 GB | 8 GB |
| Q8_0 | 4 GB | 6 GB |
| Q4_K_M | 2.5 GB | 4 GB |
| Q4_0 | 2 GB | 3 GB |
For most users, Q4_K_M is the sweet spot. It preserves enough precision for accurate transcription while fitting comfortably on 4 GB VRAM cards.
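The table can also be encoded as a small lookup that picks a quantization for a given VRAM budget. This is a sketch of one possible selection policy (highest precision that fits); the function name and the `use_recommended` flag are illustrative, and the numbers are copied from the table above.

```python
from typing import Optional

# (name, minimum VRAM in GB, recommended VRAM in GB), highest precision first.
QUANT_TABLE = [
    ("FP16", 6.0, 8.0),
    ("Q8_0", 4.0, 6.0),
    ("Q4_K_M", 2.5, 4.0),
    ("Q4_0", 2.0, 3.0),
]


def pick_quantization(vram_gb: float, use_recommended: bool = True) -> Optional[str]:
    """Return the highest-precision quantization that fits, or None if none do."""
    for name, minimum, recommended in QUANT_TABLE:
        needed = recommended if use_recommended else minimum
        if vram_gb >= needed:
            return name
    return None


print(pick_quantization(4))                         # Q4_K_M
print(pick_quantization(4, use_recommended=False))  # Q8_0
```

Using the recommended column leaves headroom for activations and other processes sharing the GPU; the minimum column is the tighter fit.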
On an RTX 4090 at Q4_K_M, expect 50-80 tokens per second for the transcription pass. The language model pass runs at typical 3B model speeds: 40-60 tokens per second at Q4_K_M. On an M4 Max, performance is comparable at 45-70 tokens per second.
On lower-end hardware like an RTX 3060 at Q4_K_M, expect 20-35 tokens per second for transcription. This is sufficient for batch processing but may introduce latency for interactive use.
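To turn these throughput ranges into a rough wait time, divide the expected output length by the tokens-per-second figures. The sketch below counts token generation only and ignores model loading and audio encoding; the 1,500-token transcript length is a hypothetical example, not a measured value.

```python
from typing import Tuple


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def time_range(num_tokens: int, low_tps: float, high_tps: float) -> Tuple[float, float]:
    """Return (best case, worst case) seconds for a throughput range."""
    return (
        generation_seconds(num_tokens, high_tps),
        generation_seconds(num_tokens, low_tps),
    )


# Hypothetical 1,500-token transcript at the RTX 3060 range quoted above.
best, worst = time_range(1500, low_tps=20, high_tps=35)
print(f"{best:.0f}-{worst:.0f} seconds")
```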
The quickest path is through Ollama. Pull the model and run:
    ollama run granite-speech:3.3-2b
For more control over quantization and inference parameters, use llama.cpp or the Hugging Face transformers library. The model card on Hugging Face provides example notebooks for the two-pass workflow.
vs. Whisper small (244M parameters) — Whisper is smaller and faster for pure ASR, but Granite Speech 3.3 2B offers significantly better accuracy on English benchmarks and adds speech translation. If you need multilingual ASR on constrained hardware, Whisper small is lighter. If accuracy and translation matter, Granite Speech wins.
vs. SeamlessM4T v2 (2.3B) — Meta’s model covers more languages and does speech-to-speech translation. Granite Speech 3.3 2B is more accurate on English ASR and runs more efficiently on consumer hardware due to its dense architecture. SeamlessM4T v2 is better for broad multilingual coverage; Granite Speech is better for focused English/major European language workflows.
vs. Granite Speech 3.3 8B — The 8B variant offers higher accuracy but requires significantly more VRAM (12 GB+ at Q4_K_M). The 2B model is the practical choice for running on consumer GPUs and laptops. If you have the hardware, the 8B version benchmarks better. If you need to run on a single consumer GPU, the 2B model is the right call.
The tradeoff is straightforward: Granite Speech 3.3 2B trades some accuracy for deployability. It is not the best ASR model on the market, but it is one of the best you can run locally on a laptop or mid-range GPU. For developers who need speech processing without cloud dependency, that tradeoff is worth evaluating against your accuracy requirements.
