IBM

Granite Speech 3.3 2B

IBM's compact 2B-parameter speech-language model for English/multilingual automatic speech recognition (ASR) and speech translation (AST), built by modality-aligning Granite 3.3 2B Instruct with a conformer acoustic encoder.

3B paramsDense

View on Hugging Face Source Code Official Page

Our Take

Best for: Open-source asr workloads

A solid 3B-parameter dense audio model from IBM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters3B

ArchitectureDense

Training Cutoff2024-04

ProviderIBM

Download Size7.9 GB

Community

Monthly Downloads505.4K

Likes55

Last Updated2 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Apache 2.0View Full License

Performance & Scoring

Benchmarks

WER

6.0%

MBA Open Score

58.0BB

Benchmark40%

88.0

Popularity25%

52.2

Efficiency25%

15.2

Versatility10%

60.0

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	2.3 GB
Acer Veriton GN100 AI MiniAcer	SS	2.3 GB
AMD Instinct MI300XAMD	SS	2.3 GB
AMD Instinct MI325XAMD	SS	2.3 GB
AMD Instinct MI355XAMD	SS	2.3 GB
AMD Radeon RX 7600 8GBAMD	SS	2.3 GB
AMD Radeon RX 7700 XTAMD	SS	2.3 GB
AMD Radeon RX 7800 XTAMD	SS	2.3 GB
AMD Radeon RX 7900 XTAMD	SS	2.3 GB
AMD Radeon RX 7900 XTXAMD	SS	2.3 GB
AMD Radeon RX 9070AMD	SS	2.3 GB
AMD Radeon RX 9070 XTAMD	SS	2.3 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	SS	2.3 GB
Apple M4Apple	SS	2.3 GB
Apple M4 Max (40-core GPU)Apple	SS	2.3 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	SS	2.3 GB
Apple M5Apple	SS	2.3 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	SS	2.3 GB
Apple M5 Pro (18-core CPU, 20-core GPU)Apple	SS	2.3 GB
Apple Mac Mini (M1, 2020)Apple	SS	2.3 GB
Apple Mac Mini (M2, 2023)Apple	SS	2.3 GB
Apple Mac Mini (M2 Pro, 2023)Apple	SS	2.3 GB
Apple Mac Mini (M4, 2024)Apple	SS	2.3 GB
Apple Mac Mini (M4 Pro, 2024)Apple	SS	2.3 GB
Apple Mac Studio (M1 Max, 2022)Apple	SS	2.3 GB

Rows per page

Page 1 of 5

Rent in the Cloud

Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

Granite Speech 3.3 2B is IBM’s compact speech-language model for automatic speech recognition (ASR) and automatic speech translation (AST). It packs 3 billion parameters into a dense architecture that runs on consumer hardware, making local speech processing practical for developers who need to keep audio data off cloud APIs.

This is a two-pass design. The model first transcribes audio to text, then users explicitly call the underlying Granite language model for downstream tasks. It is not an end-to-end speech assistant — it is a transcription and translation engine with a language model attached. That distinction matters for deployment planning.

IBM built this by modality-aligning their Granite 3.3 2B Instruct model with a conformer acoustic encoder. The result is a speech model that competes with dedicated ASR systems while adding multilingual support and speech translation capabilities. It targets enterprise applications where data sovereignty and latency matter more than cloud convenience.

Licensed under Apache 2.0 with a training cutoff of April 2024, Granite Speech 3.3 2B supports English, French, German, Spanish, and Portuguese speech inputs. It outputs text only — there is no speech synthesis or multimodal capability.

Architecture & Technical Details

Granite Speech 3.3 2B uses a three-component architecture:

Conformer acoustic encoder with block attention and self-conditioning, trained with connectionist temporal classification (CTC)
Windowed query-transformer speech adapter that downsamples acoustic embeddings and maps them to the LLM’s text embedding space
Granite 3.3 2B Instruct as the underlying language model, with LoRA adapters for speech fine-tuning

The two-pass design means the model operates in distinct modes. In speech mode, the encoder, projector, and LoRA adapters handle ASR and AST. Users must make a second call to process the transcribed text through the full Granite language model for tasks like summarization, question answering, or entity extraction.

This separation has practical implications for local deployment. The first pass (transcription) is relatively lightweight. The second pass (language model inference) requires the full 3B parameter model to be loaded. If you only need ASR, you can run the speech components alone and save VRAM.

Revision 3.3.2 introduced a deeper acoustic encoder and additional training data, improving English ASR accuracy over the initial release. The model processes audio inputs at variable lengths, with no specified context window — practical limits depend on your hardware and the audio chunking strategy you implement.

Capabilities & Use Cases

Granite Speech 3.3 2B handles two primary tasks:

Automatic Speech Recognition (ASR) — Transcribes English, French, German, Spanish, and Portuguese audio to text. English ASR is the strongest capability, benchmarked against dedicated ASR systems. The model outperforms several competitors trained on orders of magnitude more proprietary data on English benchmarks.

Automatic Speech Translation (AST) — Translates speech between English and the supported languages (X-En and En-X). This covers French, German, Spanish, Portuguese, plus additional languages for translation targets.

Concrete use cases:

Local call transcription — Process meeting recordings or customer service calls entirely on-premise
Multilingual content pipelines — Transcribe and translate audio in supported languages without cloud API calls
Voice-controlled applications — Feed transcribed text to a local LLM for command interpretation
Accessibility tools — Generate real-time captions for live audio streams
Data preparation — Create transcriptions for fine-tuning other models on proprietary audio data

The model is not designed for real-time streaming without additional engineering work. The two-pass design adds latency compared to integrated speech models. For batch processing or near-real-time use with appropriate chunking, it performs well on consumer hardware.

Running Granite Speech 3.3 2B Locally

This is where Granite Speech 3.3 2B differentiates itself. At 3B parameters in a dense architecture, it fits on hardware that would struggle with 7B+ models.

VRAM Requirements by Quantization

Quantization	Minimum VRAM	Recommended VRAM
FP16	6 GB	8 GB
Q8_0	4 GB	6 GB
Q4_K_M	2.5 GB	4 GB
Q4_0	2 GB	3 GB

For most users, Q4_K_M is the sweet spot. It preserves enough precision for accurate transcription while fitting comfortably on 4 GB VRAM cards.

Consumer Hardware That Works

NVIDIA RTX 3060 (12 GB) — Run Q8_0 or FP16 with headroom for the second-pass language model
NVIDIA RTX 4060 (8 GB) — Q4_K_M or Q8_0, depending on whether you need the language model pass
NVIDIA RTX 4090 (24 GB) — FP16 with room for batch processing and concurrent tasks
Apple M4 Max (36-48 GB unified) — FP16 with substantial headroom
Apple M3 Pro (18 GB) — Q4_K_M comfortably, Q8_0 if you limit batch size
Radeon RX 7600 (8 GB) — Q4_K_M through Vulkan or ROCm
Intel Arc A770 (16 GB) — Q8_0 via OpenVINO or DirectML

Expected Performance

On an RTX 4090 at Q4_K_M, expect 50-80 tokens per second for the transcription pass. The language model pass runs at typical 3B model speeds: 40-60 tokens per second at Q4_K_M. On an M4 Max, performance is comparable at 45-70 tokens per second.

On lower-end hardware like an RTX 3060 at Q4_K_M, expect 20-35 tokens per second for transcription. This is sufficient for batch processing but may introduce latency for interactive use.

Getting Started

The quickest path is through Ollama. Pull the model and run:

1ollama run granite-speech:3.3-2b

For more control over quantization and inference parameters, use llama.cpp or Hugging Face Transformers with the transformers library. The model card on Hugging Face provides example notebooks for the two-pass workflow.

How It Compares

vs. Whisper small (244M parameters) — Whisper is smaller and faster for pure ASR, but Granite Speech 3.3 2B offers significantly better accuracy on English benchmarks and adds speech translation. If you need multilingual ASR on constrained hardware, Whisper small is lighter. If accuracy and translation matter, Granite Speech wins.

vs. SeamlessM4T v2 (2.3B) — Meta’s model covers more languages and does speech-to-speech translation. Granite Speech 3.3 2B is more accurate on English ASR and runs more efficiently on consumer hardware due to its dense architecture. SeamlessM4T v2 is better for broad multilingual coverage; Granite Speech is better for focused English/major European language workflows.

vs. Granite Speech 3.3 8B — The 8B variant offers higher accuracy but requires significantly more VRAM (12 GB+ at Q4_K_M). The 2B model is the practical choice for running on consumer GPUs and laptops. If you have the hardware, the 8B version benchmarks better. If you need to run on a single consumer GPU, the 2B model is the right call.

The tradeoff is straightforward: Granite Speech 3.3 2B trades some accuracy for deployability. It is not the best ASR model on the market, but it is one of the best you can run locally on a laptop or mid-range GPU. For developers who need speech processing without cloud dependency, that tradeoff is worth evaluating against your accuracy requirements.

Related Models

IBM

Granite Speech 3.3 8B

9BDense

IBM

Granite 4.0 1B Speech

2BDense

Explore the Provider

See all IBM models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every IBM model we track.

Open IBM

Granite Speech model family leaderboard.

Explore the Family

See every Granite Speech release

The full Granite Speech family leaderboard with sizes, benchmark scores, and a release timeline.

Open Granite Speech

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

IBM

Granite Speech 3.3 2B

3B paramsDense

View on Hugging Face Source Code Official Page

Our Take

Best for: Open-source asr workloads

A solid 3B-parameter dense audio model from IBM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters3B

ArchitectureDense

Training Cutoff2024-04

ProviderIBM

Download Size7.9 GB

Community

Monthly Downloads505.4K

Likes55

Last Updated2 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Apache 2.0View Full License

Performance & Scoring

Benchmarks

WER

6.0%

MBA Open Score

58.0BB

Benchmark40%

88.0

Popularity25%

52.2

Efficiency25%

15.2

Versatility10%

60.0

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	2.3 GB
Acer Veriton GN100 AI MiniAcer	SS	2.3 GB
AMD Instinct MI300XAMD	SS	2.3 GB
AMD Instinct MI325XAMD	SS	2.3 GB
AMD Instinct MI355XAMD	SS	2.3 GB
AMD Radeon RX 7600 8GBAMD	SS	2.3 GB
AMD Radeon RX 7700 XTAMD	SS	2.3 GB
AMD Radeon RX 7800 XTAMD	SS	2.3 GB
AMD Radeon RX 7900 XTAMD	SS	2.3 GB
AMD Radeon RX 7900 XTXAMD	SS	2.3 GB
AMD Radeon RX 9070AMD	SS	2.3 GB
AMD Radeon RX 9070 XTAMD	SS	2.3 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	SS	2.3 GB
Apple M4Apple	SS	2.3 GB
Apple M4 Max (40-core GPU)Apple	SS	2.3 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	SS	2.3 GB
Apple M5Apple	SS	2.3 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	SS	2.3 GB
Apple M5 Pro (18-core CPU, 20-core GPU)Apple	SS	2.3 GB
Apple Mac Mini (M1, 2020)Apple	SS	2.3 GB
Apple Mac Mini (M2, 2023)Apple	SS	2.3 GB
Apple Mac Mini (M2 Pro, 2023)Apple	SS	2.3 GB
Apple Mac Mini (M4, 2024)Apple	SS	2.3 GB
Apple Mac Mini (M4 Pro, 2024)Apple	SS	2.3 GB
Apple Mac Studio (M1 Max, 2022)Apple	SS	2.3 GB

Rows per page

Page 1 of 5

Rent in the Cloud

Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

Architecture & Technical Details

Granite Speech 3.3 2B uses a three-component architecture:

Conformer acoustic encoder with block attention and self-conditioning, trained with connectionist temporal classification (CTC)
Windowed query-transformer speech adapter that downsamples acoustic embeddings and maps them to the LLM’s text embedding space
Granite 3.3 2B Instruct as the underlying language model, with LoRA adapters for speech fine-tuning

Capabilities & Use Cases

Granite Speech 3.3 2B handles two primary tasks:

Concrete use cases:

Local call transcription — Process meeting recordings or customer service calls entirely on-premise
Multilingual content pipelines — Transcribe and translate audio in supported languages without cloud API calls
Voice-controlled applications — Feed transcribed text to a local LLM for command interpretation
Accessibility tools — Generate real-time captions for live audio streams
Data preparation — Create transcriptions for fine-tuning other models on proprietary audio data

Running Granite Speech 3.3 2B Locally

This is where Granite Speech 3.3 2B differentiates itself. At 3B parameters in a dense architecture, it fits on hardware that would struggle with 7B+ models.

VRAM Requirements by Quantization

Quantization	Minimum VRAM	Recommended VRAM
FP16	6 GB	8 GB
Q8_0	4 GB	6 GB
Q4_K_M	2.5 GB	4 GB
Q4_0	2 GB	3 GB

For most users, Q4_K_M is the sweet spot. It preserves enough precision for accurate transcription while fitting comfortably on 4 GB VRAM cards.

Consumer Hardware That Works

NVIDIA RTX 3060 (12 GB) — Run Q8_0 or FP16 with headroom for the second-pass language model
NVIDIA RTX 4060 (8 GB) — Q4_K_M or Q8_0, depending on whether you need the language model pass
NVIDIA RTX 4090 (24 GB) — FP16 with room for batch processing and concurrent tasks
Apple M4 Max (36-48 GB unified) — FP16 with substantial headroom
Apple M3 Pro (18 GB) — Q4_K_M comfortably, Q8_0 if you limit batch size
Radeon RX 7600 (8 GB) — Q4_K_M through Vulkan or ROCm
Intel Arc A770 (16 GB) — Q8_0 via OpenVINO or DirectML

Expected Performance

On lower-end hardware like an RTX 3060 at Q4_K_M, expect 20-35 tokens per second for transcription. This is sufficient for batch processing but may introduce latency for interactive use.