IBM

Granite Speech 3.3 8B

IBM's flagship 8B-parameter speech-language model for high-accuracy ASR and speech translation, modality-aligning Granite 3.3 8B Instruct with a conformer encoder for state-of-the-art English transcription among open models.

9B paramsDense

View on Hugging Face Source Code Official Page

Our Take

Best for: Open-source asr workloads

A solid 9B-parameter dense audio model from IBM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters9B

ArchitectureDense

Training Cutoff2024-04

ProviderIBM

Download Size20.1 GB

Community

Monthly Downloads191.2K

Likes171

Last Updated2 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Apache 2.0View Full License

Performance & Scoring

Benchmarks

WER

5.7%

MBA Open Score

56.0BB

Benchmark40%

88.5

Popularity25%

54.1

Efficiency25%

4.3

Versatility10%

60.0

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	6.0 GB
Acer Veriton GN100 AI MiniAcer	SS	6.0 GB
AMD Instinct MI300XAMD	SS	6.0 GB
AMD Instinct MI325XAMD	SS	6.0 GB
AMD Instinct MI355XAMD	SS	6.0 GB
AMD Radeon RX 7700 XTAMD	SS	6.0 GB
AMD Radeon RX 7800 XTAMD	SS	6.0 GB
AMD Radeon RX 7900 XTAMD	SS	6.0 GB
AMD Radeon RX 7900 XTXAMD	SS	6.0 GB
AMD Radeon RX 9070AMD	SS	6.0 GB
AMD Radeon RX 9070 XTAMD	SS	6.0 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	SS	6.0 GB
Apple M4Apple	SS	6.0 GB
Apple M4 Max (40-core GPU)Apple	SS	6.0 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	SS	6.0 GB
Apple M5Apple	SS	6.0 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	SS	6.0 GB
Apple M5 Pro (18-core CPU, 20-core GPU)Apple	SS	6.0 GB
Apple Mac Mini (M1, 2020)Apple	SS	6.0 GB
Apple Mac Mini (M2, 2023)Apple	SS	6.0 GB
Apple Mac Mini (M2 Pro, 2023)Apple	SS	6.0 GB
Apple Mac Mini (M4, 2024)Apple	SS	6.0 GB
Apple Mac Mini (M4 Pro, 2024)Apple	SS	6.0 GB
Apple Mac Studio (M1 Max, 2022)Apple	SS	6.0 GB
Apple Mac Studio (M1 Ultra, 2022)Apple	SS	6.0 GB

Rows per page

Page 1 of 5

Rent in the Cloud

Cheapest current cloud rentals with at least 6 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

IBM’s Granite Speech 3.3 8B is a speech-language model that performs automatic speech recognition (ASR) and automatic speech translation (AST) in a compact 9B-parameter dense architecture. It is built by modality-aligning the existing Granite 3.3 8B Instruct text model with a conformer acoustic encoder, trained on publicly available datasets. The result is a state‑of‑the‑art open‑source English ASR model that also handles multilingual speech input (English, French, German, Spanish, Portuguese) and translates between those languages and English.

Granite Speech operates as a two‑pass system: the first pass transcribes audio to text; the second pass invokes the underlying Granite language model for tasks like translation, summarization, or question‑answering on the transcribed text. This explicit decoupling makes it straightforward to debug and integrate, and it preserves Granite’s original text capabilities (including safety alignments) when used in text-only mode.

Positioned against models like Whisper large‑v3 and SeamlessM4T‑v2, Granite Speech 3.3 8B matches or exceeds them on English ASR despite training on orders of magnitude less proprietary data. Its Apache 2.0 license and open release make it a strong candidate for local, privacy‑sensitive deployments.

Architecture & Technical Details

Granite Speech 3.3 8B is a dense model with 9B total parameters—there are no mixture‑of‑experts gimmicks. For local inference, this means VRAM consumption scales linearly with the full parameter count rather than with a smaller active subset, but the architecture is straightforward to quantize and run on consumer hardware.

The speech‑specific components are:

Conformer acoustic encoder – uses block attention and self‑conditioning with connectionist temporal classification (CTC) loss to produce audio frame embeddings.
Windowed query‑transformer speech adapter – downsamples acoustic embeddings temporally and maps them into the text embedding space of the base language model.
LoRA adapters – additional low‑rank adapters tuned during training to further align the language model for speech tasks.

In speech mode, the encoder, projector, and LoRA adapters are active. In text mode, those components are bypassed and the core Granite 3.3 8B Instruct runs directly (without LoRA), preserving its original text‑based reasoning and safety.

Context length is not officially specified, but the underlying Granite 3.3 8B supports at least 8K tokens. For ASR, output length is roughly proportional to audio duration; for AST, the translated text typically adds minimal overhead. Practitioners planning long‑form transcription should test with their own audio lengths using available quantized versions.

Capabilities & Use Cases

Granite Speech 3.3 8B excels at two tasks:

English automatic speech recognition – The model’s primary design goal. Benchmarks (LibriSpeech, Common Voice, etc.) show it outperforms several other open and proprietary ASR systems of similar size. Revision 3.3.2 uses a deeper acoustic encoder and additional data, further reducing word‑error rate.

Speech translation – Input can be in English, French, German, Spanish, or Portuguese. Output translation is to English (X‑En) or from English to those same languages (En‑X). Translation quality is competitive with dedicated AST models trained on far larger datasets.

Concrete use cases:

Privacy‑first transcription – run entirely offline for medical, legal, or enterprise meetings.
Local voice assistants – transcribe user speech, then pass the text to a local LLM for intent parsing or action generation.
Multilingual content pipelines – transcribe podcasts or webinars in French or German, then translate the transcript to English for further analysis.
Low‑latency translation – real‑time subtitle generation for video conferencing or live events, provided you have enough GPU headroom.

The two‑pass design means you must explicitly call the language model for translation after the ASR call. This is not a limitation for most batch workflows but is something to account for in streaming or real‑time applications.

Running Granite Speech 3.3 8B Locally

VRAM & Hardware Requirements

At full FP16 precision, a 9B parameter model requires roughly 18 GB of VRAM (9B × 2 bytes). Add overhead for the encoder and adapter weights, and you should plan for 20–24 GB of GPU memory for unaudited inference.

Quantization brings this into reach of consumer hardware:

Quantization	VRAM (approx.)	Quality tradeoff
Q4_K_M	6–7 GB	Negligible ASR accuracy loss
Q5_K_M	8–9 GB	Near‑FP16 quality
Q8_0	11–12 GB	Minimal loss
FP16	20–24 GB	Reference quality

Recommended quantization for most users: Q4_K_M. It fits comfortably on an 8 GB RTX 3070/4060 Ti or an Apple M1/M2 with unified memory above 16 GB. For optimal speed on an RTX 4090 (24 GB), try Q8_0 or Q5_K_M to maximize token generation rate while staying under VRAM limits.

Consumer GPUs that can run it:

NVIDIA RTX 3090/4090 (24 GB) – can run Q8_0 or even FP16.
RTX 3080 / 4070 Ti (12–16 GB) – Q4_K_M works, though batch size may need to be 1.
Apple Mac with M2 Ultra / M3 Max (at least 32 GB unified memory) – runs Q4_K_M natively via llama.cpp.
Lower‑end cards (8 GB) – Q4_K_M is possible but may max out VRAM, especially with long audio inputs.

Expected Performance (tokens per second)

Speed depends on audio length and quantization. ASR token output is typically short (a few hundred tokens per minute of speech), so throughput is less critical than in text generation. On an RTX 4090 with Q5_K_M, expect 80–120 tokens/second for the language model pass. The acoustic encoder (conformer) adds about 0.1–0.3 seconds per file for typical lengths (30 seconds of speech). Total end‑to‑end time for a 5‑minute audio file is usually under 2 seconds on fast GPUs.

Quickstart

The easiest way to run locally is via Ollama (if the model community ports it) or directly using llama.cpp with quantization support. Example command using llama-server:

1./llama-server -m granite-speech-3.3-8b-Q4_K_M.gguf --no-mmap --ngl 35

Then send audio as base64 or via a file endpoint. See the model’s GitHub repository for inference scripts in Jupyter notebooks.

How It Compares

Model	Parameters	Modality	Strengths
Granite Speech 3.3 8B	9B	ASR + AST	Top‑tier English ASR, Apache‑2.0, open data
Whisper large‑v3	1.5B	ASR + AST	Multilingual (100+ languages), larger model
SeamlessM4T‑v2	2.3B	ASR + AST + TTS	Many‑to‑many translation, but larger and heavier

Choose Granite Speech 3.3 8B when:

You need the best English ASR accuracy for a given parameter count.
You want full Apache 2.0 license without restrictions on commercial use.
You plan to run on a single consumer GPU (e.g., 12–24 GB) and can tolerate a slightly larger model size than Whisper.

Choose Whisper large‑v3 when:

You require speech recognition for many languages (beyond the five that Granite supports).
You need a smaller model footprint (1.5B vs 9B) and don’t need translation.

Choose SeamlessM4T‑v2 when:

You need built‑in text‑to‑speech or many‑to‑many translation (e.g., Japanese to French).
You’re willing to accept a more complex pipeline with larger total memory requirements.

Granite Speech’s two‑pass design means you cannot run it as a single end‑to‑end model for translated speech output in one call; you must chain the ASR and then the LLM call. This is a minor concession for the benefit of a clearly separated, debuggable pipeline.

Related Models

IBM

Granite Speech 3.3 2B

3BDense

IBM

Granite 4.0 1B Speech

2BDense

Explore the Provider

See all IBM models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every IBM model we track.

Open IBM

Granite Speech model family leaderboard.

Explore the Family

See every Granite Speech release

The full Granite Speech family leaderboard with sizes, benchmark scores, and a release timeline.

Open Granite Speech

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

IBM

Granite Speech 3.3 8B

9B paramsDense

View on Hugging Face Source Code Official Page

Our Take

Best for: Open-source asr workloads

A solid 9B-parameter dense audio model from IBM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters9B

ArchitectureDense

Training Cutoff2024-04

ProviderIBM

Download Size20.1 GB

Community

Monthly Downloads191.2K

Likes171

Last Updated2 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Apache 2.0View Full License

Performance & Scoring

Benchmarks

WER

5.7%

MBA Open Score

56.0BB

Benchmark40%

88.5

Popularity25%

54.1

Efficiency25%

4.3

Versatility10%

60.0

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	6.0 GB
Acer Veriton GN100 AI MiniAcer	SS	6.0 GB
AMD Instinct MI300XAMD	SS	6.0 GB
AMD Instinct MI325XAMD	SS	6.0 GB
AMD Instinct MI355XAMD	SS	6.0 GB
AMD Radeon RX 7700 XTAMD	SS	6.0 GB
AMD Radeon RX 7800 XTAMD	SS	6.0 GB
AMD Radeon RX 7900 XTAMD	SS	6.0 GB
AMD Radeon RX 7900 XTXAMD	SS	6.0 GB
AMD Radeon RX 9070AMD	SS	6.0 GB
AMD Radeon RX 9070 XTAMD	SS	6.0 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	SS	6.0 GB
Apple M4Apple	SS	6.0 GB
Apple M4 Max (40-core GPU)Apple	SS	6.0 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	SS	6.0 GB
Apple M5Apple	SS	6.0 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	SS	6.0 GB
Apple M5 Pro (18-core CPU, 20-core GPU)Apple	SS	6.0 GB
Apple Mac Mini (M1, 2020)Apple	SS	6.0 GB
Apple Mac Mini (M2, 2023)Apple	SS	6.0 GB
Apple Mac Mini (M2 Pro, 2023)Apple	SS	6.0 GB
Apple Mac Mini (M4, 2024)Apple	SS	6.0 GB
Apple Mac Mini (M4 Pro, 2024)Apple	SS	6.0 GB
Apple Mac Studio (M1 Max, 2022)Apple	SS	6.0 GB
Apple Mac Studio (M1 Ultra, 2022)Apple	SS	6.0 GB

Rows per page

Page 1 of 5

Rent in the Cloud

Cheapest current cloud rentals with at least 6 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

Architecture & Technical Details

The speech‑specific components are:

Conformer acoustic encoder – uses block attention and self‑conditioning with connectionist temporal classification (CTC) loss to produce audio frame embeddings.
Windowed query‑transformer speech adapter – downsamples acoustic embeddings temporally and maps them into the text embedding space of the base language model.
LoRA adapters – additional low‑rank adapters tuned during training to further align the language model for speech tasks.

Capabilities & Use Cases

Granite Speech 3.3 8B excels at two tasks:

English automatic speech recognition – The model’s primary design goal. Benchmarks (LibriSpeech, Common Voice, etc.) show it outperforms several other open and proprietary ASR systems of similar size. Revision 3.3.2 uses a deeper acoustic encoder and additional data, further reducing word‑error rate.

Speech translation – Input can be in English, French, German, Spanish, or Portuguese. Output translation is to English (X‑En) or from English to those same languages (En‑X). Translation quality is competitive with dedicated AST models trained on far larger datasets.

Concrete use cases:

Privacy‑first transcription – run entirely offline for medical, legal, or enterprise meetings.
Local voice assistants – transcribe user speech, then pass the text to a local LLM for intent parsing or action generation.
Multilingual content pipelines – transcribe podcasts or webinars in French or German, then translate the transcript to English for further analysis.
Low‑latency translation – real‑time subtitle generation for video conferencing or live events, provided you have enough GPU headroom.

Running Granite Speech 3.3 8B Locally

VRAM & Hardware Requirements

Quantization brings this into reach of consumer hardware:

Quantization	VRAM (approx.)	Quality tradeoff
Q4_K_M	6–7 GB	Negligible ASR accuracy loss
Q5_K_M	8–9 GB	Near‑FP16 quality
Q8_0	11–12 GB	Minimal loss
FP16	20–24 GB	Reference quality

Consumer GPUs that can run it:

NVIDIA RTX 3090/4090 (24 GB) – can run Q8_0 or even FP16.
RTX 3080 / 4070 Ti (12–16 GB) – Q4_K_M works, though batch size may need to be 1.
Apple Mac with M2 Ultra / M3 Max (at least 32 GB unified memory) – runs Q4_K_M natively via llama.cpp.
Lower‑end cards (8 GB) – Q4_K_M is possible but may max out VRAM, especially with long audio inputs.

Expected Performance (tokens per second)

Quickstart

The easiest way to run locally is via Ollama (if the model community ports it) or directly using llama.cpp with quantization support. Example command using llama-server:

1./llama-server -m granite-speech-3.3-8b-Q4_K_M.gguf --no-mmap --ngl 35

Then send audio as base64 or via a file endpoint. See the model’s GitHub repository for inference scripts in Jupyter notebooks.

How It Compares

Model	Parameters	Modality	Strengths
Granite Speech 3.3 8B	9B	ASR + AST	Top‑tier English ASR, Apache‑2.0, open data
Whisper large‑v3	1.5B	ASR + AST	Multilingual (100+ languages), larger model
SeamlessM4T‑v2	2.3B	ASR + AST + TTS	Many‑to‑many translation, but larger and heavier

Choose Granite Speech 3.3 8B when:

You need the best English ASR accuracy for a given parameter count.
You want full Apache 2.0 license without restrictions on commercial use.
You plan to run on a single consumer GPU (e.g., 12–24 GB) and can tolerate a slightly larger model size than Whisper.

Choose Whisper large‑v3 when:

You require speech recognition for many languages (beyond the five that Granite supports).
You need a smaller model footprint (1.5B vs 9B) and don’t need translation.

Choose SeamlessM4T‑v2 when:

You need built‑in text‑to‑speech or many‑to‑many translation (e.g., Japanese to French).
You’re willing to accept a more complex pipeline with larger total memory requirements.