IBM

Granite Speech 3.3 2B

IBM's compact 2B-parameter speech-language model for English/multilingual automatic speech recognition (ASR) and speech translation (AST), built by modality-aligning Granite 3.3 2B Instruct with a conformer acoustic encoder.

3B params, Dense


Model Specifications

Parameters: 3B

Architecture: Dense

Training Cutoff: 2024-04

Provider: IBM

Download Size: 7.9 GB

Community

Monthly Downloads: 271.9K

Likes: 53

Last Updated: 17 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

WER: 6.0%
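For context, word error rate (WER) is conventionally the word-level edit distance between a hypothesis and a reference transcript (substitutions, deletions, and insertions) divided by the number of reference words. The page does not say which test sets or text normalization produced the 6.0% figure, so the sketch below only illustrates the metric itself; the function name and sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A WER of 6.0% corresponds to roughly six word-level errors per hundred reference words under whatever normalization the benchmark applies.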

Overall Score: 56.5 (BB)

Benchmark (weight 40%): 88.0

Popularity (weight 25%): 48.0

Efficiency (weight 25%): 13.3

Versatility (weight 10%): 60.0
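The overall score is consistent with a weighted sum of the four component scores. The page does not state the formula, so treat the following as a sketch under that assumption; the names `COMPONENTS` and `weighted_score` are illustrative only.

```python
# (weight, score) pairs copied from the figures listed above.
COMPONENTS = {
    "Benchmark": (0.40, 88.0),
    "Popularity": (0.25, 48.0),
    "Efficiency": (0.25, 13.3),
    "Versatility": (0.10, 60.0),
}


def weighted_score(components: dict[str, tuple[float, float]]) -> float:
    """Weighted sum of component scores; weights are expected to total 1."""
    total_weight = sum(weight for weight, _ in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(weight * score for weight, score in components.values())


overall = weighted_score(COMPONENTS)
print(round(overall, 1))  # 35.2 + 12.0 + 3.325 + 6.0 = 56.525, shown as 56.5
```

Under this assumption, the low Efficiency score contributes only about 3.3 points, while the Benchmark score contributes about 35.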

Hardware Compatibility

See which devices can run this model and at what quality level.

83 devices


Device	Vendor	Tier	Size
Acer Veriton GN100 AI Mini	Acer	SS	2.3 GB
AMD Instinct MI300X	AMD	SS	2.3 GB
AMD Instinct MI325X	AMD	SS	2.3 GB
AMD Instinct MI355X	AMD	SS	2.3 GB
AMD Radeon RX 7600 8GB	AMD	SS	2.3 GB
AMD Radeon RX 7700 XT	AMD	SS	2.3 GB
AMD Radeon RX 7800 XT	AMD	SS	2.3 GB
AMD Radeon RX 7900 XT	AMD	SS	2.3 GB
AMD Radeon RX 7900 XTX	AMD	SS	2.3 GB
AMD Radeon RX 9070	AMD	SS	2.3 GB
AMD Radeon RX 9070 XT	AMD	SS	2.3 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	2.3 GB
Apple M4	Apple	SS	2.3 GB
Apple M4 Max (40-core GPU)	Apple	SS	2.3 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	2.3 GB
Apple M5	Apple	SS	2.3 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	2.3 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	2.3 GB
Apple Mac Mini (M1, 2020)	Apple	SS	2.3 GB
Apple Mac Mini (M2, 2023)	Apple	SS	2.3 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	2.3 GB
Apple Mac Mini (M4, 2024)	Apple	SS	2.3 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	2.3 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	2.3 GB
Apple Mac Studio (M1 Ultra, 2022)	Apple	SS	2.3 GB


About This Model

Granite Speech 3.3 2B

Architecture: A two-pass speech-language model composed of:

Conformer acoustic encoder using block attention with self-conditioning, trained with Connectionist Temporal Classification (CTC).
Speech projector / temporal downsampler: a 2-layer windowed Q-Former operating on blocks of 15 acoustic embeddings (1024 dimensions each), producing a 10 Hz embedding rate (10× total temporal downsampling).
Granite 3.3 2B Instruct LLM backbone with 128k context length.
LoRA adapters (rank 64) on the LLM's query and value projection matrices, activated only in speech mode.
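To make the embedding-rate figure concrete: at 10 Hz, the LLM backbone receives about ten speech embeddings per second of audio. The sketch below works through that arithmetic. The 100 frames-per-second input rate is an inference from the stated numbers (10 Hz × 10), not something the page states, and the function names are hypothetical. Padding and block-boundary effects are ignored.

```python
import math

EMBEDDING_RATE_HZ = 10    # stated: embedding rate delivered to the LLM
TOTAL_DOWNSAMPLING = 10   # stated: total temporal downsampling factor

# Assumption: the acoustic front end runs at 10 Hz * 10 = 100 frames/s.
INPUT_FRAME_RATE_HZ = EMBEDDING_RATE_HZ * TOTAL_DOWNSAMPLING


def _count(duration_s: float, rate_hz: int) -> int:
    """Number of steps at rate_hz covering duration_s seconds of audio."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    # Round first so floating-point noise cannot push ceil() up by one.
    return math.ceil(round(duration_s * rate_hz, 9))


def speech_embeddings(duration_s: float) -> int:
    """Approximate number of speech embeddings passed to the LLM."""
    return _count(duration_s, EMBEDDING_RATE_HZ)


def input_frames(duration_s: float) -> int:
    """Approximate number of acoustic frames before any downsampling."""
    return _count(duration_s, INPUT_FRAME_RATE_HZ)


clip_s = 30.0
print(input_frames(clip_s), "->", speech_embeddings(clip_s))  # 3000 -> 300
```

By this estimate, a 30-second clip occupies about 300 positions in the LLM's input, which is small relative to the 128k context length.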

Modality: Speech-to-text. Operates in two modes: speech mode (encoder + projector + LoRA active for ASR/AST) and text mode (pure Granite 3.3 LLM, preserving safety and text capabilities).

Training: Trained on IBM's Blue Vela cluster (NVIDIA H100) using publicly available ASR/AST corpora plus synthetic data targeted at the speech-translation task. Revision 3.3.2 added multilingual inputs (English, French, German, Spanish, Portuguese) and a deeper acoustic encoder for improved English ASR.

Use cases: Enterprise transcription, English and European-language ASR, English↔X speech translation, and as a building block for downstream Granite text workflows.

Related Models

IBM

Granite Speech 3.3 8B

9B, Dense

IBM

Granite 4.0 1B Speech

2B, Dense

