IBM's compact 2B-parameter speech-language model for English/multilingual automatic speech recognition (ASR) and speech translation (AST), built by modality-aligning Granite 3.3 2B Instruct with a conformer acoustic encoder.
A 3B-parameter dense audio model from IBM. Treat the modality benchmarks above as the leading indicator of fit, since composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
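Architecture: A two-pass speech-language model composed of a conformer acoustic encoder, a speech projector, and the Granite 3.3 2B Instruct LLM with LoRA adapters.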
Modality: Speech-to-text. Operates in two modes: speech mode (encoder + projector + LoRA active for ASR/AST) and text mode (pure Granite 3.3 LLM, preserving safety and text capabilities).
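The two modes can be pictured with a small Python sketch. It loads no real model and uses no Granite API. `Request`, `active_components`, `two_pass`, and `fake_model` are hypothetical names made up for illustration. It also assumes that "two-pass" means a first call in speech mode to produce a transcript and a second call in text mode to process it, which the description above implies but does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Component names follow the description above; the grouping is illustrative.
SPEECH_COMPONENTS: Tuple[str, ...] = ("encoder", "projector", "lora", "llm")
TEXT_COMPONENTS: Tuple[str, ...] = ("llm",)


@dataclass
class Request:
    prompt: str
    audio: Optional[bytes] = None  # None means a text-only request


def active_components(has_audio: bool) -> Tuple[str, ...]:
    """Speech mode uses encoder + projector + LoRA on top of the LLM; text mode uses the LLM alone."""
    return SPEECH_COMPONENTS if has_audio else TEXT_COMPONENTS


def two_pass(audio: bytes, follow_up: str, run_model: Callable[[Request], str]) -> str:
    """Assumed two-pass flow: pass 1 transcribes audio, pass 2 sends the transcript as plain text."""
    transcript = run_model(Request(prompt="Transcribe the speech.", audio=audio))
    return run_model(Request(prompt=f"{follow_up}\n\n{transcript}"))


def fake_model(req: Request) -> str:
    """Hypothetical stand-in for inference: echoes the mode and components instead of generating text."""
    has_audio = req.audio is not None
    mode = "speech" if has_audio else "text"
    return f"[{mode}:{'+'.join(active_components(has_audio))}] {req.prompt}"


if __name__ == "__main__":
    print(two_pass(b"\x00\x01", "Summarize the transcript:", fake_model))
```

Passing the model in as a callable keeps the routing logic separate from inference, so the stub can later be swapped for a real inference function without changing `two_pass`.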
Training: Trained on IBM's Blue Vela cluster (NVIDIA H100) using publicly available ASR/AST corpora plus synthetic data targeted at the speech-translation task. Revision 3.3.2 added multilingual inputs (English, French, German, Spanish, Portuguese) and a deeper acoustic encoder for improved English ASR.
Use cases: Enterprise transcription, English and European-language ASR, English↔X speech translation, and as a building block for downstream Granite text workflows.