IBM's compact 2B-parameter speech-language model for English/multilingual automatic speech recognition (ASR) and speech translation (AST), built by modality-aligning Granite 3.3 2B Instruct with a conformer acoustic encoder.
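Architecture: A two-pass speech-language model composed of a conformer acoustic encoder, a projector that connects the encoder to the language model, and the Granite 3.3 2B Instruct LLM with LoRA adapters.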
Modality: Speech-to-text. Operates in two modes: speech mode (encoder + projector + LoRA active for ASR/AST) and text mode (pure Granite 3.3 LLM, preserving safety and text capabilities).
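The two modes can be pictured as a simple routing decision. The following is a minimal sketch, not IBM's implementation: it assumes the mode is chosen by whether audio is supplied, and the function name `active_components` and the component labels are hypothetical, introduced only for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical labels for the parts named in the description above.
SPEECH_MODE: Tuple[str, ...] = ("encoder", "projector", "lora", "llm")
TEXT_MODE: Tuple[str, ...] = ("llm",)


def active_components(audio: Optional[Sequence[float]]) -> Tuple[str, ...]:
    """Return which components take part in a forward pass.

    Assumption: non-empty audio selects speech mode (ASR/AST);
    otherwise the base LLM runs alone, without the LoRA adapters.
    """
    if audio is not None and len(audio) > 0:
        return SPEECH_MODE
    return TEXT_MODE


if __name__ == "__main__":
    print(active_components([0.0, 0.1, -0.1]))  # speech mode
    print(active_components(None))              # text mode
```

Because text mode bypasses the adapters in this sketch, the base model's behavior on text-only prompts is unchanged, which is the property the description attributes to text mode.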
Training: Trained on IBM's Blue Vela cluster (NVIDIA H100) using publicly available ASR/AST corpora plus synthetic data targeted at the speech-translation task. Revision 3.3.2 added multilingual inputs (English, French, German, Spanish, Portuguese) and a deeper acoustic encoder for improved English ASR.
Use cases: Enterprise transcription, English and European-language ASR, English↔X speech translation, and as a building block for downstream Granite text workflows.