IBM's compact 2B-parameter speech-language model handles multilingual ASR and bidirectional speech translation. It ranks #1 on the OpenASR multilingual leaderboard (5.52 average WER) while running efficiently on edge devices.
Architecture: A specialized acoustic encoder coupled with the Granite 4.0 1B language backbone.
Modality: Audio-to-text, with a text-only fallback handled by the Granite 4.0 backbone alone. Supports speculative decoding for faster inference.
Training: Trained for 30 days on 8 H100 GPUs on IBM's Blue Vela cluster (26 days for the acoustic encoder, 4 for the projector), using public ASR/AST corpora plus synthetic data targeting Japanese ASR, keyword-biased ASR, and speech translation.
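Since the card mentions speculative decoding, a toy sketch may help illustrate the idea: a cheap draft model proposes several tokens at once, and the expensive target model verifies them, accepting the matching prefix and correcting the first mismatch. This is a deliberately simplified, hypothetical illustration with deterministic toy "models" over integer tokens, not the model's actual inference code.

```python
# Toy illustration of speculative decoding (hypothetical simplification).
# Both "models" here are deterministic next-token functions over integers.

def draft_model(prefix):
    # Cheap draft: next token is (last token + 1) mod 10.
    return (prefix[-1] + 1) % 10

def target_model(prefix):
    # Expensive target: same rule, except it maps a proposed 5 to 7,
    # so draft and target occasionally disagree.
    nxt = (prefix[-1] + 1) % 10
    return 7 if nxt == 5 else nxt

def speculative_decode(prefix, n_tokens, k=4):
    """Generate n_tokens; the draft proposes k tokens at a time, the
    target accepts the matching prefix and corrects the first mismatch."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        # Draft proposes k tokens autoregressively (the cheap pass).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies each proposed token in turn (the expensive pass,
        # which in a real model scores all k positions in one forward call).
        for t in proposal:
            expected = target_model(out)
            if t == expected:
                out.append(t)          # accepted draft token
            else:
                out.append(expected)   # correction from the target model
                break
            if len(out) - len(prefix) >= n_tokens:
                break
    return out[len(prefix):]
```

By construction the output is identical to decoding with the target model alone; the speedup in real systems comes from verifying all k draft tokens in a single target forward pass instead of k sequential ones.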
Capabilities & use cases: