IBM's compact 2B-parameter speech-language model for multilingual ASR and bidirectional speech translation, ranked #1 on the OpenASR multilingual leaderboard (5.52 average WER) while running efficiently on edge devices.
A solid 2B-parameter dense audio model from IBM. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Architecture: A specialized acoustic encoder coupled with the Granite 4.0 1B language backbone.
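To make the coupling concrete, here is a minimal PyTorch sketch of that layout: an acoustic encoder turns mel features into embeddings, a projector maps them into the language model's embedding space, and the projected frames are consumed alongside text tokens by the backbone. All dimensions, layer counts, and the SpeechLMSketch name are illustrative assumptions, not the model's actual configuration.

```python
import torch
import torch.nn as nn

class SpeechLMSketch(nn.Module):
    """Illustrative encoder -> projector -> language-backbone layout (not the real config)."""

    def __init__(self, n_mels=80, enc_dim=512, lm_dim=2048, vocab_size=49152):
        super().__init__()
        # Stand-in for the specialized acoustic encoder (a small Transformer here).
        self.frame_proj = nn.Linear(n_mels, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True), num_layers=2
        )
        # Projector that maps acoustic embeddings into the LM's embedding space.
        self.projector = nn.Linear(enc_dim, lm_dim)
        # Stand-in for the Granite 4.0 1B language backbone.
        self.lm_embed = nn.Embedding(vocab_size, lm_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(lm_dim, nhead=16, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, mel: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        # mel: (batch, audio_frames, n_mels); text_ids: (batch, text_len)
        audio_emb = self.projector(self.encoder(self.frame_proj(mel)))
        text_emb = self.lm_embed(text_ids)
        hidden = self.backbone(torch.cat([audio_emb, text_emb], dim=1))
        return self.lm_head(hidden)  # next-token logits over audio + text positions

# Shape check only; random tensors stand in for log-mel features and prompt tokens.
logits = SpeechLMSketch()(torch.randn(1, 200, 80), torch.randint(0, 49152, (1, 16)))
```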
Modality: Audio-to-text, with a text-only fallback handled directly by the Granite 4.0 backbone. Supports speculative decoding for faster inference.
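Both behaviors map onto standard Hugging Face transformers calls. The sketch below assumes the Granite 4.0 backbone and a smaller draft model are available as standalone causal LMs; the repo IDs are placeholders rather than confirmed names, and using assisted generation as the speculative-decoding mechanism here is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo IDs: substitute the actual backbone / draft model names.
backbone_id = "ibm-granite/granite-4.0-1b"
draft_id = "ibm-granite/granite-4.0-350m"

tokenizer = AutoTokenizer.from_pretrained(backbone_id)
model = AutoModelForCausalLM.from_pretrained(backbone_id, torch_dtype=torch.bfloat16)
# Speculative (assisted) decoding needs a draft model that shares the target's tokenizer.
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16)

# Text-only fallback: a plain prompt goes straight through the language backbone, no audio.
inputs = tokenizer("Translate to French: The meeting starts at noon.", return_tensors="pt")

# The draft model proposes several tokens per step and the backbone verifies them,
# speeding up decoding without changing the generated output.
output_ids = model.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```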
Training: 30 days on IBM's Blue Vela cluster using 8 H100 GPUs (26 days for the encoder, 4 for the projector), on public ASR/AST corpora plus synthetic data targeting Japanese ASR, keyword-biased ASR, and speech translation.