
A verbatim-transcription variant of OpenAI Whisper fine-tuned by Nyra Health for fast, precise ASR with crisp word-level timestamps, filler detection ('um', 'uh'), and reduced hallucinations. First place on the OpenASR Leaderboard for verbatim datasets.
A solid 1.55B-parameter dense audio model from Nyra Health. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.10 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 5070 TiVast.ai · On-Demand · 16 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
CrisperWhisper is a fine-tuned variant of OpenAI’s Whisper large-v3, developed by Nyra Health for production-grade automatic speech recognition (ASR) that prioritizes verbatim transcription and precise word-level timestamps. At 1.55 billion parameters, it occupies the same dense architecture as its base model but is trained specifically to capture disfluencies—fillers like “um” and “uh”, stutters, false starts, and pauses—that standard Whisper models typically omit in favor of an “intended” transcription style.
This matters for any workflow that requires an exact record of spoken language: clinical note-taking, legal depositions, interview analysis, or any scenario where the way something is said carries as much weight as what is said. CrisperWhisper holds first place on the OpenASR Leaderboard for verbatim datasets (TED, AMI) and was accepted at INTERSPEECH 2024. It is licensed under CC-BY-NC-4.0, so commercial use requires a separate agreement.
CrisperWhisper is a dense transformer model with 1.55B parameters, identical in size to Whisper large-v3. Unlike mixture-of-experts (MoE) architectures that activate only a subset of parameters per token, dense models use all parameters for every forward pass. This means VRAM consumption scales linearly with parameter count—no surprises—and inference speed is predictable across hardware.
Nyra Health made two key modifications to the base Whisper architecture:
The model’s context length is not specified by the provider, but it inherits Whisper’s typical 30-second audio chunk processing. For longer audio, you must segment the input or use a sliding window approach (supported in the official transcribe.py script and via Hugging Face transformers).
CrisperWhisper’s core capability is verbatim ASR with crisp word-level timestamps. It transcribes exactly what was said, including:
It also reduces transcription hallucinations—spurious words or phrases that Whisper sometimes invents when audio is noisy or ambiguous.
Concrete use cases:
CrisperWhisper supports English and German out of the box. Other languages may work but are not benchmarked.
Because CrisperWhisper is a 1.55B dense model, it runs comfortably on consumer hardware. Here’s what you need to know to run it locally.
| Precision | VRAM (approx.) | Notes |
|---|---|---|
| FP32 | ~6.2 GB | Full precision, slowest |
| FP16 | ~3.1 GB | Good balance for most GPUs |
| INT8 (quantized) | ~1.6 GB | Fast, minimal quality loss |
| Q4_K_M (GGUF) | ~1.0 GB | Best for low-VRAM hardware |
Minimum hardware: Any GPU with at least 4 GB VRAM (e.g., GTX 1060 6GB, RTX 3050) can run FP16 inference. For real-time or near-real-time performance, an RTX 3060 12GB or better is recommended.
Recommended hardware:
The quickest way to run CrisperWhisper locally is via the official transcribe.py script from the GitHub repository, or using Hugging Face transformers. For optimized inference, use faster-whisper (CTranslate2) with the CrisperWhisper model weights converted to the ct2 format. Ollama does not currently support CrisperWhisper; the model is not in the Ollama library, but you can run it directly with Python.
1# Using transformers2from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor3model = AutoModelForSpeechSeq2Seq.from_pretrained("nyrahealth/CrisperWhisper")4processor = AutoProcessor.from_pretrained("nyrahealth/CrisperWhisper")
For best performance, use torch.float16 and move the model to GPU.
CrisperWhisper competes directly with other Whisper variants at the 1.5B parameter scale. The two most relevant alternatives are:
When to choose CrisperWhisper: You need word-level timestamps that are reliable even around hesitations and fillers, and you want to minimize hallucinations. It is the best open model for verbatim ASR at this size.
When to choose an alternative: If you only need standard non-verbatim transcription and want maximum speed on CPU, use faster-whisper with Whisper large-v3. If you need multilingual support beyond English and German, stick with vanilla Whisper large-v3 (which covers 99 languages).