Nyra Health

CrisperWhisper

A verbatim-transcription variant of OpenAI Whisper fine-tuned by Nyra Health for fast, precise ASR with crisp word-level timestamps, filler detection ('um', 'uh'), and reduced hallucinations. First place on the OpenASR Leaderboard for verbatim datasets.

1.55B paramsDense

View on Hugging Face Source Code Official Page

Our Take

Best for: Open-source asr workloads

A solid 1.55B-parameter dense audio model from Nyra Health. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters1.55B

ArchitectureDense

ProviderNyra Health

Download Size8.0 GB

Community

Monthly Downloads55.0K

Likes335

Last Updated2 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

CC-BY-NC-4.0View Full License

Performance & Scoring

Benchmarks

WER

6.7%

MBA Open Score

65.0BB

Benchmark40%

86.7

Popularity25%

60.9

Efficiency25%

32.6

Versatility10%

70.0

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	1.4 GB
Acer Veriton GN100 AI MiniAcer	SS	1.4 GB
AMD Instinct MI300XAMD	SS	1.4 GB
AMD Instinct MI325XAMD	SS	1.4 GB
AMD Instinct MI355XAMD	SS	1.4 GB
AMD Radeon RX 7600 8GBAMD	SS	1.4 GB
AMD Radeon RX 7700 XTAMD	SS	1.4 GB
AMD Radeon RX 7800 XTAMD	SS	1.4 GB
AMD Radeon RX 7900 XTAMD	SS	1.4 GB
AMD Radeon RX 7900 XTXAMD	SS	1.4 GB
AMD Radeon RX 9070AMD	SS	1.4 GB
AMD Radeon RX 9070 XTAMD	SS	1.4 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	SS	1.4 GB
Apple M4Apple	SS	1.4 GB
Apple M4 Max (40-core GPU)Apple	SS	1.4 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	SS	1.4 GB
Apple M5Apple	SS	1.4 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	SS	1.4 GB
Apple M5 Pro (18-core CPU, 20-core GPU)Apple	SS	1.4 GB
Apple Mac Mini (M1, 2020)Apple	SS	1.4 GB
Apple Mac Mini (M2, 2023)Apple	SS	1.4 GB
Apple Mac Mini (M2 Pro, 2023)Apple	SS	1.4 GB
Apple Mac Mini (M4, 2024)Apple	SS	1.4 GB
Apple Mac Mini (M4 Pro, 2024)Apple	SS	1.4 GB
Apple Mac Studio (M1 Max, 2022)Apple	SS	1.4 GB

Rows per page

Page 1 of 5

Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 5070 TiVast.ai · On-Demand · 16 GB VRAM	$0.13
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

CrisperWhisper is a fine-tuned variant of OpenAI’s Whisper large-v3, developed by Nyra Health for production-grade automatic speech recognition (ASR) that prioritizes verbatim transcription and precise word-level timestamps. At 1.55 billion parameters, it occupies the same dense architecture as its base model but is trained specifically to capture disfluencies—fillers like “um” and “uh”, stutters, false starts, and pauses—that standard Whisper models typically omit in favor of an “intended” transcription style.

This matters for any workflow that requires an exact record of spoken language: clinical note-taking, legal depositions, interview analysis, or any scenario where the way something is said carries as much weight as what is said. CrisperWhisper holds first place on the OpenASR Leaderboard for verbatim datasets (TED, AMI) and was accepted at INTERSPEECH 2024. It is licensed under CC-BY-NC-4.0, so commercial use requires a separate agreement.

Architecture & Technical Details

CrisperWhisper is a dense transformer model with 1.55B parameters, identical in size to Whisper large-v3. Unlike mixture-of-experts (MoE) architectures that activate only a subset of parameters per token, dense models use all parameters for every forward pass. This means VRAM consumption scales linearly with parameter count—no surprises—and inference speed is predictable across hardware.

Nyra Health made two key modifications to the base Whisper architecture:

Tokenizer adjustment: The original Whisper tokenizer was reworked to preserve filler words and disfluencies in the output vocabulary, enabling the model to generate verbatim transcriptions rather than cleaned-up versions.
Custom attention loss: During fine-tuning, an additional loss term was applied to the cross-attention scores used for dynamic time warping (DTW) alignment. This dramatically improves timestamp accuracy around disfluencies and pauses—a known weak point in standard Whisper.

The model’s context length is not specified by the provider, but it inherits Whisper’s typical 30-second audio chunk processing. For longer audio, you must segment the input or use a sliding window approach (supported in the official transcribe.py script and via Hugging Face transformers).

Capabilities & Use Cases

CrisperWhisper’s core capability is verbatim ASR with crisp word-level timestamps. It transcribes exactly what was said, including:

Fillers: “um”, “uh”, “hmm”
Stutters and false starts: “I-I-I think”
Pauses and hesitations
Overlapping or fragmented speech (within a single speaker segment)

It also reduces transcription hallucinations—spurious words or phrases that Whisper sometimes invents when audio is noisy or ambiguous.

Concrete use cases:

Clinical documentation: Capture patient speech verbatim, including hesitations that may indicate cognitive or emotional state.
Legal and compliance: Verbatim transcripts of depositions or witness statements where filler words and pauses can be relevant.
Media post-production: Accurate alignment of captions with spoken content, especially around natural speech disfluencies.
Conversation analysis: Research workflows that require fine-grained temporal markers for every word.
Voice user interface debugging: Logging exactly what users said, including false starts, to improve intent recognition.

CrisperWhisper supports English and German out of the box. Other languages may work but are not benchmarked.

Running CrisperWhisper Locally

Because CrisperWhisper is a 1.55B dense model, it runs comfortably on consumer hardware. Here’s what you need to know to run it locally.

VRAM requirements

Precision	VRAM (approx.)	Notes
FP32	~6.2 GB	Full precision, slowest
FP16	~3.1 GB	Good balance for most GPUs
INT8 (quantized)	~1.6 GB	Fast, minimal quality loss
Q4_K_M (GGUF)	~1.0 GB	Best for low-VRAM hardware

Minimum hardware: Any GPU with at least 4 GB VRAM (e.g., GTX 1060 6GB, RTX 3050) can run FP16 inference. For real-time or near-real-time performance, an RTX 3060 12GB or better is recommended.

Recommended hardware:

Consumer GPU: RTX 4070 (12GB) or RTX 4090 (24GB) — Q4_K_M quantization achieves <1 second latency per 30-second audio chunk.
Apple Silicon: M4 Max with 48GB unified memory runs FP16 easily; M2/M3 Pro with 16GB also works with quantization.
CPU-only: Possible with GGUF Q4_K_M on a modern 8-core processor, but expect 5–10x slower than GPU.

Performance expectations

Tokens per second: On an RTX 4090 with Q4_K_M, CrisperWhisper processes approximately 200–300 tokens per second (roughly 5–8 seconds of audio per second of wall time). On an M4 Max, expect 150–200 tokens/sec.
Latency per 30-second clip: ~0.5–1.5 seconds on a modern GPU.

Getting started

The quickest way to run CrisperWhisper locally is via the official transcribe.py script from the GitHub repository, or using Hugging Face transformers. For optimized inference, use faster-whisper (CTranslate2) with the CrisperWhisper model weights converted to the ct2 format. Ollama does not currently support CrisperWhisper; the model is not in the Ollama library, but you can run it directly with Python.

1# Using transformers
2from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
3model = AutoModelForSpeechSeq2Seq.from_pretrained("nyrahealth/CrisperWhisper")
4processor = AutoProcessor.from_pretrained("nyrahealth/CrisperWhisper")

For best performance, use torch.float16 and move the model to GPU.

How It Compares

CrisperWhisper competes directly with other Whisper variants at the 1.5B parameter scale. The two most relevant alternatives are:

OpenAI Whisper large-v3 (1.55B): The base model. CrisperWhisper is a drop-in replacement that adds verbatim transcription and sharper timestamps. If you need clean, intended transcription (e.g., for subtitles where fillers are unwanted), stick with vanilla Whisper. If you need exact speech capture, CrisperWhisper is strictly better.
Faster-Whisper (CTranslate2): An optimized inference engine for Whisper models. It supports quantization and batching, making it faster on CPU and low-end GPUs. You can run CrisperWhisper weights through faster-whisper by converting them to CTranslate2 format. The tradeoff: faster-whisper does not include the custom attention loss or tokenizer adjustments, so timestamp accuracy and verbatim output will revert to base Whisper behavior.

When to choose CrisperWhisper: You need word-level timestamps that are reliable even around hesitations and fillers, and you want to minimize hallucinations. It is the best open model for verbatim ASR at this size.

When to choose an alternative: If you only need standard non-verbatim transcription and want maximum speed on CPU, use faster-whisper with Whisper large-v3. If you need multilingual support beyond English and German, stick with vanilla Whisper large-v3 (which covers 99 languages).

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

Nyra Health

CrisperWhisper

1.55B paramsDense

View on Hugging Face Source Code Official Page

Our Take

Best for: Open-source asr workloads

A solid 1.55B-parameter dense audio model from Nyra Health. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters1.55B

ArchitectureDense

ProviderNyra Health

Download Size8.0 GB

Community

Monthly Downloads55.0K

Likes335

Last Updated2 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

CC-BY-NC-4.0View Full License

Performance & Scoring

Benchmarks

WER

6.7%

MBA Open Score

65.0BB

Benchmark40%

86.7

Popularity25%

60.9

Efficiency25%

32.6

Versatility10%

70.0

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	1.4 GB
Acer Veriton GN100 AI MiniAcer	SS	1.4 GB
AMD Instinct MI300XAMD	SS	1.4 GB
AMD Instinct MI325XAMD	SS	1.4 GB
AMD Instinct MI355XAMD	SS	1.4 GB
AMD Radeon RX 7600 8GBAMD	SS	1.4 GB
AMD Radeon RX 7700 XTAMD	SS	1.4 GB
AMD Radeon RX 7800 XTAMD	SS	1.4 GB
AMD Radeon RX 7900 XTAMD	SS	1.4 GB
AMD Radeon RX 7900 XTXAMD	SS	1.4 GB
AMD Radeon RX 9070AMD	SS	1.4 GB
AMD Radeon RX 9070 XTAMD	SS	1.4 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	SS	1.4 GB
Apple M4Apple	SS	1.4 GB
Apple M4 Max (40-core GPU)Apple	SS	1.4 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	SS	1.4 GB
Apple M5Apple	SS	1.4 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	SS	1.4 GB
Apple M5 Pro (18-core CPU, 20-core GPU)Apple	SS	1.4 GB
Apple Mac Mini (M1, 2020)Apple	SS	1.4 GB
Apple Mac Mini (M2, 2023)Apple	SS	1.4 GB
Apple Mac Mini (M2 Pro, 2023)Apple	SS	1.4 GB
Apple Mac Mini (M4, 2024)Apple	SS	1.4 GB
Apple Mac Mini (M4 Pro, 2024)Apple	SS	1.4 GB
Apple Mac Studio (M1 Max, 2022)Apple	SS	1.4 GB

Rows per page

Page 1 of 5

Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 5070 TiVast.ai · On-Demand · 16 GB VRAM	$0.13
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

Architecture & Technical Details

Nyra Health made two key modifications to the base Whisper architecture:

Tokenizer adjustment: The original Whisper tokenizer was reworked to preserve filler words and disfluencies in the output vocabulary, enabling the model to generate verbatim transcriptions rather than cleaned-up versions.
Custom attention loss: During fine-tuning, an additional loss term was applied to the cross-attention scores used for dynamic time warping (DTW) alignment. This dramatically improves timestamp accuracy around disfluencies and pauses—a known weak point in standard Whisper.

Capabilities & Use Cases

CrisperWhisper’s core capability is verbatim ASR with crisp word-level timestamps. It transcribes exactly what was said, including:

Fillers: “um”, “uh”, “hmm”
Stutters and false starts: “I-I-I think”
Pauses and hesitations
Overlapping or fragmented speech (within a single speaker segment)

It also reduces transcription hallucinations—spurious words or phrases that Whisper sometimes invents when audio is noisy or ambiguous.

Concrete use cases:

Clinical documentation: Capture patient speech verbatim, including hesitations that may indicate cognitive or emotional state.
Legal and compliance: Verbatim transcripts of depositions or witness statements where filler words and pauses can be relevant.
Media post-production: Accurate alignment of captions with spoken content, especially around natural speech disfluencies.
Conversation analysis: Research workflows that require fine-grained temporal markers for every word.
Voice user interface debugging: Logging exactly what users said, including false starts, to improve intent recognition.

CrisperWhisper supports English and German out of the box. Other languages may work but are not benchmarked.

Running CrisperWhisper Locally

Because CrisperWhisper is a 1.55B dense model, it runs comfortably on consumer hardware. Here’s what you need to know to run it locally.

VRAM requirements

Precision	VRAM (approx.)	Notes
FP32	~6.2 GB	Full precision, slowest
FP16	~3.1 GB	Good balance for most GPUs
INT8 (quantized)	~1.6 GB	Fast, minimal quality loss
Q4_K_M (GGUF)	~1.0 GB	Best for low-VRAM hardware

Minimum hardware: Any GPU with at least 4 GB VRAM (e.g., GTX 1060 6GB, RTX 3050) can run FP16 inference. For real-time or near-real-time performance, an RTX 3060 12GB or better is recommended.

Recommended hardware:

Consumer GPU: RTX 4070 (12GB) or RTX 4090 (24GB) — Q4_K_M quantization achieves <1 second latency per 30-second audio chunk.
Apple Silicon: M4 Max with 48GB unified memory runs FP16 easily; M2/M3 Pro with 16GB also works with quantization.
CPU-only: Possible with GGUF Q4_K_M on a modern 8-core processor, but expect 5–10x slower than GPU.

Performance expectations

Tokens per second: On an RTX 4090 with Q4_K_M, CrisperWhisper processes approximately 200–300 tokens per second (roughly 5–8 seconds of audio per second of wall time). On an M4 Max, expect 150–200 tokens/sec.
Latency per 30-second clip: ~0.5–1.5 seconds on a modern GPU.

Getting started

1# Using transformers
2from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
3model = AutoModelForSpeechSeq2Seq.from_pretrained("nyrahealth/CrisperWhisper")
4processor = AutoProcessor.from_pretrained("nyrahealth/CrisperWhisper")

For best performance, use torch.float16 and move the model to GPU.

How It Compares

CrisperWhisper competes directly with other Whisper variants at the 1.5B parameter scale. The two most relevant alternatives are:

OpenAI Whisper large-v3 (1.55B): The base model. CrisperWhisper is a drop-in replacement that adds verbatim transcription and sharper timestamps. If you need clean, intended transcription (e.g., for subtitles where fillers are unwanted), stick with vanilla Whisper. If you need exact speech capture, CrisperWhisper is strictly better.
Faster-Whisper (CTranslate2): An optimized inference engine for Whisper models. It supports quantization and batching, making it faster on CPU and low-end GPUs. You can run CrisperWhisper weights through faster-whisper by converting them to CTranslate2 format. The tradeoff: faster-whisper does not include the custom attention loss or tokenizer adjustments, so timestamp accuracy and verbatim output will revert to base Whisper behavior.

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.