Miso One (Miso TTS 8B) is an open-weights, English text-to-speech model from Miso Labs built for expressive, emotional delivery. It has about 8.2B parameters and follows a Sesame CSM-style design, pairing a Llama-8B backbone with a smaller Llama-300M audio decoder that produces Mimi audio codes. Miso Labs reports 110 ms time-to-first-byte on its hosted API and supports one-shot voice cloning from a short reference clip. Weights and inference code ship under a Modified MIT License, with a public API listed as coming soon.
A situational 8.2B-parameter dense audio model from Miso Labs. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing. Newly released, so production-readiness is still being shaken out.
No benchmark data available for this model yet.
Cheapest current cloud rentals with at least 6 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 3090 | Vast.ai | Spot | 24 GB | $0.07 |
| NVIDIA GeForce RTX 5060 Ti | Vast.ai | Spot | 16 GB | $0.08 |
| NVIDIA GeForce RTX 5060 Ti | Vast.ai | On-Demand | 16 GB | $0.08 |
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3090 | Vast.ai | On-Demand | 24 GB | $0.12 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
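To make the spot versus on-demand comparison concrete, here is a minimal sketch that converts a listed rate into a cost per hour of useful work. The two RTX 3090 rates come from the table above; the `wasted_fraction` parameter (the share of paid time lost to interruptions and restarts) and the 20% figure are illustrative assumptions, not measurements.

```python
def effective_hourly_cost(rate_per_hour: float, wasted_fraction: float) -> float:
    """Cost per hour of useful work when some paid time is lost to restarts."""
    if not 0.0 <= wasted_fraction < 1.0:
        raise ValueError("wasted_fraction must be in [0, 1)")
    return rate_per_hour / (1.0 - wasted_fraction)


def break_even_wasted_fraction(spot_rate: float, on_demand_rate: float) -> float:
    """Wasted fraction at which spot and on-demand cost the same per useful hour."""
    return 1.0 - spot_rate / on_demand_rate


# RTX 3090 rates from the table above (USD per GPU-hour).
SPOT_3090 = 0.07
ON_DEMAND_3090 = 0.12

if __name__ == "__main__":
    # 0.20 is an assumed, illustrative waste level.
    cost = effective_hourly_cost(SPOT_3090, 0.20)
    limit = break_even_wasted_fraction(SPOT_3090, ON_DEMAND_3090)
    print(f"spot at 20% waste: ${cost:.4f} per useful hour")
    print(f"break-even waste:  {limit:.1%}")
```

Under these assumptions, the spot RTX 3090 stays cheaper than on-demand until roughly 42% of paid time is lost to interruptions.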
Miso One (Miso TTS 8B) is an open-weights text-to-speech model from Miso Labs built for expressive, emotionally varied English speech. At 8.2B parameters, it’s one of the largest open TTS models available, and it targets a specific gap in the open-source stack: natural conversational delivery with genuine emotional range, not just clean audio.
Most open TTS models prioritize low latency and small footprints at the cost of prosody. Miso One takes the opposite approach—throwing parameters at the problem via a Sesame CSM-style architecture that pairs a large language backbone with a dedicated audio decoder. The result is a model that can shift tone, pacing, and affect based on text content without explicit markup or pitch tuning.
Miso Labs reports 110 ms time-to-first-byte on their hosted API, and the model supports one-shot voice cloning from a short reference clip. Weights and inference code ship under a Modified MIT License, with a public API listed as coming soon. For practitioners evaluating local TTS for voice agents, conversational interfaces, or content generation, Miso One is currently the most serious open contender for emotive speech.
Miso One uses a Sesame CSM (Conversational Speech Model) architecture with two transformer components: a Llama-8B backbone and a smaller Llama-300M audio decoder.
The model generates Mimi audio codes—32 codebooks per frame, with codebook 0 predicted from the backbone hidden state and codebooks 1–31 predicted autoregressively by the audio decoder. The text vocabulary is 128,256 tokens, the audio vocabulary is 2,051 tokens, and the maximum sequence length is 2,048.
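The split between the two components can be sketched as a toy generation loop. This is not the MisoTTS implementation: both predictor functions below are hypothetical stand-ins that sample random codes, and only the codebook count and audio vocabulary size are taken from the text. The point is the control flow: one backbone step per frame for codebook 0, then 31 decoder steps for the remaining codebooks.

```python
import random

NUM_CODEBOOKS = 32        # from the text: 32 Mimi codebooks per frame
AUDIO_VOCAB_SIZE = 2051   # from the text: audio vocabulary size


def backbone_predict_codebook0(history, rng):
    """Hypothetical stand-in for the backbone: codebook 0 of the next frame."""
    return rng.randrange(AUDIO_VOCAB_SIZE)


def decoder_predict_next_code(history, partial_frame, rng):
    """Hypothetical stand-in for the audio decoder: the next codebook entry,
    conditioned (in a real model) on the codes already chosen for this frame."""
    return rng.randrange(AUDIO_VOCAB_SIZE)


def generate_frame(history, rng):
    # Codebook 0 comes from the backbone, codebooks 1-31 from the decoder.
    frame = [backbone_predict_codebook0(history, rng)]
    while len(frame) < NUM_CODEBOOKS:
        frame.append(decoder_predict_next_code(history, frame, rng))
    return frame


def generate_frames(num_frames, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(num_frames):
        history.append(generate_frame(history, rng))
    return history


if __name__ == "__main__":
    frames = generate_frames(3)
    print(len(frames), "frames,", len(frames[0]), "codes per frame")
```

In a real model the random draws would be replaced by sampling from each component's output logits, and the resulting frames would be passed to the Mimi decoder to produce a waveform.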
This is a dense architecture, not Mixture of Experts. All 8.2B parameters are active during inference. At FP16, that means roughly 16 GB of VRAM just to load the weights, plus additional memory for activations and KV cache. The 2,048 token context window is relatively short by LLM standards, but for TTS it’s sufficient for multi-turn conversation and voice continuation tasks.
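The weight-memory figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces it, using decimal gigabytes (10^9 bytes). The bytes-per-parameter values for the 8-bit and 4-bit rows are idealized assumptions; real block-quantized formats store extra scale metadata, so actual files are somewhat larger.

```python
PARAMS = 8.2e9  # parameter count stated in the text

# Idealized storage cost per parameter, in bytes (assumption for 8-bit and 4-bit).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "8-bit": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        print(f"{name:>5}: {weight_memory_gb(PARAMS, size):5.1f} GB")
```

Activations and the KV cache come on top of these numbers, which is why the weights-only figure is a lower bound on required VRAM.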
The Mimi audio tokenizer operates at 48 kHz output, which is higher than the standard 24 kHz or 16 kHz found in many open TTS models. This contributes to audio quality but also increases computational cost per second of generated speech.
Miso One is designed for three primary tasks:
Expressive conversational speech. The model can vary emotion, pacing, and delivery based on text content. This is the core differentiator—most open TTS models produce flat, scripted-sounding output. Miso One can make a character sound hesitant, excited, or commanding without manual parameter tweaking.
One-shot voice cloning. Given a short reference audio clip (roughly 10 seconds), the model can continue speaking in that voice. This works through audio context conditioning—the model processes the reference clip and generates continuation audio that matches the speaker’s timbre and style.
Low-latency voice agent research. Miso Labs’ 110 ms latency claim is for their hosted API, not local inference, but the architecture is designed for streaming use cases. The model generates audio frame-by-frame, which makes it suitable for real-time voice agent pipelines if you have the hardware to keep up.
Current limitations: English only, no multilingual support. The 8.2B parameter count means this is not a lightweight model—it’s built for quality, not portability. The Modified MIT License permits commercial use but includes specific terms around voice cloning consent and watermarking.
Miso One is not a model you run on a laptop. Here’s what you need to know for local deployment.
| Quantization | VRAM (approx.) | Quality Impact |
|---|---|---|
| FP16 (full) | 16–18 GB | Reference quality |
| Q8_0 | 9–10 GB | Minimal degradation |
| Q4_K_M | 5–6 GB | Noticeable but usable |
| Q4_0 | 4.5–5 GB | Degraded prosody |
At FP16, you need a 24 GB GPU to have headroom for activations and batch processing. A 16 GB GPU (RTX 4060 Ti 16 GB, RTX 4080) is not enough for FP16: 8.2B parameters at 2 bytes each come to about 16.4 GB of weights before any activations or KV cache. On 16 GB cards, plan on a quantized variant such as Q8_0 or lower.
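As a rough planning aid, the table can be turned into a small lookup that picks the highest-quality variant fitting a given card. This is a sketch under stated assumptions: it uses the upper bound of each approximate VRAM range from the table, and the 2 GB default `headroom_gb` for activations and KV cache is a guess, not a measured requirement.

```python
from typing import Optional

# Upper bounds of the approximate VRAM ranges in the table, best quality first.
QUANT_VRAM_GB = [
    ("FP16", 18.0),
    ("Q8_0", 10.0),
    ("Q4_K_M", 6.0),
    ("Q4_0", 5.0),
]


def pick_quantization(gpu_vram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-quality variant that fits, or None if nothing fits.

    headroom_gb is an assumed allowance for activations and KV cache.
    """
    for name, needed_gb in QUANT_VRAM_GB:
        if needed_gb + headroom_gb <= gpu_vram_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (24, 16, 8, 6):
        print(f"{vram:>2} GB GPU -> {pick_quantization(vram)}")
```

With the default headroom, a 24 GB card lands on FP16, a 16 GB card on Q8_0, an 8 GB card on Q4_K_M, and a 6 GB card on nothing, so tune `headroom_gb` to your own workload before relying on the result.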
The official repository at MisoLabsAI/MisoTTS provides the inference code and setup instructions. The quickest path:
```bash
git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate
uv run python run_misotts.py
```
This downloads the model weights from Hugging Face and generates a sample conversation. For production use, you’ll want to integrate the generator.py module into your own pipeline.
Local inference latency is significantly higher than Miso Labs’ hosted API. Expect 500 ms to 2 seconds for the first audio frame on consumer hardware, depending on GPU and quantization. Streaming generation improves perceived latency but requires careful pipeline design.
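To check first-frame latency on your own hardware, you can time how long a streaming generator takes to yield its first audio chunk. The helper below is a generic sketch that works with any Python iterable of byte chunks; it does not use the MisoTTS API. `fake_stream` is a hypothetical stand-in that sleeps to simulate generation, and its chunk size and delay are arbitrary.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, bytes, Iterator[bytes]]:
    """Return (milliseconds until the first chunk, that chunk, the remaining iterator)."""
    start = time.perf_counter()
    iterator = iter(chunks)
    first = next(iterator)  # blocks until the source produces its first chunk
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, first, iterator


def fake_stream(num_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS generator."""
    for _ in range(num_chunks):
        time.sleep(delay_s)      # simulated per-chunk generation time
        yield b"\x00" * 1920     # arbitrary chunk of silent audio bytes


if __name__ == "__main__":
    elapsed_ms, first, rest = time_to_first_chunk(fake_stream())
    remaining = sum(1 for _ in rest)
    print(f"first chunk after {elapsed_ms:.1f} ms, {remaining} chunks remaining")
```

Returning the remaining iterator lets the caller keep playing audio after the measurement, so the helper can wrap a real pipeline without discarding output.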
GGUF and EXL2 quantizations are not yet available as of initial release, but community conversions are likely to appear quickly given the model’s popularity.
vs. XTTSv2 (Coqui): XTTSv2 is smaller (~1.6B parameters) and runs on far less hardware—8 GB VRAM is sufficient. It supports multilingual TTS and voice cloning. Miso One wins on emotional expressiveness and audio quality (48 kHz vs 24 kHz). XTTSv2 wins on accessibility, speed, and language coverage. Choose Miso One if you need natural conversational delivery and have the GPU budget. Choose XTTSv2 if you need multilingual support or are running on constrained hardware.
vs. Fish Speech 1.5: Fish Speech is also smaller (~500M–1B parameters) and supports multiple languages with voice cloning. It’s faster and more hardware-efficient. Miso One produces more emotionally varied output and higher sample rates. Fish Speech is the practical choice for production pipelines on consumer GPUs. Miso One is the quality choice for voice agents where natural delivery matters more than throughput.
vs. ElevenLabs (proprietary): ElevenLabs offers superior quality and latency but is a paid API. Miso One is open weights and can run locally. If you need the absolute best quality and have budget, ElevenLabs wins. If you need data sovereignty, no per-character costs, or the ability to fine-tune, Miso One is the better option.