Miso Labs

Miso One

Miso One (Miso TTS 8B) is an open-weights, English text-to-speech model from Miso Labs built for expressive, emotional delivery. It has about 8.2B parameters and follows a Sesame CSM-style design, pairing a Llama-8B backbone with a smaller Llama-300M audio decoder that produces Mimi audio codes. Miso Labs reports 110 ms time-to-first-byte on its hosted API and supports one-shot voice cloning from a short reference clip. Weights and inference code ship under a Modified MIT License, with a public API listed as coming soon.

8.2B params · Dense

Our Take

Best for: Open-source TTS workloads

A situational 8.2B-parameter dense audio model from Miso Labs. No benchmark data is available for it yet and composite scoring across modalities is still maturing, so treat the score on this page as provisional. The model is newly released, and its production-readiness is still being shaken out.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 8.2B

Architecture: Dense

Provider: Miso Labs

Download Size: 32.8 GB

Community

Likes: 213

Last Updated: 25 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

Modified MIT License

Performance & Scoring

Benchmarks

No benchmark data available for this model yet.

MBA Open Score

32.5 (tier D)

Component	Weight	Score
Benchmark	40%	50.0
Popularity	25%	21.7
Efficiency	25%	2.2
Versatility	10%	65.0
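The headline score is consistent with a plain weighted sum of the four components. The page does not state the formula, so the sketch below assumes that is how they combine; the function name is illustrative.

```python
# Weights and component scores as listed on this page.
COMPONENTS = {
    "Benchmark": (0.40, 50.0),
    "Popularity": (0.25, 21.7),
    "Efficiency": (0.25, 2.2),
    "Versatility": (0.10, 65.0),
}


def composite_score(components):
    """Weighted sum of (weight, score) pairs; assumes the weights sum to 1."""
    return sum(weight * score for weight, score in components.values())


score = composite_score(COMPONENTS)
print(f"weighted sum: {score:.3f}, rounded: {round(score, 1)}")
```

Under that assumption the low Efficiency component (2.2 at 25% weight) is what pulls the total down the most.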

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	S	5.5 GB
Acer Veriton GN100 AI Mini	Acer	S	5.5 GB
AMD Instinct MI300X	AMD	S	5.5 GB
AMD Instinct MI325X	AMD	S	5.5 GB
AMD Instinct MI355X	AMD	S	5.5 GB
AMD Radeon RX 7700 XT	AMD	S	5.5 GB
AMD Radeon RX 7800 XT	AMD	S	5.5 GB
AMD Radeon RX 7900 XT	AMD	S	5.5 GB
AMD Radeon RX 7900 XTX	AMD	S	5.5 GB
AMD Radeon RX 9070	AMD	S	5.5 GB
AMD Radeon RX 9070 XT	AMD	S	5.5 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	S	5.5 GB
Apple M4	Apple	S	5.5 GB
Apple M4 Max (40-core GPU)	Apple	S	5.5 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	S	5.5 GB
Apple M5	Apple	S	5.5 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	S	5.5 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	S	5.5 GB
Apple Mac Mini (M1, 2020)	Apple	S	5.5 GB
Apple Mac Mini (M2, 2023)	Apple	S	5.5 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	S	5.5 GB
Apple Mac Mini (M4, 2024)	Apple	S	5.5 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	S	5.5 GB
Apple Mac Studio (M1 Max, 2022)	Apple	S	5.5 GB
Apple Mac Studio (M1 Ultra, 2022)	Apple	S	5.5 GB

Rent in the Cloud

Cheapest current cloud rentals with at least 6 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3090 (Vast.ai · Spot · 24 GB VRAM)	$0.07
NVIDIA GeForce RTX 5060 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.08
NVIDIA GeForce RTX 5060 Ti (Vast.ai · On-Demand · 16 GB VRAM)	$0.08
NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.11
NVIDIA GeForce RTX 3090 (Vast.ai · On-Demand · 24 GB VRAM)	$0.12

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
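One way to make that comparison concrete is to inflate the spot rate by the share of paid time that has to be redone after interruptions. This is a rough sketch that treats lost work as the only restart cost. The 15% figure is an illustrative assumption, not measured data.

```python
def effective_rate(hourly_rate, lost_fraction):
    """Cost per hour of useful work when lost_fraction of paid time is redone."""
    if not 0 <= lost_fraction < 1:
        raise ValueError("lost_fraction must be in [0, 1)")
    return hourly_rate / (1 - lost_fraction)


def break_even_lost_fraction(spot_rate, on_demand_rate):
    """Lost-work fraction at which spot stops being cheaper than on-demand."""
    return 1 - spot_rate / on_demand_rate


# RTX 3090 rates from the rental table (USD per GPU-hour).
SPOT, ON_DEMAND = 0.07, 0.12

print(f"effective spot rate at 15% lost work: ${effective_rate(SPOT, 0.15):.3f}/h")
print(f"break-even lost fraction: {break_even_lost_fraction(SPOT, ON_DEMAND):.1%}")
```

At these prices the spot RTX 3090 stays cheaper unless interruptions waste more than about two fifths of the paid time.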

About This Model

Overview

Miso One (Miso TTS 8B) is an open-weights text-to-speech model from Miso Labs built for expressive, emotionally varied English speech. At 8.2B parameters, it’s one of the largest open TTS models available, and it targets a specific gap in the open-source stack: natural conversational delivery with genuine emotional range, not just clean audio.

Most open TTS models prioritize low latency and small footprints at the cost of prosody. Miso One takes the opposite approach—throwing parameters at the problem via a Sesame CSM-style architecture that pairs a large language backbone with a dedicated audio decoder. The result is a model that can shift tone, pacing, and affect based on text content without explicit markup or pitch tuning.

Miso Labs reports 110 ms time-to-first-byte on their hosted API, and the model supports one-shot voice cloning from a short reference clip. Weights and inference code ship under a Modified MIT License, with a public API listed as coming soon. For practitioners evaluating local TTS for voice agents, conversational interfaces, or content generation, Miso One is currently the most serious open contender for emotive speech.

Architecture & Technical Details

Miso One uses a Sesame CSM (Conversational Speech Model) architecture with two transformer components:

Backbone: A Llama-8B-style transformer that processes text and audio-frame embeddings. This is the large language model that handles semantic understanding and prosodic planning.
Audio Decoder: A smaller Llama-300M autoregressive decoder that predicts higher-order audio codebooks within each frame.

The model generates Mimi audio codes—32 codebooks per frame, with codebook 0 predicted from the backbone hidden state and codebooks 1–31 predicted autoregressively by the audio decoder. The text vocabulary is 128,256 tokens, the audio vocabulary is 2,051 tokens, and the maximum sequence length is 2,048.
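The per-frame split between the two components can be sketched as follows. This is a toy illustration with random stand-ins for both networks. The function names and the hidden size are hypothetical and do not come from the MisoTTS code.

```python
import random

NUM_CODEBOOKS = 32   # Mimi codebooks per frame, as stated above
AUDIO_VOCAB = 2051   # audio vocabulary size, as stated above
HIDDEN_SIZE = 8      # toy value; the real backbone is far wider


def backbone_step(rng):
    """Stand-in for the backbone: returns a hidden state for the next frame."""
    return [rng.random() for _ in range(HIDDEN_SIZE)]


def sample_code(rng, hidden, previous_codes):
    """Stand-in for a sampling head: draws one code id from the audio vocabulary.

    A real head would compute logits from `hidden` and `previous_codes`.
    """
    return rng.randrange(AUDIO_VOCAB)


def generate_frame(rng):
    hidden = backbone_step(rng)
    # Codebook 0 is predicted directly from the backbone hidden state.
    codes = [sample_code(rng, hidden, [])]
    # Codebooks 1..31 are predicted one at a time by the audio decoder,
    # each conditioned on the codes already chosen for this frame.
    for _ in range(1, NUM_CODEBOOKS):
        codes.append(sample_code(rng, hidden, codes))
    return codes


frame = generate_frame(random.Random(0))
print(len(frame), "codes in one frame")
```

The point of the split is that the expensive backbone runs once per frame, while the inner loop over the remaining codebooks runs on the much smaller decoder.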

This is a dense architecture, not Mixture of Experts. All 8.2B parameters are active during inference. At FP16, that means roughly 16 GB of VRAM just to load the weights, plus additional memory for activations and KV cache. The 2,048 token context window is relatively short by LLM standards, but for TTS it’s sufficient for multi-turn conversation and voice continuation tasks.
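The 16 GB figure is simply parameter count times bytes per parameter. A quick back-of-the-envelope check, which ignores activations, KV cache, and any codec weights:

```python
PARAMS = 8.2e9  # parameter count from the model specifications

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}


def weight_gb(params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_gb(PARAMS, nbytes):.1f} GB")
```

The FP32 result matches the 32.8 GB download size listed for the model, which suggests the published checkpoint is stored at 4 bytes per parameter.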

The Mimi audio tokenizer is listed here with 48 kHz output, higher than the 24 kHz or 16 kHz common in open TTS models. The reference Mimi codec released by Kyutai runs at 24 kHz, so confirm the figure against the model card. A higher sample rate helps audio quality but also raises the computational cost per second of generated speech.

Capabilities & Use Cases

Miso One is designed for three primary tasks:

Expressive conversational speech. The model can vary emotion, pacing, and delivery based on text content. This is the core differentiator—most open TTS models produce flat, scripted-sounding output. Miso One can make a character sound hesitant, excited, or commanding without manual parameter tweaking.

One-shot voice cloning. Given a short reference audio clip (roughly 10 seconds), the model can continue speaking in that voice. This works through audio context conditioning—the model processes the reference clip and generates continuation audio that matches the speaker’s timbre and style.
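Because the reference clip is fed in as audio context, it uses up part of the 2,048-position sequence. The rough budget below assumes one sequence position per audio frame and the 12.5 Hz frame rate of the reference Mimi codec. Neither figure is confirmed for Miso One on this page, and the text-token count is illustrative.

```python
import math

MAX_SEQ_LEN = 2048     # maximum sequence length from the architecture notes
FRAME_RATE_HZ = 12.5   # ASSUMPTION: reference Mimi frame rate, not confirmed here


def audio_frames(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Number of audio frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


def remaining_positions(reference_seconds, text_tokens):
    """Positions left for generated audio after the reference clip and text."""
    used = audio_frames(reference_seconds) + text_tokens
    return max(MAX_SEQ_LEN - used, 0)


left = remaining_positions(reference_seconds=10, text_tokens=100)
print(f"{left} positions left, about {left / FRAME_RATE_HZ:.0f} s of new audio")
```

Under these assumptions a 10-second reference costs only a small share of the context, so longer references or several prior turns remain affordable.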

Low-latency voice agent research. Miso Labs’ 110 ms latency claim is for their hosted API, not local inference, but the architecture is designed for streaming use cases. The model generates audio frame-by-frame, which makes it suitable for real-time voice agent pipelines if you have the hardware to keep up.

Current limitations: English only, no multilingual support. The 8.2B parameter count means this is not a lightweight model—it’s built for quality, not portability. The Modified MIT License permits commercial use but includes specific terms around voice cloning consent and watermarking.

Running Miso One Locally

Miso One is not a model you run on a laptop. Here’s what you need to know for local deployment.

VRAM Requirements

Quantization	VRAM (approx.)	Quality Impact
FP16 (full)	16–18 GB	Reference quality
Q8_0	9–10 GB	Minimal degradation
Q4_K_M	5–6 GB	Noticeable but usable
Q4_0	4.5–5 GB	Degraded prosody

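At FP16, you need a 24 GB GPU to have headroom for activations and batch processing. A 16 GB GPU (RTX 4060 Ti 16 GB, RTX 4080) falls just short: 8.2B parameters at 2 bytes each come to about 16.4 GB before activations and KV cache, so FP16 on such a card requires offloading part of the model. Q8_0 or lower is the realistic choice on 16 GB cards.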
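The table can be turned into a simple lookup that picks the highest-quality format for a given card. This is a sketch using the upper bounds from the table. The 2 GB headroom default is an assumption, and the helper name is illustrative.

```python
# Upper-bound VRAM estimates (GB) from the table, best quality first.
VRAM_ESTIMATES_GB = [
    ("FP16", 18.0),
    ("Q8_0", 10.0),
    ("Q4_K_M", 6.0),
    ("Q4_0", 5.0),
]


def best_fit(gpu_vram_gb, headroom_gb=2.0):
    """Return the highest-quality format that fits with `headroom_gb` to spare.

    Returns None if nothing fits. The default headroom is an assumption.
    """
    budget = gpu_vram_gb - headroom_gb
    for name, need_gb in VRAM_ESTIMATES_GB:
        if need_gb <= budget:
            return name
    return None


for vram in (24, 16, 8, 6):
    print(f"{vram} GB -> {best_fit(vram)}")
```

Lower `headroom_gb` if the GPU is dedicated to inference, or raise it when a desktop session or other processes share the card.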

Recommended Hardware

RTX 4090 (24 GB): Runs FP16 comfortably. Expect 2–4 seconds of audio generated per second of wall time for single utterances.
RTX 3090 (24 GB): Same capacity as 4090, slightly slower inference.
RTX 4080 (16 GB): Can run Q8_0 or Q4_K_M. FP16 weights alone (about 16.4 GB) exceed its memory, so FP16 needs offloading at a cost in speed.
M4 Max / M3 Ultra (64 GB or more unified): Runs FP16 with plenty of headroom. Unified memory removes the separate VRAM limit, but raw throughput is lower than a 4090.
Dual GPU setups: Possible with model parallelism, but not officially supported in the current inference code.

Getting Started

The official repository at MisoLabsAI/MisoTTS provides the inference code and setup instructions. The quickest path:

git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate
uv run python run_misotts.py

This downloads the model weights from Hugging Face and generates a sample conversation. For production use, you’ll want to integrate the generator.py module into your own pipeline.
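To drive the same quick-start steps from a script, for example in a provisioning job, a thin wrapper along these lines may help. It uses only the commands shown above. The `source .venv/bin/activate` step is left out because `uv run` already executes inside the project environment. The helper names are illustrative, and nothing here touches the internals of generator.py.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/MisoLabsAI/MisoTTS.git"


def setup_commands(repo_url=REPO_URL, python_version="3.10"):
    """Return (argv, working_directory) pairs mirroring the quick-start steps."""
    repo_dir = Path(repo_url.rstrip("/").split("/")[-1].removesuffix(".git"))
    return [
        (["git", "clone", repo_url], Path(".")),
        (["uv", "sync", "--python", python_version], repo_dir),
        (["uv", "run", "python", "run_misotts.py"], repo_dir),
    ]


def run_quick_start():
    """Run each step in order, stopping at the first failure."""
    for argv, cwd in setup_commands():
        subprocess.run(argv, cwd=cwd, check=True)


# Call run_quick_start() to execute; it clones the repo and downloads weights.
for argv, cwd in setup_commands():
    print(f"(in {cwd}) {' '.join(argv)}")
```

Keeping the command list separate from the runner makes it easy to log or dry-run the steps before committing to a download of roughly 33 GB.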

Performance Notes

Local inference latency is significantly higher than Miso Labs’ hosted API. Expect 500 ms to 2 seconds for the first audio frame on consumer hardware, depending on GPU and quantization. Streaming generation improves perceived latency but requires careful pipeline design.
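To check where your own setup lands in that range, measure time to first chunk around whatever streaming generator you use. The helper below is generic. `fake_stream` is a stand-in that only sleeps, not a MisoTTS API, and should be swapped for your real generator.

```python
import time


def time_to_first_chunk(chunks):
    """Consume an iterator of audio chunks.

    Returns (seconds_to_first_chunk, total_seconds, chunk_count).
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first, total, count


def fake_stream(n_chunks=5, delay_s=0.01):
    """Stand-in for a streaming TTS generator; sleeps to simulate compute."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 320


first, total, count = time_to_first_chunk(fake_stream())
print(f"first chunk after {first * 1000:.0f} ms, {count} chunks in {total * 1000:.0f} ms")
```

Run it several times and discard the first result, since the first call on a GPU usually includes one-time warm-up work.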

GGUF and EXL2 quantizations are not yet available as of initial release, but community conversions are likely to appear quickly given the model’s popularity.

How It Compares

vs. XTTSv2 (Coqui): XTTSv2 is smaller (~1.6B parameters) and runs on far less hardware—8 GB VRAM is sufficient. It supports multilingual TTS and voice cloning. Miso One wins on emotional expressiveness and audio quality (48 kHz vs 24 kHz). XTTSv2 wins on accessibility, speed, and language coverage. Choose Miso One if you need natural conversational delivery and have the GPU budget. Choose XTTSv2 if you need multilingual support or are running on constrained hardware.

vs. Fish Speech 1.5: Fish Speech is also smaller (~500M–1B parameters) and supports multiple languages with voice cloning. It’s faster and more hardware-efficient. Miso One produces more emotionally varied output and higher sample rates. Fish Speech is the practical choice for production pipelines on consumer GPUs. Miso One is the quality choice for voice agents where natural delivery matters more than throughput.

vs. ElevenLabs (proprietary): ElevenLabs offers superior quality and latency but is a paid API. Miso One is open weights and can run locally. If you need the absolute best quality and have budget, ElevenLabs wins. If you need data sovereignty, no per-character costs, or the ability to fine-tune, Miso One is the better option.
