NVIDIA's open-source, full-duplex speech-to-speech conversational model built on the Moshi architecture with a Helium backbone. Processes 24 kHz audio directly for natural, real-time simultaneous listening and speaking.
A solid 7B-parameter dense language model from NVIDIA. A pragmatic middle-ground choice when you need open weights without a flagship-sized footprint.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
No benchmark data available for this model yet.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 3.3 GB | Low |
| Q4_K_M (Recommended) | 4.8 GB | Good |
| Q5_K_M | 5.5 GB | Very Good |
| Q6_K | 6.3 GB | Excellent |
| Q8_0 | 8.1 GB | Near Perfect |
| FP16 | 14.7 GB | Full |
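As a rough illustration of how the table can be used, the sketch below picks the highest-precision format that fits a given VRAM budget. The VRAM figures are copied from the table; the function name `best_quant_for` and the `headroom_gb` parameter (space reserved for activations, cache, and the desktop) are hypothetical and not part of any NVIDIA tooling.

```python
# VRAM requirements in GB, copied from the quantization table above.
QUANT_VRAM_GB = {
    "Q2_K": 3.3,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.5,
    "Q6_K": 6.3,
    "Q8_0": 8.1,
    "FP16": 14.7,
}


def best_quant_for(vram_gb, headroom_gb=1.0):
    """Return the most demanding format that still fits, or None if nothing fits.

    headroom_gb is an assumed safety margin, not a figure from the table.
    """
    budget = vram_gb - headroom_gb
    fitting = [(need, name) for name, need in QUANT_VRAM_GB.items() if need <= budget]
    # Tuples sort by VRAM requirement first, so max() picks the largest format that fits.
    return max(fitting)[1] if fitting else None


# An 8 GB card with 1 GB of headroom leaves a 7 GB budget, so Q6_K (6.3 GB) is chosen.
choice = best_quant_for(8.0)
```

Raising `headroom_gb` is the simplest way to make the choice more conservative on a GPU that also drives a display.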
See which devices can run this model and at what quality level.
| Device | Vendor | Grade | Speed | VRAM |
|---|---|---|---|---|
| | | SS | 48.4 tok/s | 4.8 GB |
| NVIDIA GeForce RTX 4060 | NVIDIA | SS | 45.7 tok/s | 4.8 GB |
| | | SS | 75.3 tok/s | 4.8 GB |
| | | SS | 72.6 tok/s | 4.8 GB |
| Intel Arc B580 | Intel | SS | 76.6 tok/s | 4.8 GB |
| NVIDIA GeForce RTX 4070 | NVIDIA | SS | 84.7 tok/s | 4.8 GB |
| | | SS | 84.7 tok/s | 4.8 GB |
| NVIDIA GeForce RTX 5070 | NVIDIA | SS | 112.9 tok/s | 4.8 GB |
| | | AA | 86.1 tok/s | 4.8 GB |
| | | AA | 104.9 tok/s | 4.8 GB |
| | | AA | 107.6 tok/s | 4.8 GB |
| | | AA | 107.6 tok/s | 4.8 GB |
| Google Cloud TPU v5e | Google | AA | 137.7 tok/s | 4.8 GB |
| Intel Arc A770 16GB | Intel | AA | 94.1 tok/s | 4.8 GB |
| | | AA | 161.4 tok/s | 4.8 GB |
| | | AA | 48.4 tok/s | 4.8 GB |
| | | AA | 112.9 tok/s | 4.8 GB |
| | | AA | 123.7 tok/s | 4.8 GB |
| | | AA | 75.3 tok/s | 4.8 GB |
| | | AA | 150.6 tok/s | 4.8 GB |
| | | AA | 161.4 tok/s | 4.8 GB |
| | | AA | 134.5 tok/s | 4.8 GB |
| | | AA | 161.4 tok/s | 4.8 GB |
| NVIDIA GeForce RTX 3090 | NVIDIA | AA | 157.3 tok/s | 4.8 GB |
| | | AA | 169.4 tok/s | 4.8 GB |
Energy cost on AMD Radeon RX 7600 8GB (~48 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only): PersonaPlex 7B on AMD Radeon RX 7600 8GB · ~48 tok/s · 165W | $0.114 |
| GPT-5.5 (OpenAI · in $5.00 · out $30.00) | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic · in $5.00 · out $25.00) | $11.00 |
| Gemini 3.5 Flash (Google · in $1.50 · out $9.00) | $3.75 |
| Grok 4.3 (xAI · in $1.25 · out $2.50) | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
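The arithmetic behind the comparison can be sketched as follows. The 70/30 blend, the 165 W draw, and the ~48 tok/s throughput come from the table; the electricity rate is not stated on the page, so the `usd_per_kwh=0.12` used here is an assumption that happens to land close to the listed local figure. Function names are illustrative.

```python
def blended_price(input_usd, output_usd, input_share=0.70):
    """Blend per-1M-token input and output prices at the stated 70/30 split."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


def local_energy_cost(tokens_per_s, watts, usd_per_kwh, tokens=1_000_000):
    """Energy-only cost of generating `tokens` tokens locally (no hardware amortisation)."""
    hours = tokens / tokens_per_s / 3600.0
    kwh = hours * watts / 1000.0
    return kwh * usd_per_kwh


# Prices from the table: $5.00 in, $30.00 out blends to about $12.50 per 1M tokens.
gpt_blend = blended_price(5.00, 30.00)

# Assumed electricity rate of $0.12/kWh; the page does not state the rate it used.
local_cost = local_energy_cost(tokens_per_s=48, watts=165, usd_per_kwh=0.12)
```

Because the local figure scales linearly with the electricity rate, a rate twice as high simply doubles the energy cost per million tokens.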
Cheapest current cloud rentals with at least 5 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 3080 (Vast.ai · Spot · 10 GB VRAM) | $0.04 |
| NVIDIA GeForce RTX 3080 (Vast.ai · On-Demand · 10 GB VRAM) | $0.05 |
PersonaPlex 7B is NVIDIA’s open-source, full-duplex speech-to-speech conversational model. It is built on the Moshi architecture with a Helium backbone and processes 24 kHz audio directly, enabling simultaneous listening and speaking. This is not a cascaded ASR-LLM-TTS pipeline—it is a single Transformer that ingests audio tokens and outputs both text and audio tokens in a streaming, autoregressive fashion.
With 7 billion dense parameters, PersonaPlex occupies a niche that few models address: real-time voice interaction with persona control. It competes directly with Kyutai’s Moshi (the architecture it’s based on) and, at a higher level, with turn-based voice agents that stitch together separate models. For developers building voice-enabled applications that need interruptions, backchannels, and natural turn-taking, PersonaPlex eliminates the latency and complexity of a multi-model stack.

PersonaPlex uses a dense Transformer with 7B parameters, not a mixture-of-experts. This means all parameters are active for every forward pass, so inference speed and VRAM consumption are straightforward to predict. The model operates in a dual-stream configuration: one stream processes the user’s incoming audio, the other stream tracks the agent’s own speech and text. Both streams share model state, allowing the agent to continue listening while speaking and to adjust its response when the user interrupts.
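To make the dual-stream idea concrete, here is a toy sketch of a full-duplex loop in which every time step consumes one user frame and emits one agent frame, with both histories kept in shared state. It is not PersonaPlex's implementation: the frames are plain strings rather than audio tokens, and the `DuplexState`, `toy_step`, and `run_duplex` names and the simple "yield when the user speaks" rule are all hypothetical stand-ins for what the real model learns.

```python
from dataclasses import dataclass, field


@dataclass
class DuplexState:
    """Shared history of both streams (a toy stand-in for shared model state)."""
    user_frames: list = field(default_factory=list)
    agent_frames: list = field(default_factory=list)


def toy_step(state, user_frame, planned_reply):
    """Consume one user frame and emit one agent frame in the same time step."""
    state.user_frames.append(user_frame)
    if user_frame != "silence":
        # Toy barge-in rule: the agent pauses while the user is speaking.
        agent_frame = "silence"
    elif planned_reply:
        agent_frame = planned_reply.pop(0)
    else:
        agent_frame = "silence"
    state.agent_frames.append(agent_frame)
    return agent_frame


def run_duplex(user_stream, reply):
    """Run the loop over a user stream; listening never stops while the agent talks."""
    state = DuplexState()
    planned = list(reply)
    for frame in user_stream:
        toy_step(state, frame, planned)
    return state


# The user interrupts on the third frame; the agent pauses, then resumes its reply.
state = run_duplex(["silence", "silence", "hey", "silence"], ["hi", "there", "friend"])
# state.agent_frames is ["hi", "there", "silence", "friend"]
```

The point of the sketch is the loop shape: a turn-based pipeline would wait for the user stream to end before producing any agent frames, whereas here both streams advance together.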
The architecture is directly inspired by Moshi, with a neural audio codec (likely NVIDIA’s own or a variant) that encodes 24 kHz audio into discrete tokens. The model predicts both text tokens (representing the content of the response) and audio tokens (representing prosody, voice, and speaking style) autoregressively. This joint prediction is key to achieving full-duplex behavior without explicit segmentation.
Context length is not specified by NVIDIA, but given the streaming nature and the use of audio tokens, the effective context is limited by the audio codec’s window size and the model’s attention span. In practice, PersonaPlex is designed for conversational turns—not long-form document processing.
PersonaPlex excels at real-time voice conversation with persona consistency. Its core capabilities are chat, reasoning, and instruction-following, all mediated through speech. The model supports two forms of persona control: text prompting, which sets the agent's role, and voice prompting, which sets its voice.
This dual prompting lets you define both how the agent sounds and what it knows. For example, you can create a customer support agent that speaks like a calm, professional representative and references your product documentation. Or a game NPC that adopts a pirate accent and knows the lore of your world.
Concrete use cases include:
Because PersonaPlex handles both speech understanding and generation in one model, it also retains non-textual cues—tone, emphasis, hesitation—that are lost in text-only pipelines.
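PersonaPlex 7B is a dense model, so VRAM scales roughly linearly with precision. Per the quantization table above, expect about 3.3 GB at Q2_K, 4.8 GB at Q4_K_M, 8.1 GB at Q8_0, and 14.7 GB at FP16.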
The model also requires the Opus audio codec development library (libopus-dev on Ubuntu) and PyTorch (torch) built for CUDA 12.x or 13.x. For Blackwell-based GPUs (RTX 50 series), NVIDIA specifically recommends installing PyTorch from the CUDA 13.0 (cu130) wheel index.
Tokens per second (TPS) for audio tokens are not directly comparable to text token throughput, but on an RTX 4090 with Q4_K_M, you can expect real-time (or better) performance—latency under 300 ms for typical conversational exchanges. On an RTX 3060 12 GB, Q4_K_M may drop to near-real-time, sufficient for non-critical applications.
The fastest path to running PersonaPlex 7B locally:
```bash
# Accept the model license on Hugging Face first, then:
export HF_TOKEN=<your_token>
ollama pull nvidia/personaplex-7b-v1
ollama serve
```
This downloads a pre-quantized Q4_K_M version and sets up a streaming API compatible with OpenAI’s chat completions. The audio codec handling is abstracted by the Ollama integration, but for full-duplex control (barge-in, voice input), you’ll need to use the Python server from the NVIDIA repository.
For full-duplex control and custom voice prompts, install the model directly:
```bash
git clone https://github.com/NVIDIA/personaplex.git
cd personaplex
pip install moshi/.
# For Blackwell GPUs:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```
Then run python -m moshi.server with your Hugging Face token and audio device configured.
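If you script the setup, the Blackwell-specific step can be decided automatically. The sketch below is one possible approach, not part of the NVIDIA repository: it reads the GPU's compute capability from `nvidia-smi` and returns the cu130 index URL from the install command above when the GPU looks like Blackwell. Treating a compute capability major version of 10 or higher as Blackwell is an assumption, and the helper names are hypothetical.

```python
import subprocess

# Index URL taken from the Blackwell install command above.
CU130_INDEX = "https://download.pytorch.org/whl/cu130"


def parse_compute_cap(text):
    """Parse the first line of nvidia-smi's compute_cap output, e.g. '12.0' -> (12, 0)."""
    first_line = text.strip().splitlines()[0]
    major, minor = first_line.strip().split(".")
    return int(major), int(minor)


def torch_index_for(capability):
    """Return the extra wheel index for Blackwell-class GPUs, or None otherwise.

    Assumption: a compute capability major version >= 10 means Blackwell or newer.
    """
    major, _minor = capability
    return CU130_INDEX if major >= 10 else None


def detect_capability():
    """Query the first GPU's compute capability; requires nvidia-smi on PATH."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_cap(result.stdout)
```

On a machine with an NVIDIA driver installed, `torch_index_for(detect_capability())` yields either the cu130 URL to pass to `--index-url` or `None`, in which case the repository's default install applies.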
PersonaPlex 7B vs. Moshi (Kyutai)
PersonaPlex is built on the Moshi architecture, so the core technical approach is the same. The key difference is persona control: PersonaPlex adds explicit text and voice prompting for role and voice conditioning. Moshi is more of a general-purpose full-duplex model without targeted persona features. If you need consistent character identity (e.g., for a game NPC or a brand-specific voice assistant), PersonaPlex is the better choice. If you’re building a generic conversation agent, Moshi may be simpler to deploy because it has a smaller community model weight.
PersonaPlex 7B vs. turn-based speech pipelines (Whisper + LLM + TTS)
The cascade approach (e.g., Whisper + Llama 3 8B + XTTS) gives you more flexibility in choosing each component but suffers from higher latency (often 1–3 seconds per turn), no support for interruptions, and loss of prosodic information. PersonaPlex removes the latency penalty and enables natural human-like conversation. The tradeoff: you are locked into a single model that must be fine-tuned for your domain, whereas in a cascade you can independently upgrade ASR, LLM, or TTS.
For developers who want to run a 7B model on a consumer GPU and need real-time voice interaction—not just text chat—PersonaPlex 7B is currently the most pragmatic open-source option.
| Option | Cost / GPU-hour |
|---|---|
| | $0.05 |
| NVIDIA GeForce RTX 3090 (Vast.ai · On-Demand · 24 GB VRAM) | $0.08 |
| NVIDIA GeForce RTX 5070 (Vast.ai · Spot · 12 GB VRAM) | $0.08 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
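To compare a rental against local energy cost or API pricing, the hourly rate can be converted to a cost per million tokens. The sketch below does that using rates and throughputs listed on this page; the `restart_overhead` parameter (extra time lost to spot interruptions, as a fraction) is a hypothetical knob, and the function name is illustrative.

```python
def rental_cost_per_million(usd_per_gpu_hour, tokens_per_s, restart_overhead=0.0):
    """Cost to generate 1M tokens on a rented GPU.

    restart_overhead is an assumed fraction of extra time lost to interruptions,
    e.g. 0.25 means the job takes 25% longer than uninterrupted generation.
    """
    hours = 1_000_000 / tokens_per_s / 3600.0
    return usd_per_gpu_hour * hours * (1.0 + restart_overhead)


# RTX 3090 on-demand at $0.08/hour and 157.3 tok/s: roughly $0.14 per 1M tokens.
rtx3090_cost = rental_cost_per_million(0.08, 157.3)

# At equal throughput, a $0.04 spot rate only matches a $0.05 on-demand rate
# once restarts add 25% to the run time.
spot_cost = rental_cost_per_million(0.04, 100.0, restart_overhead=0.25)
on_demand_cost = rental_cost_per_million(0.05, 100.0)
```

The throughput of 100 tok/s in the spot comparison is a placeholder; it cancels out when both tiers run on the same GPU.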