Maya Research's 3B-parameter Llama-based TTS model for Hindi, English and Hinglish code-mixed speech at 24 kHz.
Veena is a 3B-parameter text-to-speech model from Maya Research, built on a Llama architecture backbone. It generates 24kHz speech in Hindi, English, and Hinglish code-mixed output using the SNAC neural codec. Released under Apache 2.0, Veena is one of the few open-weight TTS models designed specifically for Indian language synthesis, filling a gap that proprietary APIs have dominated.
Maya Research positions Veena as the foundation layer of their voice intelligence work. The model is autoregressive—it generates audio tokens sequentially rather than through diffusion or flow-matching methods common in other TTS architectures. This means inference patterns resemble those of a language model: you feed text in, get token sequences out, then decode them into audio.
At 3B parameters, Veena sits at the smaller end of the TTS model spectrum. That’s intentional—smaller autoregressive models can run on consumer hardware, unlike the larger diffusion-based TTS systems that typically require datacenter GPUs. The tradeoff is that generation quality and voice control depend heavily on how well the training data covers the desired use case.
Veena uses a dense transformer architecture with 3B parameters, not a mixture-of-experts setup. Every forward pass activates all parameters, which means VRAM consumption is predictable: roughly 6GB at FP16, 3.5GB at 8-bit, and under 2GB with 4-bit quantization. For a TTS model, this is manageable on most modern GPUs.
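To see where those footprints come from, a quick back-of-envelope calculation (weights only; KV cache, activations, and quantization overhead add to these figures):

```python
# Back-of-envelope VRAM estimate for a 3B dense model at several precisions.
PARAMS = 3e9

for label, bytes_per_param in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{label}: ~{gib:.1f} GiB of weights")
# FP16: ~5.6 GiB, 8-bit: ~2.8 GiB, 4-bit: ~1.4 GiB -- consistent with the
# ~6GB / 3.5GB / <2GB figures above once runtime overhead is added.
```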
The model uses the SNAC neural codec for audio decoding at 24kHz. SNAC is a residual vector quantizer that compresses audio into discrete tokens, which the transformer then predicts autoregressively. This is the same approach used by models like AudioLM and SpeechGPT—treat speech generation as a language modeling problem over discrete audio codes.
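A minimal round trip through SNAC, assuming the open `hubertsiuzdak/snac_24khz` checkpoint from the `snac` package (how Veena maps its generated token ids onto these codes is defined by the model card, not shown here):

```python
import torch
from snac import SNAC  # pip install snac

# 24 kHz SNAC checkpoint -- an assumption; confirm which codec build Veena uses.
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# One second of silence as a stand-in waveform: [batch, channels, samples].
audio = torch.zeros(1, 1, 24000)

with torch.inference_mode():
    codes = codec.encode(audio)          # list of LongTensors, one per RVQ level
    reconstructed = codec.decode(codes)  # back to a 24 kHz waveform

print([c.shape for c in codes])  # coarser RVQ levels carry fewer tokens per second
```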
Key architectural details:

- Dense Llama-style transformer backbone, 3B parameters, all active on every forward pass
- SNAC neural codec for audio tokenization and 24kHz decoding
- Autoregressive generation: audio tokens predicted sequentially, one forward pass per token
- Four preset speaker voices (kavya, agastya, maitri, vinaya)
- Apache 2.0 license

The autoregressive approach means latency scales with output length. Maya Research reports sub-80ms latency on H100-80GB GPUs. On consumer hardware, expect longer generation times: the model processes audio tokens sequentially, and each token requires a forward pass, as the sketch below illustrates.
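Here is the shape of a greedy autoregressive decode loop, written generically for any Hugging Face causal LM (an illustrative sketch, not Veena's actual API; real implementations also use a KV cache so each step avoids recomputing the full prefix):

```python
import torch

def greedy_decode(model, input_ids, eos_id, max_new_tokens=1024):
    """One forward pass per generated token, so wall-clock time grows
    linearly with output length. Assumes batch size 1."""
    for _ in range(max_new_tokens):
        with torch.inference_mode():
            logits = model(input_ids).logits           # [batch, seq, vocab]
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == eos_id:                # e.g. the END_OF_SPEECH token id
            break
    return input_ids
```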
Veena supports three language modes: Hindi, English, and Hinglish code-mixed speech. The model was trained on over 60,000 proprietary utterances from four professional voice artists, which gives it four distinct speaker voices: kavya, agastya, maitri, and vinaya. Each voice has unique vocal characteristics, though the model card does not specify gender or accent details.
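The model card defines how a speaker is selected in the prompt. As a hypothetical illustration of the pattern only (the tag syntax below is assumed, not confirmed; check the model card for the real format):

```python
# Hypothetical prompt builder -- the actual speaker-tag syntax is in the model card.
VOICES = ("kavya", "agastya", "maitri", "vinaya")

def build_prompt(text: str, voice: str = "kavya") -> str:
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; choose one of {VOICES}")
    return f"<spk_{voice}> {text}"  # assumed tag convention

print(build_prompt("नमस्ते, आप कैसे हैं?", voice="maitri"))
```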
Concrete use cases:

- Pre-rendered Hindi, English, or Hinglish speech for apps, videos, and narration, where generation can run offline
- Batch voice-over or announcement generation on consumer GPUs using 4-bit or 8-bit quantization
- Interactive voice applications on H100-class hardware, where the reported sub-80ms latency applies
The model does not support speaker adaptation or fine-tuning out of the box—you get the four preset voices. If you need custom voice cloning or accent control, Veena is not the right tool.
Veena’s modest parameter count makes it feasible on consumer hardware, but the autoregressive generation loop means you need enough VRAM to hold both the model and the generated token sequence.
Minimum requirements (4-bit quantization):

- GPU with roughly 2GB of free VRAM for the quantized weights, plus headroom for the generated token sequence and the SNAC decoder
- bitsandbytes for 4-bit loading (see the setup notes below)

Recommended setup (8-bit or FP16):

- Roughly 3.5GB of VRAM at 8-bit, or 6GB at FP16, per the footprints above
- A recent CUDA GPU; for sequential token generation, memory bandwidth matters more than core count

The sketch below checks whether a given GPU clears these budgets.
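A rough fit check using the weight footprints quoted above (actual headroom depends on context length and the SNAC decoder):

```python
import torch

# Approximate footprints from above, in GiB (weights only, rounded).
FOOTPRINT_GIB = {"fp16": 6.0, "8bit": 3.5, "4bit": 2.0}

if torch.cuda.is_available():
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    for mode, need in FOOTPRINT_GIB.items():
        verdict = "ok" if total_gib > need + 1.0 else "tight/no"  # ~1 GiB headroom
        print(f"{mode}: need ~{need} GiB, have {total_gib:.1f} GiB -> {verdict}")
else:
    print("No CUDA device detected; CPU inference will be very slow.")
```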
Veena’s performance depends heavily on your GPU’s memory bandwidth and compute capability. The model generates audio tokens one at a time, so raw token throughput is the bottleneck.
Real-world tests reported by practitioners show that generating a 7-second audio clip takes over 22 seconds on mid-range hardware. This makes Veena unsuitable for real-time applications like live call handling on consumer GPUs. On H100-class hardware, the model achieves sub-80ms latency as advertised.
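Put in real-time-factor terms, using the figures above:

```python
# Real-time factor (RTF) = wall-clock generation time / audio duration.
# Numbers from the report above: a 7-second clip in ~22 seconds on mid-range hardware.
audio_seconds = 7.0
generation_seconds = 22.0

rtf = generation_seconds / audio_seconds
print(f"RTF ~ {rtf:.1f}x real time")  # ~3.1x: each second of audio costs ~3 s to generate
# Real-time use needs RTF < 1.0, which per the vendor figures requires H100-class hardware.
```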
The quickest path is via the Hugging Face Transformers integration. Install transformers, torch, torchaudio, snac, and bitsandbytes. Load the model with 4-bit quantization using BitsAndBytesConfig and the trust_remote_code=True flag. The model uses special control tokens (START_OF_SPEECH_TOKEN, END_OF_SPEECH_TOKEN, etc.) to delimit speech segments—these are fixed and documented in the model card.
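A minimal loading sketch under those instructions, assuming the Hugging Face repo id `maya-research/veena` (confirm the exact id, prompt format, and audio-token decoding on the model card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "maya-research/veena"  # assumed repo id; verify on the model card

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)

# Prompt format, control tokens (START_OF_SPEECH_TOKEN etc.), and the mapping
# from generated ids to SNAC codes are fixed by the model card -- follow it here.
inputs = tokenizer("नमस्ते", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.4)
# `out` holds audio tokens; decode them through the SNAC codec as sketched earlier.
```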
Ollama support is not yet available for Veena. You’ll need to run it directly via the Hugging Face pipeline or a custom inference script.
Vs. Bark (by Suno): Bark is a 1.2B-parameter TTS model that supports multiple languages including Hindi. Bark generates audio with prosody and emotion but uses a different architecture (EnCodec decoder + GPT-style model). Bark requires ~4GB VRAM at FP16 and runs slower than Veena on equivalent hardware—expect 3-5x longer generation times. Bark offers more voice variety and emotional control, but Veena produces cleaner Hindi and Hinglish output due to its targeted training data.
Vs. WhisperSpeech: WhisperSpeech is a 1.5B-parameter TTS model based on the Whisper encoder. It supports English and a few other languages but has no specific Hindi or code-mixed training. WhisperSpeech runs faster than Veena due to its non-autoregressive architecture (parallel generation), but the quality for Indian languages is noticeably worse. If you only need English TTS, WhisperSpeech is faster and smaller. If you need Hindi or Hinglish, Veena is the better choice despite slower inference.
When to choose Veena: You need open-weight TTS for Hindi, English, or Hinglish, you can tolerate non-real-time generation on consumer hardware, and you want Apache 2.0 licensing. If you need real-time performance on consumer GPUs, look at non-autoregressive alternatives or plan to run Veena on datacenter hardware.