Maya Research's 3B-parameter Llama-based TTS model for Hindi, English and Hinglish code-mixed speech at 24 kHz.
Veena is a 3B-parameter text-to-speech model from Maya Research, built on a Llama architecture backbone. It generates 24kHz speech in Hindi, English, and Hinglish code-mixed output using the SNAC neural codec. Released under Apache 2.0, Veena is one of the few open-weight TTS models designed specifically for Indian language synthesis, filling a gap that proprietary APIs have dominated.
Maya Research positions Veena as the foundation layer of their voice intelligence work. The model is autoregressive—it generates audio tokens sequentially rather than through diffusion or flow-matching methods common in other TTS architectures. This means inference patterns resemble those of a language model: you feed text in, get token sequences out, then decode them into audio.
At 3B parameters, Veena sits at the smaller end of the TTS model spectrum. That’s intentional—smaller autoregressive models can run on consumer hardware, unlike the larger diffusion-based TTS systems that typically require datacenter GPUs. The tradeoff is that generation quality and voice control depend heavily on how well the training data covers the desired use case.
Veena uses a dense transformer architecture with 3B parameters, not a mixture-of-experts setup. Every forward pass activates all parameters, which means VRAM consumption is predictable: roughly 6GB at FP16, 3.5GB at 8-bit, and under 2GB with 4-bit quantization. For a TTS model, this is manageable on most modern GPUs.
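To see where those footprints come from, a quick back-of-envelope calculation (weights only; KV cache, activations, and quantization overhead add to these figures):

```python
# Back-of-envelope VRAM estimate for a 3B dense model at several precisions.
PARAMS = 3e9

for label, bytes_per_param in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{label}: ~{gib:.1f} GiB of weights")
# FP16: ~5.6 GiB, 8-bit: ~2.8 GiB, 4-bit: ~1.4 GiB -- consistent with the
# ~6GB / 3.5GB / <2GB figures above once runtime overhead is added.
```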
The model uses the SNAC neural codec for audio decoding at 24kHz. SNAC is a residual vector quantizer that compresses audio into discrete tokens, which the transformer then predicts autoregressively. This is the same approach used by models like AudioLM and SpeechGPT—treat speech generation as a language modeling problem over discrete audio codes.
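A minimal round trip through SNAC, assuming the open `hubertsiuzdak/snac_24khz` checkpoint from the `snac` package (how Veena maps its generated token ids onto these codes is defined by the model card, not shown here):

```python
import torch
from snac import SNAC  # pip install snac

# 24 kHz SNAC checkpoint -- an assumption; confirm which codec build Veena uses.
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# One second of silence as a stand-in waveform: [batch, channels, samples].
audio = torch.zeros(1, 1, 24000)

with torch.inference_mode():
    codes = codec.encode(audio)          # list of LongTensors, one per RVQ level
    reconstructed = codec.decode(codes)  # back to a 24 kHz waveform

print([c.shape for c in codes])  # coarser RVQ levels carry fewer tokens per second
```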
Key architectural details:

- Dense Llama-style transformer backbone, 3B parameters, all active on every forward pass
- SNAC neural codec for audio tokenization and 24kHz decoding
- Autoregressive generation: audio tokens predicted sequentially, one forward pass per token
- Four preset speaker voices (kavya, agastya, maitri, vinaya)
- Apache 2.0 license

The autoregressive approach means latency scales with output length. Maya Research reports sub-80ms latency on H100-80GB GPUs. On consumer hardware, expect longer generation times: the model processes audio tokens sequentially, and each token requires a forward pass, as the sketch below illustrates.
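Here is the shape of a greedy autoregressive decode loop, written generically for any Hugging Face causal LM (an illustrative sketch, not Veena's actual API; real implementations also use a KV cache so each step avoids recomputing the full prefix):

```python
import torch

def greedy_decode(model, input_ids, eos_id, max_new_tokens=1024):
    """One forward pass per generated token, so wall-clock time grows
    linearly with output length. Assumes batch size 1."""
    for _ in range(max_new_tokens):
        with torch.inference_mode():
            logits = model(input_ids).logits           # [batch, seq, vocab]
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == eos_id:                # e.g. the END_OF_SPEECH token id
            break
    return input_ids
```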
Veena supports three language modes: Hindi, English, and Hinglish code-mixed speech. The model was trained on over 60,000 proprietary utterances from four professional voice artists, which gives it four distinct speaker voices: kavya, agastya, maitri, and vinaya. Each voice has unique vocal characteristics, though the model card does not specify gender or accent details.
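The model card defines how a speaker is selected in the prompt. As a hypothetical illustration of the pattern only (the tag syntax below is assumed, not confirmed; check the model card for the real format):

```python
# Hypothetical prompt builder -- the actual speaker-tag syntax is in the model card.
VOICES = ("kavya", "agastya", "maitri", "vinaya")

def build_prompt(text: str, voice: str = "kavya") -> str:
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; choose one of {VOICES}")
    return f"<spk_{voice}> {text}"  # assumed tag convention

print(build_prompt("नमस्ते, आप कैसे हैं?", voice="maitri"))
```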
Concrete use cases:

- Pre-rendered Hindi, English, or Hinglish speech for apps, videos, and narration, where generation can run offline
- Batch voice-over or announcement generation on consumer GPUs using 4-bit or 8-bit quantization
- Interactive voice applications on H100-class hardware, where the reported sub-80ms latency applies
The model does not support speaker adaptation or fine-tuning out of the box—you get the four preset voices. If you need custom voice cloning or accent control, Veena is not the right tool.
Veena’s modest parameter count makes it feasible on consumer hardware, but the autoregressive generation loop means you need enough VRAM to hold both the model and the generated token sequence.
Minimum requirements (4-bit quantization):

- GPU with roughly 2GB of free VRAM for the quantized weights, plus headroom for the generated token sequence and the SNAC decoder
- bitsandbytes for 4-bit loading (see the setup notes below)

Recommended setup (8-bit or FP16):

- Roughly 3.5GB of VRAM at 8-bit, or 6GB at FP16, per the footprints above
- A recent CUDA GPU; for sequential token generation, memory bandwidth matters more than core count

The sketch below checks whether a given GPU clears these budgets.
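A rough fit check using the weight footprints quoted above (actual headroom depends on context length and the SNAC decoder):

```python
import torch

# Approximate footprints from above, in GiB (weights only, rounded).
FOOTPRINT_GIB = {"fp16": 6.0, "8bit": 3.5, "4bit": 2.0}

if torch.cuda.is_available():
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    for mode, need in FOOTPRINT_GIB.items():
        verdict = "ok" if total_gib > need + 1.0 else "tight/no"  # ~1 GiB headroom
        print(f"{mode}: need ~{need} GiB, have {total_gib:.1f} GiB -> {verdict}")
else:
    print("No CUDA device detected; CPU inference will be very slow.")
```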
Veena’s performance depends heavily on your GPU’s memory bandwidth and compute capability. The model generates audio tokens one at a time, so raw token throughput is the bottleneck.
Real-world tests reported by practitioners show that generating a 7-second audio clip takes over 22 seconds on mid-range hardware. This makes Veena unsuitable for real-time applications like live call handling on consumer GPUs. On H100-class hardware, the model achieves sub-80ms latency as advertised.
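Put in real-time-factor terms, using the figures above:

```python
# Real-time factor (RTF) = wall-clock generation time / audio duration.
# Numbers from the report above: a 7-second clip in ~22 seconds on mid-range hardware.
audio_seconds = 7.0
generation_seconds = 22.0

rtf = generation_seconds / audio_seconds
print(f"RTF ~ {rtf:.1f}x real time")  # ~3.1x: each second of audio costs ~3 s to generate
# Real-time use needs RTF < 1.0, which per the vendor figures requires H100-class hardware.
```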
The quickest path is via the Hugging Face Transformers integration. Install transformers, torch, torchaudio, snac, and bitsandbytes. Load the model with 4-bit quantization using BitsAndBytesConfig and the trust_remote_code=True flag. The model uses special control tokens (START_OF_SPEECH_TOKEN, END_OF_SPEECH_TOKEN, etc.) to delimit speech segments—these are fixed and documented in the model card.
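A minimal loading sketch under those instructions, assuming the Hugging Face repo id `maya-research/veena` (confirm the exact id, prompt format, and audio-token decoding on the model card):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "maya-research/veena"  # assumed repo id; verify on the model card

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)

# Prompt format, control tokens (START_OF_SPEECH_TOKEN etc.), and the mapping
# from generated ids to SNAC codes are fixed by the model card -- follow it here.
inputs = tokenizer("नमस्ते", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.4)
# `out` holds audio tokens; decode them through the SNAC codec as sketched earlier.
```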
Ollama support is not yet available for Veena. You’ll need to run it directly via the Hugging Face pipeline or a custom inference script.
Vs. Bark (by Suno): Bark is a 1.2B-parameter TTS model that supports multiple languages including Hindi. Bark generates audio with prosody and emotion but uses a different architecture (EnCodec decoder + GPT-style model). Bark requires ~4GB VRAM at FP16 and runs slower than Veena on equivalent hardware—expect 3-5x longer generation times. Bark offers more voice variety and emotional control, but Veena produces cleaner Hindi and Hinglish output due to its targeted training data.
Vs. WhisperSpeech: WhisperSpeech is a 1.5B-parameter TTS model based on the Whisper encoder. It supports English and a few other languages but has no specific Hindi or code-mixed training. WhisperSpeech runs faster than Veena due to its non-autoregressive architecture (parallel generation), but the quality for Indian languages is noticeably worse. If you only need English TTS, WhisperSpeech is faster and smaller. If you need Hindi or Hinglish, Veena is the better choice despite slower inference.
When to choose Veena: You need open-weight TTS for Hindi, English, or Hinglish, you can tolerate non-real-time generation on consumer hardware, and you want Apache 2.0 licensing. If you need real-time performance on consumer GPUs, look at non-autoregressive alternatives or plan to run Veena on datacenter hardware.