
10B parameter Asymmetric Diffusion Transformer (AsymmDiT) with 362M AsymmVAE achieving 128× compression. Visual stream (dim 3072) gets 4× more params than text stream (dim 1536).
A solid 10B-parameter dense video generator from Genmo AI. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Mochi 1 Preview is Genmo AI’s open-source video generation model that directly targets the gap between proprietary systems like OpenAI’s Sora and what’s available to run on your own hardware. At 10 billion parameters, it’s not a lightweight toy — it’s a serious video generation model designed for high-fidelity motion synthesis and strong prompt adherence, released under Apache 2.0 license.
This is a text-to-video model, meaning you feed it a text prompt and it outputs a video clip. The architecture is an Asymmetric Diffusion Transformer (AsymmDiT), which represents a deliberate design choice: allocate more compute to the visual stream than the text stream. The visual stream operates at dimension 3072, while the text stream runs at dimension 1536 — roughly 4× more parameters dedicated to video processing. That asymmetry is the key to why Mochi 1 produces coherent motion rather than the warping artifacts common in other open video models.
Mochi 1 Preview competes directly with closed video generation APIs and with other open models like CogVideo. Its claim is straightforward: you can run state-of-the-art video generation locally, not just stream it from a cloud endpoint. That matters for practitioners who need control over their pipeline, privacy for their prompts, or simply want to avoid per-generation costs.
The Asymmetric Diffusion Transformer is the headline feature. Most diffusion transformers allocate equal capacity to text and visual processing. Mochi 1 doesn’t — it dedicates 4× more parameters to the visual stream (dim 3072) than the text stream (dim 1536). This isn’t an accident; video generation is fundamentally a visual problem, and the asymmetry reflects that priority.
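A toy sketch of that two-stream split helps make it concrete. The stream widths below come from the model card; the joint-attention layout, head count, and token counts are illustrative assumptions, not Genmo's actual code:

```python
import torch
import torch.nn as nn

# Visual stream is 2x wider than the text stream; since projection params
# scale with width squared, that's roughly 4x the parameters per layer.
VIS_DIM, TXT_DIM = 3072, 1536

class AsymmJointAttention(nn.Module):
    def __init__(self, heads: int = 24):
        super().__init__()
        self.txt_in = nn.Linear(TXT_DIM, VIS_DIM)    # lift text into the visual width
        self.attn = nn.MultiheadAttention(VIS_DIM, heads, batch_first=True)
        self.txt_out = nn.Linear(VIS_DIM, TXT_DIM)   # project text back down

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        x = torch.cat([vis, self.txt_in(txt)], dim=1)  # one joint token sequence
        y, _ = self.attn(x, x, x)
        n_vis = vis.shape[1]
        return y[:, :n_vis], self.txt_out(y[:, n_vis:])

vis = torch.randn(1, 256, VIS_DIM)   # toy video-latent tokens
txt = torch.randn(1, 77, TXT_DIM)    # toy text tokens
v, t = AsymmJointAttention()(vis, txt)
print(v.shape, t.shape)              # (1, 256, 3072) (1, 77, 1536)
```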
The model uses a 362M parameter Asymmetric VAE (AsymmVAE) that achieves 128× compression. High compression matters for video because raw video data is enormous — you need aggressive latent space reduction to make training and inference tractable. The VAE encodes video frames into a compressed latent representation, the diffusion transformer processes that latent, and the VAE decodes the result back into pixel space.
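To see what 128× buys you, here is the back-of-envelope arithmetic for the standard 31-frame, 848×480 clip. The 128× ratio is the model card's number; storing values in bf16 is an illustrative assumption:

```python
# Raw clip vs. latent size for Mochi's standard 31-frame 848x480 output.
frames, height, width, channels = 31, 480, 848, 3
raw_values = frames * height * width * channels
latent_values = raw_values // 128          # AsymmVAE's stated compression
print(f"raw:    {raw_values:>12,} values ({raw_values * 2 / 2**20:.0f} MiB in bf16)")
print(f"latent: {latent_values:>12,} values ({latent_values * 2 / 2**20:.1f} MiB in bf16)")
# raw:      37,854,720 values (72 MiB in bf16)
# latent:      295,740 values (0.6 MiB in bf16)
```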
Mochi 1 is a dense model, not Mixture of Experts. All 10B parameters are active during inference. There’s no routing or conditional computation — every forward pass uses the full parameter count. That means VRAM requirements are straightforward: you need enough memory to load the entire model, plus activation memory for the sequence length you’re generating.
The model uses non-square QKV and output projection layers to reduce inference memory requirements. This is an implementation detail that matters for local deployment — it lowers the memory footprint without sacrificing quality. A single T5-XXL model handles text encoding, which adds its own VRAM cost when running locally.
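To make "non-square" concrete, here is a minimal PyTorch comparison; the narrower attention width is an illustrative assumption, not Mochi's actual value:

```python
import torch.nn as nn

# A square QKV maps hidden -> 3*hidden. A non-square QKV projects into a
# narrower attention width, shrinking both weights and activation memory.
hidden, attn_width = 3072, 2048                 # attn_width is illustrative
qkv_square = nn.Linear(hidden, 3 * hidden)
qkv_rect = nn.Linear(hidden, 3 * attn_width)
print(sum(p.numel() for p in qkv_square.parameters()))  # 28,320,768
print(sum(p.numel() for p in qkv_rect.parameters()))    # 18,880,512
```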
Context length isn’t specified, but the practical output is video clips up to roughly 5.4 seconds at 480p native resolution (HD in beta). The model generates 31 frames at 848×480 resolution in the standard configuration, with support for up to 84 frames depending on VRAM.
Mochi 1 Preview’s primary capability is text-to-video generation with an emphasis on motion quality. The model was specifically engineered to avoid the “rubber-man” distortion, the warping and stretching artifacts that plague many video generation models when objects move quickly. That makes it best suited to shots where motion fidelity matters more than per-frame polish.
The model outputs video at 480p natively, with higher resolutions in beta. Clip length maxes out around 5.4 seconds in the standard configuration. If you need longer clips, you’ll need to chain generations or use frame interpolation externally.
This is where Mochi 1 Preview separates itself from closed models — you can run it on consumer hardware, not just datacenter GPUs.
VRAM requirements depend on quantization and precision. As a rough floor, the arithmetic below estimates the memory for the transformer weights alone; the T5-XXL encoder, VAE, and activations add to it.
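```python
# Weights-only memory floor for the dense 10B transformer at common precisions.
PARAMS = 10e9
for dtype, bytes_per in [("fp32", 4), ("bf16/fp16", 2), ("int8/fp8", 1)]:
    print(f"{dtype:9s}: {PARAMS * bytes_per / 2**30:5.1f} GiB")
# fp32     :  37.3 GiB
# bf16/fp16:  18.6 GiB
# int8/fp8 :   9.3 GiB
```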
Consumer GPU compatibility follows from those numbers: bf16 weights alone fit in 24GB, but once the T5-XXL encoder and activation memory are added, a 24GB card like the RTX 4090 is workable only with CPU offloading and VAE tiling, as described in the setup notes below.
Expected performance: At 480p resolution with 31 frames and 64 inference steps, expect 2-5 minutes per clip on an RTX 4090 with bfloat16. The primary bottleneck is the attention computation in the 10B parameter transformer — this is not a real-time model. Throughput is measured in clips per hour, not clips per second.
Quickest path to running locally: Use the Hugging Face Diffusers integration. Install the latest diffusers from source, load the model with variant="bf16" and torch_dtype=torch.bfloat16, enable model CPU offloading and VAE tiling, and you’ll get a working pipeline on a 24GB GPU. The Genmo repository also provides a Gradio UI and CLI for direct use.
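A minimal sketch of that setup using the Diffusers MochiPipeline; the prompt and fps are illustrative, and current defaults may differ, so check the Diffusers docs:

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

# Load in bf16 as described above; requires a recent/source install of diffusers.
pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
)

# Memory savers for a 24GB card: offload idle components to CPU,
# decode the VAE in tiles instead of one large tensor.
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "A corgi running through shallow surf at golden hour, slow motion"
frames = pipe(prompt, num_frames=31, num_inference_steps=64).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
```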
Memory optimization tips: Enable model CPU offloading (enable_model_cpu_offload()) to move components to CPU when not in use. Enable VAE tiling (enable_vae_tiling()) to decode the VAE in chunks. Use fewer frames (31 instead of 84) to reduce sequence length. Reduce inference steps from 64 to 32 for faster generation at lower quality, as in the draft pass sketched below.
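For instance, the faster draft pass those tips describe, reusing pipe and prompt from the sketch above:

```python
# Draft-quality pass: halving the step count roughly halves wall-clock time
# at a visible cost in fidelity.
frames = pipe(prompt, num_frames=31, num_inference_steps=32).frames[0]
export_to_video(frames, "mochi_draft.mp4", fps=30)
```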
vs. CogVideo (9B parameters): Both are open video generation models at similar parameter counts. CogVideo takes a different architectural path, allocating capacity symmetrically across streams rather than favoring the visual side, and targets similar use cases. Mochi 1’s asymmetric design gives it an edge in motion coherence: the visual stream gets more capacity specifically for temporal dynamics. CogVideo tends to produce more aesthetically pleasing static frames, but Mochi 1 handles motion better. Choose CogVideo if you care more about frame quality; choose Mochi 1 if motion fidelity matters.
vs. Luma Dream Machine (closed API): Luma’s cloud model produces higher-quality video overall, particularly at higher resolutions and longer clips. But you can’t run it locally — it’s API-only. Mochi 1 gives you local control, no per-generation costs, and the ability to fine-tune or modify the pipeline. The tradeoff is quality and convenience. If you need production-quality video now and have the budget, Luma wins. If you need to integrate video generation into a local pipeline or avoid API costs, Mochi 1 is the better choice.
vs. Sora (closed, unreleased publicly): Mochi 1 is the closest open alternative to Sora’s demonstrated capabilities. Sora reportedly uses a similar diffusion transformer architecture at larger scale. Mochi 1 won’t match Sora’s quality, but it runs on your hardware under a permissive license. For practitioners who want to experiment with state-of-the-art video generation without waiting for access or paying per generation, Mochi 1 is currently the best option in the open ecosystem.