
10B parameter Asymmetric Diffusion Transformer (AsymmDiT) with 362M AsymmVAE achieving 128× compression. Visual stream (dim 3072) gets 4× more params than text stream (dim 1536).
A solid 10B-parameter dense video generator from Genmo AI. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Mochi 1 Preview is Genmo AI’s open-source video generation model that directly targets the gap between proprietary systems like OpenAI’s Sora and what’s available to run on your own hardware. At 10 billion parameters, it’s not a lightweight toy — it’s a serious video generation model designed for high-fidelity motion synthesis and strong prompt adherence, released under Apache 2.0 license.
This is a text-to-video model, meaning you feed it a text prompt and it outputs a video clip. The architecture is an Asymmetric Diffusion Transformer (AsymmDiT), which represents a deliberate design choice: allocate more compute to the visual stream than the text stream. The visual stream operates at dimension 3072, while the text stream runs at dimension 1536 — roughly 4× more parameters dedicated to video processing. That asymmetry is the key to why Mochi 1 produces coherent motion rather than the warping artifacts common in other open video models.
Mochi 1 Preview competes directly with closed video generation APIs and with other open models like CogVideo. Its claim is straightforward: you can run state-of-the-art video generation locally, not just stream it from a cloud endpoint. That matters for practitioners who need control over their pipeline, privacy for their prompts, or simply want to avoid per-generation costs.
The Asymmetric Diffusion Transformer is the headline feature. Most diffusion transformers allocate equal capacity to text and visual processing. Mochi 1 doesn’t — it dedicates 4× more parameters to the visual stream (dim 3072) than the text stream (dim 1536). This isn’t an accident; video generation is fundamentally a visual problem, and the asymmetry reflects that priority.
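A toy sketch of that two-stream split helps make it concrete. The stream widths below come from the model card; the joint-attention layout, head count, and token counts are illustrative assumptions, not Genmo's actual code:

```python
import torch
import torch.nn as nn

# Visual stream is 2x wider than the text stream; since projection params
# scale with width squared, that's roughly 4x the parameters per layer.
VIS_DIM, TXT_DIM = 3072, 1536

class AsymmJointAttention(nn.Module):
    def __init__(self, heads: int = 24):
        super().__init__()
        self.txt_in = nn.Linear(TXT_DIM, VIS_DIM)    # lift text into the visual width
        self.attn = nn.MultiheadAttention(VIS_DIM, heads, batch_first=True)
        self.txt_out = nn.Linear(VIS_DIM, TXT_DIM)   # project text back down

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        x = torch.cat([vis, self.txt_in(txt)], dim=1)  # one joint token sequence
        y, _ = self.attn(x, x, x)
        n_vis = vis.shape[1]
        return y[:, :n_vis], self.txt_out(y[:, n_vis:])

vis = torch.randn(1, 256, VIS_DIM)   # toy video-latent tokens
txt = torch.randn(1, 77, TXT_DIM)    # toy text tokens
v, t = AsymmJointAttention()(vis, txt)
print(v.shape, t.shape)              # (1, 256, 3072) (1, 77, 1536)
```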
The model uses a 362M parameter Asymmetric VAE (AsymmVAE) that achieves 128× compression. High compression matters for video because raw video data is enormous — you need aggressive latent space reduction to make training and inference tractable. The VAE encodes video frames into a compressed latent representation, the diffusion transformer processes that latent, and the VAE decodes the result back into pixel space.
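To see what 128× buys you, here is the back-of-envelope arithmetic for the standard 31-frame, 848×480 clip. The 128× ratio is the model card's number; storing values in bf16 is an illustrative assumption:

```python
# Raw clip vs. latent size for Mochi's standard 31-frame 848x480 output.
frames, height, width, channels = 31, 480, 848, 3
raw_values = frames * height * width * channels
latent_values = raw_values // 128          # AsymmVAE's stated compression
print(f"raw:    {raw_values:>12,} values ({raw_values * 2 / 2**20:.0f} MiB in bf16)")
print(f"latent: {latent_values:>12,} values ({latent_values * 2 / 2**20:.1f} MiB in bf16)")
# raw:      37,854,720 values (72 MiB in bf16)
# latent:      295,740 values (0.6 MiB in bf16)
```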
Mochi 1 is a dense model, not Mixture of Experts. All 10B parameters are active during inference. There’s no routing or conditional computation — every forward pass uses the full parameter count. That means VRAM requirements are straightforward: you need enough memory to load the entire model, plus activation memory for the sequence length you’re generating.
The model uses non-square QKV and output projection layers to reduce inference memory requirements. This is an implementation detail that matters for local deployment — it lowers the memory footprint without sacrificing quality. A single T5-XXL model handles text encoding, which adds its own VRAM cost when running locally.
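To make "non-square" concrete, here is a minimal PyTorch comparison; the narrower attention width is an illustrative assumption, not Mochi's actual value:

```python
import torch.nn as nn

# A square QKV maps hidden -> 3*hidden. A non-square QKV projects into a
# narrower attention width, shrinking both weights and activation memory.
hidden, attn_width = 3072, 2048                 # attn_width is illustrative
qkv_square = nn.Linear(hidden, 3 * hidden)
qkv_rect = nn.Linear(hidden, 3 * attn_width)
print(sum(p.numel() for p in qkv_square.parameters()))  # 28,320,768
print(sum(p.numel() for p in qkv_rect.parameters()))    # 18,880,512
```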
Context length isn’t specified, but the practical output is video clips up to roughly 5.4 seconds at 480p native resolution (HD in beta). The model generates 31 frames at 848×480 resolution in the standard configuration, with support for up to 84 frames depending on VRAM.
Mochi 1 Preview’s primary capability is text-to-video generation with an emphasis on motion quality. The model was specifically engineered to avoid the “rubber-man” distortion, the warping and stretching artifacts that plague many video generation models when objects move quickly. That makes it best suited to shots where motion fidelity matters more than per-frame polish.
The model outputs video at 480p natively, with higher resolutions in beta. Clip length maxes out around 5.4 seconds in the standard configuration. If you need longer clips, you’ll need to chain generations or use frame interpolation externally.
This is where Mochi 1 Preview separates itself from closed models — you can run it on consumer hardware, not just datacenter GPUs.
VRAM requirements depend on quantization and precision. As a rough floor, the arithmetic below estimates the memory for the transformer weights alone; the T5-XXL encoder, VAE, and activations add to it.
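```python
# Weights-only memory floor for the dense 10B transformer at common precisions.
PARAMS = 10e9
for dtype, bytes_per in [("fp32", 4), ("bf16/fp16", 2), ("int8/fp8", 1)]:
    print(f"{dtype:9s}: {PARAMS * bytes_per / 2**30:5.1f} GiB")
# fp32     :  37.3 GiB
# bf16/fp16:  18.6 GiB
# int8/fp8 :   9.3 GiB
```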
Consumer GPU compatibility follows from those numbers: bf16 weights alone fit in 24GB, but once the T5-XXL encoder and activation memory are added, a 24GB card like the RTX 4090 is workable only with CPU offloading and VAE tiling, as described in the setup notes below.
Expected performance: At 480p resolution with 31 frames and 64 inference steps, expect 2-5 minutes per clip on an RTX 4090 with bfloat16. The primary bottleneck is the attention computation in the 10B parameter transformer — this is not a real-time model. Throughput is measured in clips per hour, not clips per second.
Quickest path to running locally: Use the Hugging Face Diffusers integration. Install the latest diffusers from source, load the model with variant="bf16" and torch_dtype=torch.bfloat16, enable model CPU offloading and VAE tiling, and you’ll get a working pipeline on a 24GB GPU. The Genmo repository also provides a Gradio UI and CLI for direct use.
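A minimal sketch of that setup using the Diffusers MochiPipeline; the prompt and fps are illustrative, and current defaults may differ, so check the Diffusers docs:

```python
import torch
from diffusers import MochiPipeline
from diffusers.utils import export_to_video

# Load in bf16 as described above; requires a recent/source install of diffusers.
pipe = MochiPipeline.from_pretrained(
    "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
)

# Memory savers for a 24GB card: offload idle components to CPU,
# decode the VAE in tiles instead of one large tensor.
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

prompt = "A corgi running through shallow surf at golden hour, slow motion"
frames = pipe(prompt, num_frames=31, num_inference_steps=64).frames[0]
export_to_video(frames, "mochi.mp4", fps=30)
```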
Memory optimization tips: Enable model CPU offloading (enable_model_cpu_offload()) to move components to CPU when not in use. Enable VAE tiling (enable_vae_tiling()) to decode the VAE in chunks. Use fewer frames (31 instead of 84) to reduce sequence length. Reduce inference steps from 64 to 32 for faster generation at lower quality, as in the draft pass sketched below.
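For instance, the faster draft pass those tips describe, reusing pipe and prompt from the sketch above:

```python
# Draft-quality pass: halving the step count roughly halves wall-clock time
# at a visible cost in fidelity.
frames = pipe(prompt, num_frames=31, num_inference_steps=32).frames[0]
export_to_video(frames, "mochi_draft.mp4", fps=30)
```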
vs. CogVideo (9B parameters): Both are open video generation models at similar parameter counts. CogVideo takes a different architectural path, allocating capacity symmetrically across streams rather than favoring the visual side, and targets similar use cases. Mochi 1’s asymmetric design gives it an edge in motion coherence: the visual stream gets more capacity specifically for temporal dynamics. CogVideo tends to produce more aesthetically pleasing static frames, but Mochi 1 handles motion better. Choose CogVideo if you care more about frame quality; choose Mochi 1 if motion fidelity matters.
vs. Luma Dream Machine (closed API): Luma’s cloud model produces higher-quality video overall, particularly at higher resolutions and longer clips. But you can’t run it locally — it’s API-only. Mochi 1 gives you local control, no per-generation costs, and the ability to fine-tune or modify the pipeline. The tradeoff is quality and convenience. If you need production-quality video now and have the budget, Luma wins. If you need to integrate video generation into a local pipeline or avoid API costs, Mochi 1 is the better choice.
vs. Sora (closed, unreleased publicly): Mochi 1 is the closest open alternative to Sora’s demonstrated capabilities. Sora reportedly uses a similar diffusion transformer architecture at larger scale. Mochi 1 won’t match Sora’s quality, but it runs on your hardware under a permissive license. For practitioners who want to experiment with state-of-the-art video generation without waiting for access or paying per generation, Mochi 1 is currently the best option in the open ecosystem.