First major open-source foundation model generating fully synchronized 4K video and native audio in a single forward pass. 14B visual + 5B audio params; ~18× faster than Wan 2.2 14B on H100.
LTX-2 19B is Lightricks' open-source audio-video foundation model that generates fully synchronized 4K video and native audio in a single forward pass. Released in January 2026, it represents the first production-ready open-weight model capable of joint video and audio generation without requiring separate post-production pipelines.
The model uses a dense 19B parameter architecture split asymmetrically: 14B parameters dedicated to visual generation and 5B to audio synthesis. This design allows the model to produce up to 20 seconds of 4K video at 50 frames per second with matching sound effects, dialogue, and ambient audio—all generated simultaneously and automatically synchronized.
Lightricks, the Israel-based company behind consumer apps like Facetune and Videoleap, released LTX-2 under an Apache 2.0-compatible community license. Commercial use is free for companies generating less than $10 million in annual revenue. The full model weights, training code, and documentation are publicly available on GitHub and HuggingFace.
LTX-2 uses a Diffusion Transformer (DiT) architecture with two specialized processing streams that communicate through bidirectional cross-attention layers. The video stream handles spatial detail, motion consistency, and temporal coherence. The audio stream manages sound generation, dialogue timing, and environmental audio. This cross-attention mechanism ensures audio events align precisely with visual cues—when a door closes on screen, the sound occurs at the exact moment; when characters speak, lip movements sync with dialogue automatically.
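The two-stream design can be sketched with a toy PyTorch module (dimensions, head counts, and layer structure here are illustrative assumptions, not Lightricks' actual implementation): each stream's tokens attend over the other stream's tokens via standard cross-attention, in both directions.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Toy two-stream block: video tokens attend to audio tokens and
    vice versa. All sizes are illustrative only."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, audio):
        # Video queries attend over audio keys/values (and vice versa),
        # so each stream's update is conditioned on the other modality.
        v_out, _ = self.audio_to_video(query=video, key=audio, value=audio)
        a_out, _ = self.video_to_audio(query=audio, key=video, value=video)
        return video + v_out, audio + a_out

block = BidirectionalCrossAttention()
video_tokens = torch.randn(1, 128, 256)  # (batch, video tokens, dim)
audio_tokens = torch.randn(1, 64, 256)   # (batch, audio tokens, dim)
v, a = block(video_tokens, audio_tokens)
print(tuple(v.shape), tuple(a.shape))    # (1, 128, 256) (1, 64, 256)
```

Because attention runs in both directions, audio timing can condition on visual motion and vice versa, which is what makes single-pass synchronization possible without a separate alignment step.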
The model processes inputs through modality-specific VAEs (Variational Autoencoders) that compress raw signals into efficient latent representations. This compression achieves a 1:192 ratio, allowing the model to handle high-resolution content without excessive memory requirements.
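To see what a 1:192 compression ratio buys, a rough back-of-the-envelope calculation (the numbers and latent layout here are illustrative; the model card does not document the exact latent shape):

```python
# Rough arithmetic for the 1:192 VAE compression ratio.
# Raw signal: 1 second of 4K RGB video at 50 fps.
width, height, channels, fps, seconds = 3840, 2160, 3, 50, 1

raw_values = width * height * channels * fps * seconds
latent_values = raw_values // 192  # 1:192 compression of raw values

print(f"raw values:    {raw_values:,}")     # 1,244,160,000
print(f"latent values: {latent_values:,}")  # 6,480,000

# The transformer operates on the ~192x smaller latent tensor,
# which is what keeps 4K generation within feasible memory budgets.
```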
LTX-2 supports multiple generation modes, including text-to-video and image-to-video, through a single unified architecture.
The model ships with several checkpoint variants optimized for different use cases. The ltx-2-19b-dev variant is the full model in bf16 precision, suitable for fine-tuning and maximum quality. Quantized versions (fp8, fp4) reduce VRAM requirements for inference-only deployments. The ltx-2-19b-distilled checkpoint uses 8 inference steps with CFG=1, dramatically reducing generation time for applications where absolute maximum quality is less critical than throughput.
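A rough step-count comparison shows where the distilled checkpoint's speedup comes from (the full model's step count below is an assumption for illustration; actual defaults may differ):

```python
# Denoising passes dominate generation time, so fewer passes ~ proportional speedup.
full_steps = 40       # assumed default for the full dev checkpoint (illustrative)
distilled_steps = 8   # documented for ltx-2-19b-distilled

# With CFG > 1, each step runs two forward passes (conditional + unconditional);
# CFG=1 needs only one, which widens the gap further.
full_passes = full_steps * 2       # CFG enabled
distilled_passes = distilled_steps * 1  # CFG=1

print(f"~{full_passes / distilled_passes:.0f}x fewer forward passes")  # ~10x
```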
Spatial and temporal upscaler models enable multi-stage pipelines for higher resolution and framerate outputs. The x2 spatial upscaler operates on LTX latents to increase resolution, while the x2 temporal upscaler increases frames per second.
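As a concrete example of how a multi-stage pipeline scales its output, chaining a base pass with the x2 spatial and x2 temporal upscalers works out as follows (the base resolution and framerate are chosen for illustration):

```python
def apply_spatial_x2(w, h, fps):
    """x2 spatial upscaler: doubles width and height, framerate unchanged."""
    return w * 2, h * 2, fps

def apply_temporal_x2(w, h, fps):
    """x2 temporal upscaler: doubles framerate, resolution unchanged."""
    return w, h, fps * 2

# Illustrative base pass at 1920x1080 / 25 fps, then both upscalers.
w, h, fps = 1920, 1080, 25
w, h, fps = apply_spatial_x2(w, h, fps)   # -> 3840x2160
w, h, fps = apply_temporal_x2(w, h, fps)  # -> 50 fps
print(f"{w}x{h} @ {fps} fps")  # 3840x2160 @ 50 fps
```

Generating at a lower base resolution and upscaling in latent space is cheaper than generating 4K at 50 fps directly, since the expensive transformer passes run on the smaller base latents.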
LTX-2 excels at generating short-form video content with synchronized audio in a single workflow. The primary use cases for local deployment include:
Content creation: Generate b-roll footage, product demonstrations, or atmospheric clips with matching ambient audio. The image-to-video mode is particularly useful for animating reference images into dynamic scenes.
Game development: Produce placeholder or prototype video sequences with sound effects and environmental audio. The 4K output resolution supports integration into higher-fidelity production pipelines.
Video prototyping: Quickly generate test footage with audio before committing to full production. The distilled checkpoint's fast inference enables iterative prototyping on consumer hardware.
Audio-visual research: The open weights and training code enable experimentation with fine-tuning, LoRA adaptation, and custom training pipelines. The ltx-2-19b-distilled-lora-384 checkpoint provides a lightweight entry point for parameter-efficient fine-tuning.
The model's text-to-video and image-to-video capabilities accept English prompts. Multi-language support is listed in the model card metadata, but performance characteristics for non-English inputs are not documented.
LTX-2 19B requires substantial GPU memory for local inference. The exact VRAM requirements depend on the checkpoint variant and quantization level chosen.
| Checkpoint | Precision | Est. VRAM | Best For |
|------------|-----------|-----------|----------|
| ltx-2-19b-dev | bf16 | 40-48GB | Fine-tuning, maximum quality |
| ltx-2-19b-dev-fp8 | fp8 | 24-32GB | High-quality inference on H100/A100 |
| ltx-2-19b-dev-fp4 | nvfp4 | 16-20GB | Consumer GPU inference |
| ltx-2-19b-distilled | bf16 | 32-40GB | Fast inference, production use |
For consumer GPUs, the fp4 quantized checkpoint (ltx-2-19b-dev-fp4) is the most practical starting point. An RTX 4090 (24GB) can run this variant, though generation times will be longer than on datacenter hardware. The M4 Max (with unified memory configurations of 64GB or 128GB) provides another viable option for macOS workflows, though CUDA-capable NVIDIA GPUs generally deliver faster generation.
For most users running LTX-2 19B on consumer or prosumer hardware, the ltx-2-19b-dev-fp4 checkpoint offers the best balance between quality and accessibility. The fp8 variant is preferable if you have access to an A100 or H100, as it preserves more quality while still reducing VRAM compared to bf16.
The distilled checkpoint (ltx-2-19b-distilled) with 8 inference steps provides the fastest path to generated output, at the cost of some visual fidelity. For applications where iteration speed matters more than absolute quality—prototyping, batch generation, or real-time previews—the distilled version is the recommended choice.
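The VRAM table above can be encoded as a simple lookup helper; the thresholds mirror the estimated upper bounds in the table (estimates, not measured requirements), and the helper itself is a hypothetical convenience, not part of the LTX-2 tooling:

```python
# Estimated upper-bound VRAM (GB) per checkpoint, taken from the table above.
CHECKPOINTS = [
    ("ltx-2-19b-dev-fp4", 20),    # nvfp4, consumer GPUs
    ("ltx-2-19b-dev-fp8", 32),    # fp8, H100/A100
    ("ltx-2-19b-distilled", 40),  # bf16, fast inference
    ("ltx-2-19b-dev", 48),        # bf16, fine-tuning / max quality
]

def checkpoints_that_fit(vram_gb):
    """Return checkpoint names whose estimated VRAM ceiling fits the GPU."""
    return [name for name, need in CHECKPOINTS if need <= vram_gb]

print(checkpoints_that_fit(24))  # ['ltx-2-19b-dev-fp4']
print(checkpoints_that_fit(80))  # all four variants
```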
On an H100 SXM, LTX-2 19B generates video approximately 18× faster than Wan 2.2 14B. Specific tokens-per-second or generation-time figures vary significantly based on resolution, duration, and checkpoint selection. The distilled checkpoint with 8 steps will complete generation substantially faster than the full model requiring more denoising steps.
ComfyUI: The recommended path for most users. Install the LTXVideo nodes through ComfyUI Manager for a visual workflow interface. Lightricks maintains official documentation for this integration.
Diffusers: LTX-2 is supported in the HuggingFace Diffusers library for programmatic access. The two-stage generation pipeline is recommended for production quality.
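A minimal sketch of programmatic access via Diffusers. The repository id, step count, and output handling below are assumptions built on the generic `DiffusionPipeline` API, not confirmed details for LTX-2; consult the official model card for the exact pipeline class and the recommended two-stage workflow.

```python
def generate_clip(prompt: str, out_path: str = "clip.mp4"):
    """Hypothetical single-stage generation sketch. Imports live inside the
    function so this file can be read or imported without diffusers installed."""
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    # Repo id is an assumption -- substitute the actual HuggingFace model id.
    pipe = DiffusionPipeline.from_pretrained(
        "Lightricks/LTX-2", torch_dtype=torch.bfloat16
    ).to("cuda")

    # Generate frames and write them out; fps matches the model's 50 fps output.
    frames = pipe(prompt=prompt, num_inference_steps=30).frames[0]
    export_to_video(frames, out_path, fps=50)

# Usage (requires a CUDA GPU with sufficient VRAM and the model weights):
# generate_clip("A door slowly closes in an empty hallway, ambient room tone")
```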
Direct PyTorch: Clone the official repository and follow the installation instructions. Requirements include Python ≥3.12, CUDA >12.7, and PyTorch ~=2.7. The monorepo includes model definitions (ltx-core), pipelines (ltx-pipelines), and training capabilities (ltx-trainer).
Ollama: Official Ollama support for LTX-2 has not yet landed. If it arrives, it would offer the fastest path to local experimentation, requiring only a single command after downloading the model weights. Check the Ollama model library for availability and updated instructions.
LTX-2 19B vs. Wan 2.2 14B: Wan 2.2 is a video-only model without native audio generation. LTX-2's primary advantage is the unified audio-video pipeline that eliminates separate audio generation and synchronization workflows. On H100 hardware, LTX-2 is approximately 18× faster for comparable video quality. For projects requiring audio, LTX-2 is the clear choice. For video-only applications where audio is unnecessary, Wan 2.2 may offer simpler deployment.
LTX-2 19B vs. CogVideoX: CogVideoX is another open-source video generation option, but lacks native audio generation. The architecture differences mean each model has distinct strengths in motion handling, prompt adherence, and visual quality. LTX-2's DiT-based design with cross-modal attention provides tighter audio-visual synchronization than models that generate audio separately.
The choice between these models depends on your requirements. If you need synchronized audio in a single pass, LTX-2 19B is the only open-source option at this capability level. If you require video-only generation and have specific architecture preferences, CogVideoX or Wan 2.2 may be worth evaluating for your particular use case.