LTX-2.3 ComfyUI Guide: Setup, Workflow, and Video & Audio Generation

In this article, I'll introduce how to use the AI video model "LTX Video 2.3." LTX Video is a video generation model, available both through an API and as open weights, released by 🔗Lightricks, the Israeli company famous for its apps Videoleap and Photoleap. A key feature is that it is relatively lightweight to run despite its DiT-based architecture, which makes it a strong option alongside Wan 2.2 and Hunyuan. LTX-2.3, released in March 2026 and covered in this article, is a 22B-parameter open-weight model that generates video and audio simultaneously, and it offers some of the best performance among current open-source models. This article walks through the generation steps using the official ComfyUI template. Let's first cover the basics of LTX-2.3, then expand the workflow to aim for "high-quality video."
What You'll Learn in This Article
- The features of LTX-2.3 and how it differs from Wan 2.2 and Hunyuan
- How to set up LTX-2.3 in ComfyUI and download the required models
- How to use the official ComfyUI "LTX-2.3: Image to Video" template
- How to write prompts for LTX-2.3, plus some tips
- A comparison of generation speed and quality with WAN2.2
- How to use a custom workflow aimed at higher-quality video (💎 Members only)
- How to deal with noise and color drift that occur during high-resolution generation (💎 Members only)
- Verification results for Japanese voice generation with LTX-2.3 (💎 Members only)
What Is LTX-2.3?
"LTX-2.3" is a large-scale, 22B-parameter video generation model that adopts an asymmetric dual-stream Diffusion Transformer (DiT) architecture, combining the video stream and audio stream through bidirectional cross-attention. Note that some explainer sites list figures that don't add up, such as "video 14B + audio 5B = 22B total" (14B + 5B is only 19B), which may be a case of confusion between the specs of the older model and the total parameter count of the new one. No more detailed official breakdown has been published, but the latest LTX-2.3 is a 22B (22 billion parameter) model overall. Given text or a reference image as input, it can generate high-quality video quickly, and the ability to output video and audio simultaneously in a single inference pass is a major point of differentiation from other open-source models.
- Supported tasks: Text-to-video (T2V), image-to-video (I2V), and text-to-audio (T2A).
- Supported resolutions: Depending on your hardware, it natively supports resolutions in multiples of 32, and can go up to native 4K after upscaling. Vertical format (9:16) is also natively supported up to 1080x1920. Depending on your settings, it can generate up to about 20 seconds of video.
- Model variants: Officially, there are two variants: the full model (BF16 precision, intended for fine-tuning) and a distilled model (8-step fast inference).
- Required GPU memory: The BF16 full model requires at least 32GB (recommended: RTX5090 / A100 80GB / H100). The fp8 quantized version (used in this article) requires at least 16GB.
- Output frame rate: Choose from 24fps (cinematic), 25fps (standard), or 30fps (smooth). The frame count is calculated as "duration × fps + 1" (e.g., 5 seconds × 25fps + 1 = 126 frames).
- License: Released under the Apache 2.0 license.
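The frame-count rule and the multiple-of-32 constraint from the list above are easy to check with a few lines of Python. This is only a minimal sketch: `frame_count` and `snap_to_multiple` are hypothetical helper names, not part of ComfyUI or LTX-2.3, and rounding down to the nearest multiple is my own assumption about how you might want to adjust an off-size dimension.

```python
def frame_count(duration_s: float, fps: int) -> int:
    """Number of frames for a clip, using the 'duration x fps + 1' rule."""
    return round(duration_s * fps) + 1


def snap_to_multiple(value: int, base: int = 32) -> int:
    """Round a pixel dimension down to the nearest multiple of `base`."""
    return max(base, (value // base) * base)


if __name__ == "__main__":
    print(frame_count(5, 25))      # 126, the example from the list above
    print(snap_to_multiple(720))   # 704
    print(snap_to_multiple(1280))  # 1280 (already a multiple of 32)
```

For example, a 720 x 1280 vertical target would become 704 x 1280 before being entered as the width and height.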
Comparison with the Previous Version
The main changes from LTX 2.0 (released October 2025) include a revamped VAE that improves fine detail clarity, a text connector expanded fourfold to improve prompt adherence, and a new HiFi-GAN-based vocoder that delivers clean audio generation with stereo 24kHz output. It also fixed the excessive Ken Burns effect that had been an issue in I2V, and now supports native generation of vertical (9:16) videos.
Setting Up LTX-2.3
To run the LTX-2.3 workflow, you'll need to update ComfyUI to the latest version. (At the time of writing, the ComfyUI version is 🔗v0.25.0.) ⚠️ If you're updating ComfyUI in an environment where SageAttention is already installed, be careful about Torch/CUDA version compatibility. I've added update instructions to the article below, so please check that too.
Exploring the ComfyUI LTX-2.3: Image to Video Template
The "LTX-2.3: Image to Video" template uses a two-stage generation setup (low resolution → Spatial Upscale → high-resolution refinement). The main nodes are grouped inside the "Image to Video (LTX-2.3)" subgraph.
First, let's check out the "LTX-2.3: Image to Video" workflow example from the template.
Open the template list and select Video from GENERATION TYPE in the left-side menu.
Video-related templates will appear, so select "LTX-2.3: Image to Video."
When you open the template, any missing models will be shown. If you download them as instructed, you can run it as is. ✅ If you're not sure how, the next section explains this in detail.
The official documentation is below.

Downloading the Models
Model files need to be placed in the correct folders. Download the files below and place them in the specified folders under ComfyUI/models. When you open the workflow in ComfyUI, download buttons will also appear on the nodes, so you can get them from there as well.
Example placement:
ComfyUI/
├── 📁 models/
│ ├── 📁 checkpoints/
│ │ └── ltx-2.3-22b-dev-fp8.safetensors
│ ├── 📁 text_encoders/
│ │ └── gemma_3_12B_it_fp4_mixed.safetensors
│ ├── 📁 loras/
│ │ ├── ltx_2.3_22b_distilled_1.1_lora_dynamic_fro09_avg_rank_111_bf16.safetensors
│ │ └── gemma-3-12b-it-abliterated_lora_rank64_bf16.safetensors
│ └── 📁 latent_upscale_models/
│ └── ltx-2.3-spatial-upscaler-x2-1.1.safetensors
✅ The Text Encoder file gemma_3_12B_it_fp4_mixed.safetensors is Gemma 3 12B quantized with FP4 mixed precision. FP4 uses even fewer bits than FP8, but LTX-2.3's LTXAVTextEncoderLoader supports it and is designed to keep the impact on quality to a minimum. The FP8 version (gemma_3_12B_it_fp8_scaled.safetensors) can also be used with the same node.
✅ There are 1.5x and 2x upscaler models. The official workflow uses the 2x model.
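If you want to confirm the placement before opening the workflow, a small script can list whatever is still missing. This is a minimal sketch using only the file names from the placement example above; `missing_models` is a hypothetical helper, and the `"ComfyUI"` root path is an assumption you should replace with your own install location.

```python
from pathlib import Path

# Relative paths copied from the placement example above.
EXPECTED_MODELS = [
    "checkpoints/ltx-2.3-22b-dev-fp8.safetensors",
    "text_encoders/gemma_3_12B_it_fp4_mixed.safetensors",
    "loras/ltx_2.3_22b_distilled_1.1_lora_dynamic_fro09_avg_rank_111_bf16.safetensors",
    "loras/gemma-3-12b-it-abliterated_lora_rank64_bf16.safetensors",
    "latent_upscale_models/ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
]


def missing_models(comfy_root: str) -> list[str]:
    """Return the expected files that are not present under <comfy_root>/models."""
    models_dir = Path(comfy_root) / "models"
    return [rel for rel in EXPECTED_MODELS if not (models_dir / rel).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is an assumed install folder; point this at your own.
    for rel in missing_models("ComfyUI"):
        print(f"missing: models/{rel}")
```

An empty result means every file from the example is in place. It only checks file names, not whether the downloads are complete.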
Input Assets for the Image to Video (LTX-2.3) Subgraph
The two input images used in Image to Video (LTX-2.3) weren't included in the official documentation. If you open it in ComfyUI Cloud, you can download the images.
If you just want the assets, I've also uploaded the same files to the drive below.

Nodes in the LTX-2.3: Image to Video Template
The main nodes are arranged inside the "Image to Video (LTX-2.3)" subgraph, so let's go through them in order.
🔳 Video Settings Group
Switch to Text to Video?
This node switches between I2V and T2V. You can toggle the boolean between True/False, and setting it to True bypasses the input image so you can use it as T2V.
🔳 Model Group
Load LoRA
This node applies LoRA to the MODEL only. It applies the distilled LoRA (ltx_2.3_22b_distilled_1.1_lora_dynamic_fro09_avg_rank_111_bf16.safetensors) to the video Diffusion model at a strength of 0.5, making it possible to generate high-quality video with as few as about 8 steps. Since it doesn't affect CLIP, the text encoding behavior doesn't change.
Load LTXV Audio VAE
This node loads the audio VAE from a checkpoint. In LTX-2.3, the audio VAE is bundled inside the main checkpoint (ltx-2.3-22b-dev-fp8.safetensors), so you specify the same file. The output audio VAE is connected to LTXVEmptyLatentAudio and LTXVAudioVAEDecode.
LTXV Audio Text Encoder Loader
This node loads a text encoder that supports both the video and audio streams. Specify the Gemma file (gemma_3_12B_it_fp4_mixed.safetensors) in the Text Encoder field and the main model in the Checkpoint field, and the text projection layer is retrieved automatically. The output CLIP is used to encode both the Positive and Negative prompts.
Load Latent Upscale Model
This loads ltx-2.3-spatial-upscaler-x2-1.1.safetensors. It's used to scale up the low-resolution latents generated in Stage 1 by 2x for Stage 2.
🔳 Image Preprocess Group
LTXVPreprocess
This node normalizes the reference image into LTX-2.3's input format for I2V. It scales the pixel values and applies compression via the img_compression parameter (0-100, default 35), shaping the reference image so the model can more easily embed it into the latent space.
The img_compression value is actually the CRF value for H.264 (libx264) encoding — the mechanism re-encodes and decodes the reference image as a single-frame MP4 video to artificially add compression artifacts. The higher the value, the stronger the compression, which increases block noise and detail loss; the lower the value, the closer the image quality stays to the original (setting 0 skips the compression process entirely and outputs the original image as is). The reasoning given in the community for setting the value higher is that "since LTX-family models are trained on video data that contains compression noise, adding compression noise to the reference image as well makes it easier for the model to recognize it as a single frame of video rather than a still image" — but note that this is anecdotal and not an officially confirmed mechanism.
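To get a feel for what a given value does to an image, you can approximate the same round trip outside ComfyUI with the ffmpeg command-line tool. The sketch below only builds the two commands (still image to a one-frame H.264 clip, then back to a still) and does not run them. It is an approximation under my own assumptions, not the node's actual code: `crf_roundtrip_commands` and the file names are hypothetical, ffmpeg must be installed separately, and the `yuv420p` pixel format expects even image dimensions.

```python
def crf_roundtrip_commands(
    src_png: str, tmp_mp4: str, dst_png: str, crf: int = 35
) -> list[list[str]]:
    """Build ffmpeg commands that re-encode one image as a single H.264 frame.

    Mirrors the behavior described above: a value of 0 skips compression,
    so no commands are returned in that case.
    """
    if crf == 0:
        return []
    encode = [
        "ffmpeg", "-y", "-i", src_png,
        "-frames:v", "1",          # write exactly one video frame
        "-c:v", "libx264",
        "-crf", str(crf),          # higher value = stronger compression
        "-pix_fmt", "yuv420p",
        tmp_mp4,
    ]
    decode = ["ffmpeg", "-y", "-i", tmp_mp4, "-frames:v", "1", dst_png]
    return [encode, decode]


if __name__ == "__main__":
    for cmd in crf_roundtrip_commands("reference.png", "tmp.mp4", "degraded.png"):
        print(" ".join(cmd))
```

Running the printed commands with a few different CRF values and comparing the resulting stills side by side makes the block noise and detail loss easy to see.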
🔳 Empty Latent Group
EmptyLTXVLatentVideo
This node generates an empty latent tensor for video at the specified width, height, and frame count. In Stage 1, it's created at half the size of the final output (Width=768, Height=512), which efficiently generates rough motion. The frame count (length) is automatically calculated using the "Duration × FPS + 1" formula, based on the values entered in the "Image to Video (LTX-2.3)" subgraph.
LTXVEmptyLatentAudio
This node generates an empty audio latent synchronized with the video's frame count and frame rate. Through the audio VAE, it initializes an audio latent space that shares the same timeline as the video latent. This is needed to generate video and audio simultaneously with the same sampler.
LTXVImgToVideoInplace
This node embeds the reference image into the first frame of the video latent (Inplace = it rewrites the latent directly). You can adjust how strongly the reference image affects the result via the strength parameter, and setting bypass=true makes it operate in T2V mode. In this workflow, it's used twice: in Stage 1 (strength=0.7) and Stage 2 (strength=1.0).
LTXVConcatAVLatent
This node combines the separate video latent and audio latent into a single AV latent. Since LTX-2.3's dual-stream architecture generates video and audio simultaneously in a single sampling pass, they need to be merged with this node before being passed to the sampler.
🔳 Prompt Group
LTXVConditioning
This node adds frame rate information to the Positive/Negative conditioning. This is where you tell the model what frame rate to generate the motion at.
🔳 Prompt Enhancement
Load LoRA (Model and CLIP)
This node applies LoRA to both MODEL and CLIP. In this workflow, gemma-3-12b-it-abliterated_lora_rank64_bf16.safetensors is connected to CLIP, reflecting the behavior of the "abliterated" version — which loosens Gemma 3's content filtering — in the text encoder.
Generate LTX2 Prompt
This node uses Gemma 3 to automatically expand the user's prompt. It converts a short instruction into a detailed descriptive paragraph, shaping it so LTX-2.3 can accurately interpret the intent. You can toggle this with the "Enable Prompt Enhance" boolean in the "Image to Video (LTX-2.3)" subgraph — setting it to False uses the input prompt as is.
🔳 Generate Low Resolution Group
ManualSigmas
This node manually specifies the denoising schedule (sigma values). A comma-separated list of numbers represents the noise strength at each step, and the number of values determines the step count. In this workflow, 8 steps' worth of values are set for Stage 1 and 3 steps' worth for Stage 2.
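Here is a minimal sketch of how such a comma-separated list can be parsed and sanity-checked. It assumes the usual convention in ComfyUI's custom samplers where each adjacent pair of sigma values is one denoising step, so a list of N+1 values gives N steps; that convention is my assumption and is not spelled out by the template. The values in the demo are illustrative only and are not the template's actual schedule.

```python
def parse_sigmas(text: str) -> list[float]:
    """Parse a comma-separated sigma schedule and run basic sanity checks."""
    sigmas = [float(tok) for tok in text.split(",") if tok.strip()]
    if len(sigmas) < 2:
        raise ValueError("need at least two sigma values (start and end)")
    if any(later > earlier for earlier, later in zip(sigmas, sigmas[1:])):
        raise ValueError("sigmas should not increase from one step to the next")
    return sigmas


def step_count(sigmas: list[float]) -> int:
    """Each adjacent pair of sigmas counts as one denoising step."""
    return len(sigmas) - 1


if __name__ == "__main__":
    # Illustrative values only, NOT the schedule shipped with the template.
    demo = parse_sigmas("1.0, 0.9, 0.8, 0.65, 0.5, 0.35, 0.2, 0.1, 0.0")
    print(step_count(demo))  # 8
```

Under that assumption, an 8-step schedule contains nine values and a 3-step schedule contains four.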
LTXVSeparateAVLatent
This node separates the post-sampling AV latent back into video and audio. It's the reverse of LTXVConcatAVLatent — the separated video latent is sent on to decoding or upscaling, and the audio latent is sent to the audio VAE decode.
LTXVCropGuides
This node adjusts the conditioning information to match the actual size of the latent. It corrects dimensional mismatches that arise after padding or resolution conversion, so the model can reference conditioning that corresponds to the correct spatial position.
🔳 Latent Upscale Group
LTXVLatentUpsampler
This node upscales the low-resolution video latent generated in Stage 1 using the Spatial Upscaler. Since it scales up 2x (when using the x2 model) directly in latent space without decoding to pixel space, it can bridge to the high-resolution Stage 2 while saving VRAM.
🔳 Ungrouped
LTXVAudioVAEDecode
This node converts the audio latent into waveform data. It decodes using the audio VAE and outputs AUDIO data that can be combined with the video in the downstream CreateVideo node.
How to Write Prompts
LTX-2.3 prompts are basically written in English.
The role of the prompt differs between T2V (text-to-video) and I2V (image-to-video).
- T2V: You specify everything about the video — content, appearance, and motion — through the prompt. Describe the subject's appearance, background, color tone, and motion in detail.
- I2V: Since the input image handles the visual content, the prompt can focus mainly on instructions for motion and action.
- Positive Prompt: Write it in natural language rather than as bullet points or a comma-separated list of keywords (see the "Prompt Design" section below for details).
- Negative Prompt: Values like "pc game, console game, video game, cartoon, childish, ugly" are set as the default.
- Prompt Enhancement: Setting the "Enable Prompt Enhance" boolean in the "Image to Video (LTX-2.3)" subgraph to True lets Gemma 3 automatically expand the prompt. This makes it easier to generate detailed video even from a short prompt.
The Two-Stage Generation Process
This workflow generates video in two stages: low resolution → Spatial Upscale → high resolution.
- Stage 1 (Generate Low Resolution): EmptyLTXVLatentVideo generates a low-resolution latent with Width and Height halved, and SamplerCustomAdvanced plus ManualSigmas (8 steps) generates the rough motion. The sampler is euler and the CFG is 1.0.
- Latent Upscale: LTXVLatentUpsampler uses the Spatial Upscaler x2 to scale the latent up by 2x.
- Stage 2 (Generate High Resolution): SamplerCustomAdvanced plus ManualSigmas (3 steps) refines the details of the upscaled latent.
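The relationship between the two stages can be sketched in a few lines of Python. This is only a planning helper under stated assumptions, not workflow code: `StagePlan` and `two_stage_plan` are hypothetical names, and the step counts are simply the 8 and 3 steps mentioned above.

```python
from dataclasses import dataclass


@dataclass
class StagePlan:
    name: str
    width: int
    height: int
    steps: int


def two_stage_plan(final_width: int, final_height: int) -> list[StagePlan]:
    """Stage 1 runs at half size; the x2 upscaler brings Stage 2 to full size."""
    if final_width % 2 or final_height % 2:
        raise ValueError("final size must be even so it can be halved")
    return [
        StagePlan("stage1_low_res", final_width // 2, final_height // 2, steps=8),
        StagePlan("stage2_high_res", final_width, final_height, steps=3),
    ]


if __name__ == "__main__":
    for stage in two_stage_plan(1536, 1024):
        print(stage)
```

With a final output of 1536 x 1024, Stage 1 works at 768 x 512, which matches the Stage 1 size described for EmptyLTXVLatentVideo.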
Decoding and Exporting
- VAEDecodeTiled: Converts Stage 2's video latent into frame images. Tiled processing lets it decode even high resolutions while saving VRAM.
- LTXVAudioVAEDecode: Converts the audio latent into a waveform.
- CreateVideo: Combines the video frames and audio to generate VIDEO data. FPS is automatically retrieved from the Frame Rate primitive.
- SaveVideo: Saves it as a video file.
How to Use the LTX-2.3: Image to Video Template
Now let's actually try using the template. It's very simple to use.
- Load the input image: Load the asset egyptian_queen.png into "Load Image."
- Load each model: Check that the models in the "Image to Video (LTX-2.3)" subgraph are loaded correctly.
- Run generation: Once the input image and each model are loaded correctly, run it with the "Run" button.
After a short while, the generated result will appear in "Save Video."
If you'd like to try T2V with this workflow, you can switch by setting the value of "Switch to Text to Video?" inside the "Image to Video (LTX-2.3)" subgraph to true.
Generation Results with the Official Workflow
The generation result came out as follows. The audio is muted, so unmute it if you'd like to listen. ✅ The video has been downscaled for the web.
As a reference for generation speed, it took 182 seconds on my setup (RTX3090). (SageAttention is not used.) ✅ Disabling the prompt enhancer makes it even faster. If you have limited VRAM, I recommend disabling the prompt enhancer.
LTX-2.3 Prompt Design Guide
🔗LTX-2.3's official prompting guide recommends writing prompts in English as a single flowing paragraph. Rather than bullet points or a comma-separated list of keywords, writing 4-8 natural sentences using present-tense verbs makes it easier for the Gemma 3 text encoder to accurately interpret your intent.
As mentioned earlier, the role the prompt plays differs between T2V and I2V.
- T2V: You specify everything about the video (appearance, scene, motion) through the prompt. It's effective to cover as many of the 6 elements below as possible.
- I2V: Since the input image handles the visual content, it's effective to focus the prompt on instructions for motion, camera work, and audio. You can leave character definition and scene-building to the image.
The 6 elements recommended for T2V are as follows.
- Shot setup: Cinematography terms and scale specification (e.g., medium close-up, wide establishing shot)
- Scene-building: Lighting, color palette, texture, and mood
- Action description: Describe it as a natural flow from beginning to end (present tense)
- Character definition: Age, hairstyle, outfit, and distinctive features
- Camera movement: Specify the timing and direction (e.g., slow dolly in, handheld tracking shot). Also describing how the subject looks after the movement improves generation accuracy.
- Audio elements: Describe ambient sound, music, and dialogue. Enclose dialogue in quotation marks, and specify the language or accent if needed.
A young woman with curly auburn hair walks through a sunlit autumn forest, her long coat brushing fallen leaves. Warm golden light filters through the canopy, casting dappled shadows on the path. The camera begins with a wide establishing shot, then slowly dollies in as she pauses and looks up. A gentle breeze moves the leaves overhead.
Negative prompt: low quality, worst quality, deformed, distorted, fused fingers, bad anatomy, motion smear, weird hand, ugly
✅ Patterns to avoid: describing internal emotions like "she felt sad" (express emotion through visual cues like facial expressions and gestures instead), readable text or signage, complex physics (like splashing liquid — though dancing is fine), overly complicated scenes (many characters or actions happening at once), conflicting lighting (mixing contradictory light sources), and over-complicating the prompt (start simple and add elements gradually).
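If you draft the six elements separately, a small helper can join them into the single flowing paragraph the guide recommends. This is a minimal sketch of one way to do it: `build_prompt`, `ELEMENT_ORDER`, and the dictionary keys are hypothetical names, and the ordering is just the order in which the elements are listed above, not an official requirement.

```python
ELEMENT_ORDER = ("shot", "scene", "action", "character", "camera", "audio")


def build_prompt(elements: dict[str, str]) -> str:
    """Join the provided elements into one paragraph, skipping empty ones."""
    sentences = []
    for key in ELEMENT_ORDER:
        text = " ".join(elements.get(key, "").split())  # collapse newlines/spaces
        if not text:
            continue
        if text[-1] not in '.!?"':
            text += "."
        sentences.append(text)
    return " ".join(sentences)


if __name__ == "__main__":
    print(build_prompt({
        "shot": "A wide establishing shot of a sunlit autumn forest",
        "action": "A young woman walks along the path and pauses to look up",
        "camera": "The camera slowly dollies in as she stops",
        "audio": "Leaves rustle softly in a gentle breeze",
    }))
```

For I2V you would typically fill in only the action, camera, and audio entries and leave the rest to the input image.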
You can technically write the WebUI or ComfyUI prompt emphasis syntax for models like LTX-2.3 or WAN, but it won't have the effect you expect. This syntax was originally devised as a hack for CLIP encoders, so on models that don't use CLIP, it rarely emphasizes the intended word the way it does with CLIP-based models. Let's look at the example below.
A young woman with (curly auburn hair: 1.2) walks through a sunlit autumn forest,
In this case, the parentheses and weight value are stripped out and processed internally as a weighting adjustment, but unlike with CLIP-based models, it doesn't reliably emphasize just the "curly hairstyle." This can end up affecting unintended parts of the image, or make generation less stable overall.
Prompt emphasis works intuitively and as intended only on models that primarily rely on a CLIP encoder, such as SD1.5 or SDXL.
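If you are reusing prompts written for SD1.5 or SDXL, one option is to strip the weights before passing them to LTX-2.3. The following is a minimal sketch using a regular expression: `strip_emphasis` is a hypothetical helper, and it only handles the simple `(text: 1.2)` form without nesting, leaving ordinary parentheses untouched.

```python
import re

# Matches "(some words: 1.2)" or "(word:0.8)"; plain "(aside)" is left alone.
_EMPHASIS = re.compile(r"\(\s*([^():]+?)\s*:\s*[0-9]*\.?[0-9]+\s*\)")


def strip_emphasis(prompt: str) -> str:
    """Replace each weighted group with just its inner text."""
    return _EMPHASIS.sub(r"\1", prompt)


if __name__ == "__main__":
    weighted = "A young woman with (curly auburn hair: 1.2) walks through a sunlit autumn forest,"
    print(strip_emphasis(weighted))
    # A young woman with curly auburn hair walks through a sunlit autumn forest,
```

If a detail still needs more weight after that, describing it in more words within the sentence is more in line with the natural-language style recommended above.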
LTX-2.3 vs WAN2.2: Speed and Quality Comparison
Let's compare generation with WAN2.2.
To keep things as fair as possible, I generated these with the settings below. Both LTX-2.3 and WAN2.2 have the fast distilled LoRA applied, no upscaling, and LTX-2.3's prompt enhancer disabled. The seed value is shared within the same model.
⚠️ These test settings are skewed in favor of WAN2.2.
- GPU: RTX3090 (24GB)
- RAM: 128GB
- Resolution: 704 x 1280
- Length: 121 frames (5 seconds)
- Frame rate: 16
- Sampler: euler
- Scheduler: simple
A fox girl standing. She is looking at the camera. She blinks occasionally.
soft wind swaying her hair and cloth gently. A forest where a gentle breeze blows. grass and tree leaves are swaying with wind. subtle dust particles drifting in sunlight.
fix camera movements, ultra-detailed
LTX-2.3 without SageAttention - 92 seconds
LTX-2.3 with SageAttention - 77 seconds
WAN2.2 SageAttention2.2 - 388 seconds
WAN2.2 SpargeAttention - 359 seconds
LTX-2.3 generates roughly 4.4x faster than WAN2.2 when the two runs for each model are averaged (about 84.5 seconds versus 373.5 seconds). Looking at quality, WAN2.2 captures finer detail in its motion. That said, considering the generation speed and the fact that it generates audio as well, LTX-2.3 is a solid option too.
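For reference, the speed ratio can be reproduced from the four timings above. This is a minimal sketch: `speedup_summary` is a hypothetical helper, and averaging the two runs per model is my own way of summarizing the numbers, shown alongside the least and most favorable pairings for LTX-2.3.

```python
from statistics import mean

# Generation times in seconds, copied from the results above.
TIMES_S = {
    "LTX-2.3 without SageAttention": 92,
    "LTX-2.3 with SageAttention": 77,
    "WAN2.2 SageAttention2.2": 388,
    "WAN2.2 SpargeAttention": 359,
}


def speedup_summary(times: dict[str, float]) -> dict[str, float]:
    """How many times faster LTX-2.3 is than WAN2.2, under three pairings."""
    ltx = [t for name, t in times.items() if name.startswith("LTX")]
    wan = [t for name, t in times.items() if name.startswith("WAN")]
    return {
        "average": mean(wan) / mean(ltx),   # mean WAN time vs mean LTX time
        "lowest": min(wan) / max(ltx),      # fastest WAN vs slowest LTX
        "highest": max(wan) / min(ltx),     # slowest WAN vs fastest LTX
    }


if __name__ == "__main__":
    for label, ratio in speedup_summary(TIMES_S).items():
        print(f"{label}: {ratio:.1f}x")
    # average: 4.4x
    # lowest: 3.9x
    # highest: 5.0x
```

These are single runs on one RTX3090 setup, so treat the ratios as a rough guide rather than a benchmark.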
Customizing the Official Workflow
From here, I'll introduce "DCAI LTX-2.3 I2V FLF Interpolation," an improved version built on top of ComfyUI's official "LTX-2.3: Image to Video" workflow, aimed at higher-quality video. The customizations are as follows.
- Splitting the generation stages: Preview at low resolution first, and only move on to high resolution if you're happy with it — splitting it into two stages eliminates wasted high-resolution processing.
- Video control: Uses FLF (first and last frame control) to make the video a seamless, infinite loop.
- Frame interpolation: You can use a frame interpolation node to raise the output video's frame rate.
- Consistency support via LoRA: Uses LoRA to improve character consistency.
- App mode (Beta): Uses app mode so you can work with a simple UI.
✅ In the paid section, besides the custom workflow, I also cover techniques for generating high resolution, among other things.
The workflow and input assets are published on Patreon. Only paid supporters can view and download them.
Here are video samples generated with this custom workflow.
Example of a loop video
Example of speaking Japanese
Summary
In this article, I covered the features of LTX-2.3, how to set it up in ComfyUI, and the basic generation steps using the official template.
- LTX-2.3 is a 22B-parameter model that generates video and audio simultaneously, and a key feature is that it is relatively lightweight to run.
- Using the official ComfyUI "LTX-2.3: Image to Video" template, you can try generating just by gathering the required models.
- Write prompts as natural English sentences, and keep in mind that the role they play differs between T2V and I2V.
- Compared to WAN2.2, it excels in generation speed, but WAN2.2 appears to have the edge in expressing fine motion detail.
In the second half, I also cover the custom workflow "DCAI LTX-2.3 I2V FLF Interpolation" aimed at even higher-quality video, points to watch for during high-resolution generation, and the results of testing Japanese voice generation. If you're interested, please check out the full article as well.
Thank you for reading to the end.
If you found this even a little helpful, please support by giving it a “Like”!


