Wan 2.2 Basic Guide: Getting Started with High-Quality AI Video Generation in ComfyUI

- What Is the WAN 2.2 Video Generator?
- How to Set Up Wan 2.2 in ComfyUI
- Walkthrough: ComfyUI_examples 5B TI2V Workflow
- Walkthrough: ComfyUI_examples A14B I2V Workflow
- Wan 2.2 Render Times Explained
- Wan 2.2 Prompt Design Best Practices
- Customize the Official Wan 2.2 Workflows
- Pro Workflow Tips
- Push Speed Further
- Conclusion
This guide explains how to work with Wan 2.2 for AI video production. At the time of writing, video generation dominates the generative AI scene; DCAI previously highlighted “ComfyUI-AnimateDiff-Evolved” as our recommended custom node. Back then only a handful of video models could run locally, so cloud-first services such as Sora, Runway, and Luma AI led the pack, but excellent locally runnable models like Tencent’s Hunyuan keep arriving. Here we focus on Alibaba Cloud’s open-source video generation suite “🔗Wan 2.2”, currently among the best open-source video generators you can operate yourself. Wan offers paid on-demand generation and API-based cloud access, plus free open-source model weights you can host locally. ComfyUI and SwarmUI already support local execution, and this article walks through the ComfyUI workflow. We build on the official guide and dig into techniques that improve stability. Let’s cover Wan fundamentals first, then expand the workflow to aim for high-quality videos.
What Is the WAN 2.2 Video Generator?
“Wan 2.2” is a large-scale diffusion Transformer video model that employs a two-stage Mixture-of-Experts (MoE) design with a high-noise (initial) phase and a low-noise (final) phase (A14B model only). Feed it text or reference images to render cinematic, high-quality footage. The paper “🔗Wan: Open and Advanced Large-Scale Video Generative Models” introduces a new VAE structure and scaling strategy that you can use inside ComfyUI as-is.
- Supported tasks: Text-to-video (T2V), image-to-video (I2V), and speech-to-video (S2V) are available
- Default resolution: T2V/I2V deliver 480p–720p; TI2V-5B is tuned for 720p@24fps
- Model lineup: MoE A14B models for T2V/I2V, a hybrid 5B TI2V model, plus a dedicated Wan 2.2 (TI2V-5B) VAE
- GPU memory requirements: T2V/I2V/S2V-A14B models target 80 GB-class GPUs, but ComfyOrg’s FP8 repack lets you offload to an RTX 4090 (24 GB). TI2V-5B needs roughly 24 GB
Wan 2.1 vs. Wan 2.2 Key Differences
Alongside the MoE architecture, Wan 2.2 trains on a dataset that increases image coverage by 65.6% and video coverage by 83.2% compared with Wan 2.1. The 5B model introduces a new 16×16×4 compression VAE, allowing 720p@24fps output. (From the 🔗Wan2.2 README)
How to Set Up Wan 2.2 in ComfyUI
To run the WAN Video Generation workflow, place the model files in the correct folders. Update ComfyUI to the latest build first, then follow the sequence below.
ComfyUI has been unstable lately. The frontend and core still feel out of sync, and I keep running into bugs. Installing the Wan components alone corrupted my environment and forced a clean install. If you want to preserve your current ComfyUI setup, back it up before installing, or spin up a fresh ComfyUI Portable instance so you can build the Wan environment separately.
Download Wan 2.2 Model Files
Grab the files from the Hugging Face repository “Comfy-Org/Wan_2.2_ComfyUI_Repackaged” and place them under `ComfyUI/models`.
⚠️The T2V-A14B/I2V-A14B models ship in High Noise and Low Noise pairs. They behave like SDXL’s base and refiner: start inference with High Noise, then hand over to Low Noise mid-run to polish the finish.
- T2V A14B: `wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors` + `wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors` (an FP16 build is bundled, so switch if you have VRAM headroom)
- I2V A14B: `wan2.2_i2v_high_noise_14B_fp8_scaled.safetensors` + `wan2.2_i2v_low_noise_14B_fp8_scaled.safetensors`. For maximum fidelity, use the matching FP16 versions (`..._fp16.safetensors`).
- TI2V 5B / S2V 14B: Task-specific builds such as `wan2.2_ti2v_5B_fp16.safetensors` and `wan2.2_s2v_14B_fp8_scaled.safetensors` live in the same repository.
- LoRA: Pair `wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise.safetensors` with `wan2.2_t2v_lightx2v_4steps_lora_v1.1_low_noise.safetensors`, and use the I2V set `wan2.2_i2v_lightx2v_4steps_lora_v1_high_noise.safetensors` + `wan2.2_i2v_lightx2v_4steps_lora_v1_low_noise.safetensors`.
- Text Encoder: `umt5_xxl_fp16.safetensors` or the lighter `umt5_xxl_fp8_e4m3fn_scaled.safetensors`.
- Audio Encoder: Use `wav2vec2_large_english_fp16.safetensors` when you need audio-driven generation.
- VAE: Load `wan_2.1_vae.safetensors` for the A14B models and `wan2.2_vae.safetensors` for the 5B model.
Example placement:
ComfyUI/
├── 📁 models/
│ ├── 📁 audio_encoders/
│ │ └── wav2vec2_large_english_fp16.safetensors
│ ├── 📁 diffusion_models/
│ │ ├── wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors
│ │ ├── wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors
│ │ ├── wan2.2_i2v_high_noise_14B_fp8_scaled.safetensors
│ │ ├── wan2.2_i2v_low_noise_14B_fp8_scaled.safetensors
│ │ ├── wan2.2_ti2v_5B_fp16.safetensors
│ │ └── wan2.2_s2v_14B_fp8_scaled.safetensors
│ ├── 📁 loras/
│ │ ├── wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise.safetensors
│ │ ├── wan2.2_t2v_lightx2v_4steps_lora_v1.1_low_noise.safetensors
│ │ ├── wan2.2_i2v_lightx2v_4steps_lora_v1_high_noise.safetensors
│ │ └── wan2.2_i2v_lightx2v_4steps_lora_v1_low_noise.safetensors
│ ├── 📁 text_encoders/
│ │ ├── umt5_xxl_fp8_e4m3fn_scaled.safetensors
│ │ └── umt5_xxl_fp16.safetensors
│ └── 📁 vae/
│ ├── wan_2.1_vae.safetensors
│ └── wan2.2_vae.safetensors
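If you want a quick sanity check that everything landed in the right folders, a small script like the one below can help. The root path and the file selection are assumptions; adjust them to the variants you actually downloaded.

```python
from pathlib import Path

# Assumed ComfyUI root; change this to wherever your installation lives.
COMFYUI_ROOT = Path("ComfyUI")

# Minimal file set for the 5B TI2V workflow; extend with the A14B files you use.
EXPECTED = {
    "diffusion_models": ["wan2.2_ti2v_5B_fp16.safetensors"],
    "text_encoders": ["umt5_xxl_fp8_e4m3fn_scaled.safetensors"],
    "vae": ["wan2.2_vae.safetensors"],
}

for folder, names in EXPECTED.items():
    for name in names:
        path = COMFYUI_ROOT / "models" / folder / name
        print(("OK     " if path.exists() else "MISSING"), path)
```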
Walkthrough: ComfyUI_examples 5B TI2V Workflow


The “Wan 2.2 Models” page on ComfyUI_examples ships with a baseline Wan 2.2 video workflow. This section builds on it to show how to render 720p clips with the text+image-to-video (TI2V) 5B model. The layout is simple and works for both text-only runs and I2V jobs that start from a still image.
Load the Image to Video sample to understand the Text to Video path as well. Download the sample image and drag it into ComfyUI, or right-click “Workflow in Json format” under the image, save it, and import the JSON file.
You can grab the input image from 🔗here.
Load the Models
- UNETLoader: Load `wan2.2_ti2v_5B_fp16.safetensors` as the diffusion backbone. The TI2V model conditions on both text and images and is tuned for 720p@24fps (480p is not supported).
- ModelSamplingSD3: This Stable Diffusion 3 node is repurposed here to rebuild the sampling schedule, letting you adjust noise levels.
- CLIPLoader: Use `umt5_xxl_fp8_e4m3fn_scaled.safetensors` (type `wan`) for text conditioning. Wan 2.2 relies on a T5/CLIP hybrid, so this compatibility matters.
- VAELoader: Decode with the latest `wan2.2_vae.safetensors` to improve color fidelity and detail.
The ModelSamplingSD3 node (a CONST head plus ModelSamplingDiscreteFlow) remaps the denoising curve via its internal `time_snr_shift`. You can choose samplers such as `euler`, `heun`, `dpmpp_2m`, or `uni_pc`, but the best option depends on your model, resolution, and step count.
✅Lowering `shift` increases the influence of the early noise, making changes more dramatic, while raising it calms the tail end and leans toward static, detailed frames (results vary by environment). When you generate with minimal steps (for example, with the Lightning LoRA), the effect is weaker. Nudge the value toward 9 if you want more detail, or around 6 if you want stronger motion.
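To make the effect of `shift` more concrete, here is a minimal sketch of the sigma remapping used by flow-style model sampling in ComfyUI. The expression mirrors the `shift * s / (1 + (shift - 1) * s)` form of `time_snr_shift`; treat it as an illustration of the behavior rather than a verbatim copy of the node's code.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Remap a normalized sigma (0..1) with a flow-matching style shift.

    Illustration only: this mirrors the shift * s / (1 + (shift - 1) * s)
    remap behind ModelSamplingSD3's shift parameter, not the node's source.
    """
    return shift * sigma / (1 + (shift - 1) * sigma)

# Higher shift values remap the schedule upward, keeping more of it at high noise.
for shift in (3.0, 6.0, 9.0):
    curve = [round(shift_sigma(i / 10, shift), 2) for i in range(11)]
    print(f"shift={shift}: {curve}")
```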
Prompt Setup for Wan 2.2
Wan uses the bilingual UMT5-XXL text encoder, so enter prompts in English or Simplified Chinese.
- Positive Prompt: Describe the scene and action clearly, e.g., `a cute anime girl with fennec ears and a fluffy tail walking in a beautiful field`.
- Negative Prompt: Numerous Chinese phrases are preloaded to suppress artifacts, helping you avoid color clipping, distortion, and missing limbs.
Configure Initial Latents and Duration
Use the `Wan22ImageToVideoLatent` node to set resolution and frame count. The template defaults to `1280×704`, `length=41`, and `batch_size=1`. A Note node in the corner recommends 121 frames, but the template keeps the count shorter for quicker previews. Connect a still image to `start_image` for I2V, or leave it empty for text-only runs.
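For intuition, the sketch below applies the 5B model's 16×16×4 compression (mentioned in the Wan2.2 README section earlier) to these defaults. The exact rounding ComfyUI applies internally may differ slightly; this is only an estimate.

```python
def ti2v_latent_shape(width: int, height: int, length: int) -> tuple[int, int, int]:
    """Estimate the latent grid for the Wan 2.2 5B VAE (16x16x4 compression).

    Spatial axes shrink by 16x and the temporal axis by 4x with the first frame
    kept, which is why frame counts of the form 4n + 1 (41, 121, ...) are used.
    """
    return (length - 1) // 4 + 1, height // 16, width // 16

print(ti2v_latent_shape(1280, 704, 41))   # template default -> (11, 44, 80)
print(ti2v_latent_shape(1280, 704, 121))  # recommended length -> (31, 44, 80)
```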
Sampling Settings
The `KSampler` node drives the generation.
- Steps: 30
- CFG: 5 (raising it too high introduces flicker)
- Sampler: `uni_pc`
- Scheduler: `simple` (the lightweight scheduler from `comfy/samplers.py`, which samples evenly from the `sigmas` array inside `ModelSamplingSD3` so you preserve Wan 2.2’s intended noise curve)
- Seed: With `randomize` enabled you get a new clip each time. Switch “control after generate” to `fixed` to lock it.
- Denoise: 1.0 ⚠️Lowering this in I2V does not keep the source image, so leave it at `1.0`.
Decode and Export
- VAEDecode: Convert latents into frames with the Wan 2.2 VAE.
- SaveAnimatedWEBP: Export the 24fps sequence as an animated WebP at quality 80 for lightweight previews (a rough Pillow equivalent is sketched after this list).
- SaveWEBM: Output a 24fps WebM via the `vp9` codec. The `crf` is set around 16 (a constant-quality setting where lower values mean higher quality, not a bitrate), which makes it suitable as a high-quality master.
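For reference, this is roughly what the animated WebP export amounts to, sketched with Pillow. It is not the node's actual code, and the generated frames here are placeholders standing in for your decoded images.

```python
from PIL import Image

# Placeholder frames; in practice these are the decoded frames from VAEDecode.
frames = [Image.new("RGB", (1280, 704), (4 * i % 256, 64, 160)) for i in range(48)]

# Write a 24 fps animated WebP at quality 80, similar to SaveAnimatedWEBP.
frames[0].save(
    "preview.webp",
    save_all=True,
    append_images=frames[1:],
    duration=int(1000 / 24),  # per-frame duration in milliseconds
    loop=0,                   # 0 = loop forever
    quality=80,
)
```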
The Wan 2.2 5B model alone struggles to deliver consistently high-end footage, but it is lightweight and approachable—even lower-spec PCs can run it—so it works well as an entry point when you test Wan.
Walkthrough: ComfyUI_examples A14B I2V Workflow

The flow is almost identical to the 5B model. The key difference is that the A14B build ships with High Noise and Low Noise checkpoints, so you need to switch samplers mid-run, just like SDXL with a refiner.
Sampling
Use the `KSampler (Advanced)` node twice. The official recommendation switches to the Low Noise model at 50% completion (a small sketch of how the steps are split between the two passes follows the settings below).
Pass 1
- Add Noise: enable
- Seed: `randomize`
- Steps: 20
- CFG: 3.5
- Sampler: `euler`
- Scheduler: `simple`
- Start at step: 0
- End at step: 10 (stop at the specified step)
- Return with leftover noise: enable (keeps the latent with residual noise when you stop mid-process)
Pass 2
- Add Noise: disable (reuse the noise from pass 1)
- Seed: `fixed`
- Steps: 20
- CFG: 3.5
- Sampler: `euler`
- Scheduler: `simple`
- Start at step: 10 (resume from the chosen step)
- End at step: 10000
- Return with leftover noise: disable
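Here is the step-split illustration mentioned above. The sigma values are placeholders, not Wan 2.2's real schedule; the point is that both passes share one 20-step schedule and only the covered segment changes.

```python
# Both KSampler (Advanced) passes share one 20-step schedule; only the
# covered segment differs. The sigma values below are placeholders.
total_steps, switch_at = 20, 10
sigmas = [round(1.0 - i / total_steps, 2) for i in range(total_steps + 1)]

high_noise_part = sigmas[: switch_at + 1]  # pass 1: steps 0-10, leftover noise kept
low_noise_part = sigmas[switch_at:]        # pass 2: resumes at step 10, no added noise

print("high noise expert:", high_noise_part)
print("low noise expert: ", low_noise_part)
```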
This A14B example combines the `euler` sampler with the `simple` scheduler. The `simple` scheduler samples evenly from the `sigmas` array inside `ModelSamplingSD3`, matching the training schedule released by the Wan team. You can pick other schedulers (`normal`, `karras`, `exponential`, `sgm_uniform`, `beta`, `linear_quadratic`, `kl_optimal`, and so on), but they alter the time-SNR curve and often break motion, so avoid them outside of tests. Switching the sampler to `dpmpp_2m` or `dpmpp_2m_sde` changes smoothness slightly, yet you should still pair them with the `simple` scheduler for safety.
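To show what "samples evenly from the `sigmas` array" means in practice, here is a simplified scheduler in the same spirit. It paraphrases the idea behind ComfyUI's `simple` scheduler rather than reproducing the exact code in `comfy/samplers.py`.

```python
def simple_like_scheduler(sigmas: list[float], steps: int) -> list[float]:
    """Pick `steps` sigmas at even intervals from the model's full sigma table.

    Paraphrase of the idea behind the `simple` scheduler: because the values
    come straight from the model's own table, the trained noise curve of
    Wan 2.2 (after the ModelSamplingSD3 shift) is preserved.
    """
    stride = len(sigmas) / steps
    picked = [sigmas[min(int(i * stride), len(sigmas) - 1)] for i in range(steps)]
    return picked + [0.0]  # finish at zero noise

# Toy descending table standing in for the model's real sigmas.
table = [1.0 - i / 1000 for i in range(1000)]
print([round(s, 3) for s in simple_like_scheduler(table, 10)])
```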
Wan 2.2 Render Times Explained
Even in FP8, the Wan 2.2 A14B model takes longer than the 5B build. The sample Wan 2.2 A14B I2V setup ran for about 45 minutes on an RTX 3090, but the MoE architecture delivers clearly better quality. Switching to the Wan 2.2 5B model with the same settings cut the render to roughly three minutes, but the results were unusable: a bouquet morphed into a gun mid-scene.
For local A14B use, rely on GGUF variants and the Lightning 4-step LoRA. Installing Sage Attention takes more steps on Windows than on Linux, but if your GPU supports it, it can shave significant time off inference.
If you need smoother A14B production, consider cloud GPU services such as RunPod.
Wan 2.2 Prompt Design Best Practices
Consistent frames are critical in video generation, so separate the “scene skeleton” from “cinematic keywords” in your prompts. Here’s a recommended T2V template:
{main_subject}, {outfit_detail}, shot on anamorphic lens, cinematic lighting, soft rim light, depth of field, trending on artstation
Negative prompt: motion blur, duplicated limbs, distorted face, overexposed, low detail background
Place the key elements (subject, outfit, etc.) at the top of the prompt and cluster cinematic keywords at the end to reduce drift. Prioritize riskier terms such as “motion blur” or “duplicated limbs” in the negative prompt.
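As a trivial convenience, the template can be filled in programmatically so the cinematic keyword block stays fixed while you iterate on the subject. The field values below are just example strings.

```python
T2V_TEMPLATE = (
    "{main_subject}, {outfit_detail}, shot on anamorphic lens, cinematic lighting, "
    "soft rim light, depth of field, trending on artstation"
)
NEGATIVE_PROMPT = "motion blur, duplicated limbs, distorted face, overexposed, low detail background"

positive = T2V_TEMPLATE.format(
    main_subject="a cute anime girl with fennec ears and a fluffy tail",
    outfit_detail="white summer dress",  # example value
)
print(positive)
print("Negative:", NEGATIVE_PROMPT)
```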
For I2V, your reference image already defines subject, scene, and style, so focus the prompt on motion and camera direction.
The official guide “🔗Easy Creation with One Click – AI Videos” is also worth reading.
Customize the Official Wan 2.2 Workflows
Next we extend the official ComfyUI Wan 2.2 workflows to push quality higher. We provide separate customizations for the 5B and A14B models. The enhancements cover:
- Lightweight inference: Avoid situations where FP16 or FP8 is too heavy to run or takes excessive time.
- Loop-ready videos: Build seamless infinite loops.
- AI-driven upscaling: Wan 2.2 cannot run SDXL or Flux.1 as a second pass, so we upscale with dedicated models.
- Frame interpolation: Smooth out 16fps (A14B) and 24fps (5B) output with interpolation (a naive illustration of the idea follows this list).
- Orientation toggle: Switch between portrait and landscape presets in one click.
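To illustrate what the frame-interpolation step does conceptually, here is a naive frame-blending doubler. Dedicated interpolation models produce far better motion; this is only to show the idea.

```python
import numpy as np

def double_fps_blend(frames: list[np.ndarray]) -> list[np.ndarray]:
    """Naive interpolation: insert the average of each pair of neighboring frames.

    Only an illustration of lifting 16/24 fps footage to a higher frame rate;
    learned interpolators give much better results on real motion.
    """
    out: list[np.ndarray] = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append(((a.astype(np.float32) + b.astype(np.float32)) / 2).astype(np.uint8))
    out.append(frames[-1])
    return out

# Toy example: three dummy frames become five.
dummy = [np.full((8, 8, 3), v, dtype=np.uint8) for v in (0, 128, 255)]
print(len(double_fps_blend(dummy)))  # -> 5
```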
The workflow is hosted on Patreon for paid supporters.

Here are sample clips generated with the custom workflow.
One clip uses the Wan2.2-I2V-5B model and the other the Wan2.2-I2V-A14B model. Both rely on ComfyUI defaults (the optional `--use-sage-attention` and `--fast` flags are disabled) and finish in about ten minutes, not counting upscaling.
Wan2.2-I2V-5B Model Sample
Wan2.2-I2V-A14B Model Sample
Pro Workflow Tips
Below are two pro techniques you can use during AI video production.
- Export frames as still images (a small sketch for this appears at the end of this section)
- Generate videos like a storyboard
Use these tips to polish your footage into high-quality results like the example below.
Color grading and logo placement were added in post-processing.
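For the first tip (exporting frames as still images), a minimal sketch with OpenCV follows. It assumes an FFmpeg-enabled OpenCV build that can open your exported WebM, and the file name is only an example.

```python
from pathlib import Path

import cv2  # pip install opencv-python (needs FFmpeg support for WebM input)

Path("frames").mkdir(exist_ok=True)
capture = cv2.VideoCapture("wan22_output.webm")  # example path to your export

index = 0
while True:
    ok, frame = capture.read()
    if not ok:
        break  # end of clip (or the file could not be decoded)
    cv2.imwrite(f"frames/frame_{index:04d}.png", frame)
    index += 1

capture.release()
print(f"Exported {index} frames")
```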
Push Speed Further
Beyond Sage Attention, which we referenced earlier, kijai’s “ComfyUI-WanVideoWrapper” adds finer control over Wan 2.2. This guide only introduces it briefly; we will cover the full setup in a future article.

Conclusion
Wan 2.2 centers on the A14B model and enables high-quality video production locally. In ComfyUI, proper model placement, the required custom nodes, High/Low switching, and GGUF/Lightning LoRA support let you balance stability and speed. Combine thoughtful prompt design, loop workflows, upscaling, interpolation, and external tooling to build a production-ready pipeline.
Thank you for reading to the end.
If you found this even a little helpful, please support by giving it a “Like”!