CVPR 2026

Latent-Compressed Variational Autoencoder
for Video Diffusion Models

LC-VAE

Jiarui Guan1 Wenshuai Zhao1,2 Zhengtao Zou1 Juho Kannala1,3 Arno Solin1,2

1Aalto University    2ELLIS Institute Finland    3University of Oulu

{jiarui.guan, wenshuai.zhao, zhengtao.zou, juho.kannala, arno.solin}@aalto.fi

LC-VAE teaser figure

Figure 1. Comparison between video VAEs with and without the proposed latent compression. LC-VAE performs frequency-aware latent compression for video generation. An input video is encoded and decomposed by multi-level 3D wavelet transforms (Multi-WT); low-frequency channels are retained as compact latent representations where diffusion operates. After denoising, the latent is zero-padded, processed by multi-level inverse wavelet transforms (Multi-IWT), and decoded into the final video. This design preserves global structure while reducing latent dimensionality.

Abstract

Compressing Latent Space by Removing
Uninformative High-Frequency Components

Video variational autoencoders (VAEs) used in latent diffusion models typically require a sufficiently large number of latent channels to ensure high-quality video reconstruction. However, recent studies have revealed that an excessive number of latent channels can impede the convergence of latent diffusion models and deteriorate their generative performance, even when reconstruction quality remains high. We propose a latent compression method that removes high-frequency components in video latent representations rather than directly reducing the number of channels, which often compromises reconstruction fidelity. Experimental results demonstrate that the proposed method achieves superior video reconstruction quality compared to strong baselines while maintaining the same overall compression ratio.

Method

LC-VAE: Frequency-Aware Latent Compression

LC-VAE compresses the video latent by applying a multi-level 3D Haar wavelet transform and zeroing out selected high-frequency subbands. The encoder is thereby encouraged to concentrate diffusion-friendly, low-frequency content in the retained subbands, while recovery of high-frequency texture is delegated to the decoder.
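As a rough sketch of the idea, a single-level 3D Haar decomposition with the subband-zeroing rule can be written in a few lines of NumPy. This is an illustration only, not the paper's implementation; the axis order (t, h, w), orthonormal normalization, and function names are our assumptions:

```python
import numpy as np

def haar_split(x, axis):
    """One-level orthonormal Haar split along `axis` -> (low, high) subbands."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_merge(low, high, axis):
    """Inverse of haar_split: reconstruct even/odd samples and interleave them."""
    even, odd = (low + high) / np.sqrt(2), (low - high) / np.sqrt(2)
    out = np.empty(low.shape[:axis] + (2 * low.shape[axis],) + low.shape[axis + 1:])
    sl = [slice(None)] * out.ndim
    sl[axis] = slice(0, None, 2)
    out[tuple(sl)] = even
    sl[axis] = slice(1, None, 2)
    out[tuple(sl)] = odd
    return out

def wt3d(z):
    """One level of 3D Haar over (t, h, w): 8 subbands keyed 'LLL'..'HHH'."""
    bands = {'': z}
    for axis in range(3):
        nxt = {}
        for key, v in bands.items():
            lo, hi = haar_split(v, axis)
            nxt[key + 'L'], nxt[key + 'H'] = lo, hi
        bands = nxt
    return bands

def iwt3d(bands):
    """Inverse 3D Haar transform: merge subband pairs axis by axis."""
    for axis in (2, 1, 0):
        bands = {k[:-1]: haar_merge(bands[k[:-1] + 'L'], bands[k[:-1] + 'H'], axis)
                 for k in bands if k.endswith('L')}
    return bands['']

def zero_high_freq(bands):
    """Keep {LLL, LLH, LHL, HLL} (at most one H); zero the rest."""
    return {k: v if k.count('H') <= 1 else np.zeros_like(v)
            for k, v in bands.items()}
```

The transform is orthonormal, so `iwt3d(wt3d(z))` reconstructs `z` exactly; zeroing the four subbands with two or more H letters is what discards the least informative half of the coefficients.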

LC-VAE framework overview

Figure 2. Overview of the LC-VAE framework. The model first applies a multi-level wavelet transform (Multi-WT) to the latent features produced by the encoder. Low-frequency channels are selected to retain compact yet informative representations, while the high-frequency subbands are zeroed out. During generation, diffusion operates within this compressed subspace. The sampled representation is subsequently zero-padded, passed through multi-scale inverse wavelet transforms (Multi-IWT), and decoded to reconstruct the video.

Wavelet Subband Selection — 3D Haar Decomposition

Frequency energy and correlation analysis

Figure 3. Energy and correlation distribution across frequencies. Low-frequency subbands exhibit higher energy and stronger temporal correlation, whereas high-frequency subbands are less informative.
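The figure's observation can be reproduced on a toy signal: for smooth temporal content, the Haar low band (pairwise averages) carries nearly all of the energy, while the high band (pairwise differences) is close to zero. A minimal 1D illustration, not taken from the paper:

```python
import numpy as np

# Smooth "temporal" signal: one sine cycle sampled at 64 frames.
t = np.linspace(0.0, 1.0, 64)
x = np.sin(2 * np.pi * t)

# One-level orthonormal Haar split along time.
low = (x[0::2] + x[1::2]) / np.sqrt(2)    # pairwise averages -> low band
high = (x[0::2] - x[1::2]) / np.sqrt(2)   # pairwise differences -> high band

# For smooth content, almost all energy lands in the low band.
ratio = np.sum(low**2) / (np.sum(low**2) + np.sum(high**2))
print(f"low-band energy fraction: {ratio:.4f}")
```

Because the split is orthonormal, the two band energies sum exactly to the signal energy (Parseval), so the printed fraction directly measures how little information the high band holds for slowly varying content.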

Multi-level wavelet transform structure

Multi-level WT. Hierarchical 3D Haar wavelet decomposition producing 8 subbands per level — {LLL, LLH, LHL, HLL} are retained; {LHH, HLH, HHL, HHH} are zeroed out.

Subband zeroing rule:
  B̃abc = Babc   if abc ∈ {LLL, LLH, LHL, HLL}   ← retained
  B̃abc = 0      otherwise                        ← zeroed out (HF)

Decode:  z̃ = 𝒲⁻¹({B̃abc})  →  v̂ = 𝒟(z̃)
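At generation time only the four retained subbands are needed; the zeroed bands are restored as zeros before the inverse transform. A hypothetical pair of packing/unpacking helpers (names and channel-first axis layout are our assumptions, not the paper's API) could look like:

```python
import numpy as np

RETAINED = ('LLL', 'LLH', 'LHL', 'HLL')  # low-frequency subbands kept for diffusion
ZEROED = ('LHH', 'HLH', 'HHL', 'HHH')    # high-frequency subbands discarded

def pack(bands):
    """Stack the retained subbands along the channel axis -> compact latent."""
    return np.concatenate([bands[k] for k in RETAINED], axis=0)

def zero_pad(latent):
    """Split a compact latent back into subbands, filling dropped bands with zeros."""
    chunks = np.split(latent, len(RETAINED), axis=0)
    bands = dict(zip(RETAINED, chunks))
    for k in ZEROED:
        bands[k] = np.zeros_like(chunks[0])
    return bands
```

Since each subband holds 1/8 of the coefficients and four of eight are kept, the compact latent that diffusion operates on is half the size of the uncompressed latent at each decomposition level.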

Experiments

State-of-the-Art Video Reconstruction

We evaluate LC-VAE against six strong baseline video VAEs across multiple large-scale benchmarks. Experiments cover video reconstruction quality, zero-shot generalization, and diffusion-based generation capability.

Validation PSNR training curves

Figure 5 — Validation PSNR during training. Across all channel sizes (Chn. = 4, 8, 16), LC-VAE consistently achieves higher validation PSNR than WF-VAE throughout training, demonstrating faster convergence and better reconstruction quality.

Table 1 — Main Reconstruction Performance

Zero-shot reconstruction on WebVid-10M and Panda-70M. TCPR = token compression ratio; Chn. = number of latent channels.

Method          TCPR   Chn.   WebVid-10M                       Panda-70M
                              PSNR↑   SSIM↑   LPIPS↓   rFVD↓   PSNR↑   SSIM↑   LPIPS↓   rFVD↓
SD-VAE          64×    4      30.19   0.838   0.057    284.9   30.46   0.890   0.040    183.0
SVD-VAE         64×    4      31.18   0.869   0.055    188.7   31.04   0.906   0.038    137.7
CV-VAE          256×   4      30.76   0.857   0.080    369.2   30.18   0.880   0.067    296.3
OD-VAE          256×   4      30.69   0.864   0.055    255.9   30.31   0.894   0.044    191.2
Open-Sora VAE   256×   4      31.52   0.876   0.056    208.5   31.21   0.910   0.040    155.6
WF-VAE          256×   16     33.62   0.912   0.036    96.2    33.11   0.938   0.026    76.4
LC-VAE (Ours)   256×   16     34.41   0.921   0.031    74.8    34.07   0.944   0.022    58.3

Qualitative reconstruction comparison

Figure 4 — Qualitative comparison of reconstruction performance. LC-VAE vs. WF-VAE under the same compression ratios (equivalent channels). LC-VAE reconstructs finer details with fewer artifacts across diverse scenes.

Table 2 — Zero-Shot Generalization

Reconstruction on three unseen datasets. LC-VAE maintains consistently high performance while WF-VAE suffers a 0.5–1.5 dB PSNR drop on out-of-domain data.

Method          Chn.   UCF-101                          SkyTimelapse                     OpenVid-1M
                       PSNR↑   SSIM↑   LPIPS↓   rFVD↓   PSNR↑   SSIM↑   LPIPS↓   rFVD↓   PSNR↑   SSIM↑   LPIPS↓   rFVD↓
WF-VAE          4      30.32   0.896   0.043    337.7   36.59   0.944   0.020    100.7   32.98   0.883   0.042    193.1
LC-VAE (Ours)   4      30.57   0.901   0.033    340.2   36.87   0.949   0.015    87.1    33.48   0.899   0.024    145.0
WF-VAE          8      31.86   0.920   0.029    189.5   37.71   0.954   0.014    61.7    34.20   0.907   0.031    109.9
LC-VAE (Ours)   8      32.69   0.929   0.023    198.9   38.65   0.962   0.010    50.4    35.27   0.925   0.017    89.4
WF-VAE          16     34.40   0.949   0.017    81.3    39.85   0.968   0.008    28.1    36.28   0.933   0.017    51.0
LC-VAE (Ours)   16     34.83   0.951   0.016    106.1   40.30   0.972   0.008    27.3    37.06   0.946   0.012    50.4

Table 3 — Video Generation Quality (Latte Diffusion)

FVD₁₆ and IS evaluated on SkyTimelapse (unconditional) and UCF-101 (class-conditional) using Latte-L trained for 200k steps.

Method          Chn.   SkyTimelapse   UCF-101
                       FVD₁₆↓         FVD₁₆↓   IS↑
WF-VAE          4      198.87         565.80   61.19
LC-VAE (Ours)   4      240.56         509.76   70.71
WF-VAE          8      213.23         687.60   60.57
LC-VAE (Ours)   8      201.24         654.96   66.72
WF-VAE          16     195.94         721.43   52.66
LC-VAE (Ours)   16     187.68         735.04   54.89

Generated videos using LC-VAE with Latte

Figure 6 — Generated videos using LC-VAE with Latte. SkyTimelapse (top) and UCF-101 (bottom) datasets. LC-VAE's compact low-frequency latent enables more coherent and higher-quality generation across diverse video categories.

Table 4 — Ablation: Joint Training vs. Post-Training Compression

Comparing LC-VAE against post-training latent compression (PTLC) applied to a pre-trained WF-VAE. Results confirm that joint training with latent compression is essential — PTLC degrades reconstruction by up to 5 dB PSNR.

Method          Chn.   WebVid-10M                Panda-70M
                       PSNR↑   SSIM↑   LPIPS↓    PSNR↑   SSIM↑   LPIPS↓
WF-VAE (PTLC)   8      29.24   0.839   0.068     27.51   0.853   0.078
LC-VAE (Ours)   8      31.49   0.921   0.025     31.89   0.891   0.030
WF-VAE (PTLC)   16     30.49   0.873   0.055     28.67   0.874   0.068
LC-VAE (Ours)   16     33.78   0.921   0.021     33.64   0.945   0.017

Citation

BibTeX

If you find LC-VAE useful in your research, please consider citing our paper:

@inproceedings{guan2026lcvae,
  title = {Latent-Compressed Variational Autoencoder for Video Diffusion Models},
  author = {Guan, Jiarui and Zhao, Wenshuai and Zou, Zhengtao and Kannala, Juho and Solin, Arno},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2026},
}