Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

Aniruddha Mahapatra¹

Long Mai¹

Yitian Zhang¹,²

David Bourgin¹

Feng Liu¹

¹ Adobe Research ² Northeastern University

Project Gallery

First row: videos showing reconstruction results under different settings.
Second row: videos showing text-to-video generation results with different latent spaces.
Third row: videos corresponding to figures in the main paper.

Reconstruction Comparison: 4X (Baselines), 8X, 16X, Overlapping Chunks
Text-to-Video Generation: 16X Latent, 4X vs. 16X Latent, 16X Latent Long Video, Overlapping Chunks
Main Paper: Figure 2, Figure 3, Figure 5, Figure 9

Reconstruction Comparison (4× Temporal Compression)

We compare the reconstruction results of several baseline methods with those of our method, ProMAG, at 4× temporal compression. The reconstructions produced by the video tokenizers in OpenSORA and OpenSORA-Plan are blurry and lack detail, especially in regions with high-frequency textures (such as tree leaves or human faces). In contrast, MagViTv2 produces sharper results than the other baselines. Even after we modify MagViTv2 for efficiency and to enable progressive growing, our model ProMAG achieves reconstruction quality similar to MagViTv2. Finally, ProMAG with a 16-channel latent has the best reconstruction quality in terms of detail preservation and motion quality.
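To make the relationship between temporal compression and latent channel count concrete, the following minimal sketch computes the latent tensor shape and the overall compression ratio for a tokenizer that downsamples uniformly. The function names, the 8× spatial factor, and the example clip size are illustrative assumptions and are not taken from this page; the sketch also ignores any special handling of the first frame that a causal tokenizer may apply.

```python
def latent_shape(frames, height, width, temporal_factor, z_dim, spatial_factor=8):
    """Return (z_dim, T', H', W') for a uniformly downsampling tokenizer.

    spatial_factor=8 is an illustrative assumption, not a value from this page.
    """
    if frames % temporal_factor or height % spatial_factor or width % spatial_factor:
        raise ValueError("input dimensions must be divisible by the compression factors")
    return (
        z_dim,
        frames // temporal_factor,
        height // spatial_factor,
        width // spatial_factor,
    )


def compression_ratio(frames, height, width, temporal_factor, z_dim, spatial_factor=8):
    """Ratio of RGB input values to latent values."""
    c, t, h, w = latent_shape(frames, height, width, temporal_factor, z_dim, spatial_factor)
    return (3 * frames * height * width) / (c * t * h * w)


# Hypothetical 16-frame, 256x256 clip at 4x temporal compression.
ratio_z8 = compression_ratio(16, 256, 256, temporal_factor=4, z_dim=8)
ratio_z16 = compression_ratio(16, 256, 256, temporal_factor=4, z_dim=16)
```

Under these assumptions, going from 8 to 16 latent channels at the same temporal compression halves the overall compression ratio, which is one way to read why the 16-channel variant can retain more detail.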

Ground-Truth

OpenSORA

OpenSORA-Plan

MagViTv2

ProMAG (z_dim=8)

ProMAG (z_dim=16)
