Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

Aniruddha Mahapatra¹

Long Mai¹

Yitian Zhang¹,²

David Bourgin¹

Feng Liu¹

¹ Adobe Research ² Northeastern University


First row: videos showing reconstruction results under different settings.
Second row: videos showing text-to-video generation results with different latent spaces.
Third row: videos corresponding to figures in the main paper.

Figure 9 (Main Paper)

[Subsampled Encoding + External Interpolation]

We show videos comparing our method, ProMAG with 16× temporal compression applied directly to a 24fps video, against a baseline that first encodes the video at a lower frame rate (with frame subsampling factor 𝑠_𝑓), here converting the 24fps video to 6fps with 𝑠_𝑓=4, and then uses an external interpolation method to generate the in-between frames and restore 24fps (4× interpolation). The baseline with external frame interpolation produces very blurry outputs in regions of abrupt, high-intensity motion, compared to our method ProMAG-16×, which operates on the 24fps video directly.
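The frame-count arithmetic behind this comparison can be sketched as follows. This is a minimal illustration, not the paper's implementation: `subsample_frames`, `latent_length`, and `linear_interpolate` are hypothetical helpers, the latent count assumes a plain ceiling division with no special first-frame handling, and the naive linear blend only stands in for whatever external interpolation method the baseline actually uses.

```python
import numpy as np


def subsample_frames(video: np.ndarray, s_f: int) -> np.ndarray:
    """Keep every s_f-th frame of a (T, H, W, C) video."""
    return video[::s_f]


def latent_length(num_frames: int, temporal_compression: int) -> int:
    """Latent frame count, ASSUMING plain ceiling division."""
    return -(-num_frames // temporal_compression)


def linear_interpolate(frames: np.ndarray, factor: int) -> np.ndarray:
    """Naive linear blend; a stand-in for an external interpolation model."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        for k in range(factor):
            w = k / factor  # w = 0 reproduces the kept frame exactly
            out.append((1.0 - w) * a + w * b)
    out.append(frames[-1])
    return np.stack(out)


fps, seconds, s_f = 24, 2, 4
rng = np.random.default_rng(0)
video = rng.random((fps * seconds, 8, 8, 3), dtype=np.float32)  # 48 frames

# Baseline path: 24fps -> 6fps, 4x temporal compression, 4x interpolation.
low_fps = subsample_frames(video, s_f)
baseline_latents = latent_length(len(low_fps), 4)
restored = linear_interpolate(low_fps, s_f)

# Direct path: 16x temporal compression on the original 24fps video.
direct_latents = latent_length(len(video), 16)

print(len(low_fps), len(restored), baseline_latents, direct_latents)
```

Under these assumptions both paths store the same number of latent frames for the clip, so the comparison is between two ways of spending the same temporal budget. In the baseline, the in-between frames are synthesized only from the frames that were kept, which is consistent with the blur seen where motion is abrupt.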

Ground-Truth

ProMAG-4× (𝑠_𝑓=4) + 4× Interpolation

ProMAG-16×

Ground-Truth

ProMAG-4× (𝑠_𝑓=4) + 4× Interpolation

ProMAG-16×

Ground-Truth

ProMAG-4× (𝑠_𝑓=4) + 4× Interpolation

ProMAG-16×

Ground-Truth

ProMAG-4× (𝑠_𝑓=4) + 4× Interpolation

ProMAG-16×