Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University

ArXiv Project Gallery

  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.
Reconstruction Comparison: 4X (Baselines)
Reconstruction Comparison: 8X
Reconstruction Comparison: 16X
Reconstruction Comparison: Overlapping Chunks

Text-to-Video Generation: 16X Latent
Text-to-Video Generation: 4X vs. 16X Latent
Text-to-Video Generation: 16X Latent Long Video
Text-to-Video Generation: Overlapping Chunks

Figure 2 (Main Paper)
Figure 3 (Main Paper)
Figure 5 (Main Paper)
Figure 9 (Main Paper)

Figure 9 (Main Paper)

[Subsampled Encoding + External Interpolation]

We show videos comparing our method, ProMAG with 16× temporal compression applied directly to a 24 fps video, against a baseline that first encodes the video at a lower frame rate using frame subsampling (factor 𝑠𝑓). In this case, 𝑠𝑓=4 turns the 24 fps video into a 6 fps video, which is encoded with ProMAG-4×; an external interpolation method then performs 4× interpolation to regenerate the in-between frames. The baseline with external frame interpolation produces very blurry outputs in regions of abrupt, high-intensity motion, compared with ProMAG-16× operating directly on the 24 fps video.
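
The baseline pipeline described above can be illustrated with a toy sketch. This is not the authors' code: each frame is reduced to a single float intensity, and `linear_interpolate` is a hypothetical stand-in for the unspecified external interpolation method. The function names and the example signal are assumptions made for illustration only.

```python
def subsample(frames, sf):
    """Keep every sf-th frame (e.g. 24 fps -> 6 fps for sf=4)."""
    return frames[::sf]


def linear_interpolate(frames, factor):
    """Hypothetical stand-in for an external interpolator.

    Linearly blends between consecutive kept frames so that each
    interval is expanded back to `factor` frames.
    """
    out = []
    for a, b in zip(frames, frames[1:]):
        for k in range(factor):
            t = k / factor
            out.append((1 - t) * a + t * b)
    out.append(frames[-1])
    return out


def effective_compression(tokenizer_factor, sf):
    """Overall temporal compression of subsampling followed by a tokenizer."""
    return tokenizer_factor * sf


# Toy signal: one intensity value per frame, with an abrupt event at frame 2.
frames = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

kept = subsample(frames, sf=4)          # frames 0, 4, 8 survive
restored = linear_interpolate(kept, 4)  # back to the original frame count

print(effective_compression(4, 4))      # 4x tokenizer with sf=4
print(restored[2], frames[2])           # the abrupt event at frame 2 is gone
```

In this toy setting the event at frame 2 falls between two kept frames, so no interpolator that sees only the subsampled frames can recover it. This is one plausible reading of why the baseline degrades under abrupt motion, while a tokenizer that sees all 24 fps frames has access to the event.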

Ground-Truth

ProMAG-4× (𝑠𝑓=4) + 4× Interpolation

ProMAG-16×


Ground-Truth

ProMAG-4× (𝑠𝑓=4) + 4× Interpolation

ProMAG-16×


Ground-Truth

ProMAG-4× (𝑠𝑓=4) + 4× Interpolation

ProMAG-16×


Ground-Truth

ProMAG-4× (𝑠𝑓=4) + 4× Interpolation

ProMAG-16×