Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University

ArXiv Project Gallery

  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.
Reconstruction Comparison
  • 4× (Baselines)
  • 8×
  • 16×
  • Overlapping Chunks

Text-to-Video Generation
  • 16× Latent
  • 4× vs 16× Latent
  • 16× Latent Long Video
  • Overlapping Chunks

Main Paper
  • Figure 2
  • Figure 3
  • Figure 5
  • Figure 9

Figure 2 (Main Paper)

[Motivation of Progressive Growing]

We show that directly training MagViTv2 for 16× temporal compression yields poor reconstruction quality on a 24 fps video (middle); 𝑠𝑓=1 denotes a frame subsampling factor of 1. However, we observed that the 4× temporal-compression MagViTv2 can still accurately reconstruct a 6 fps video, obtained by subsampling the frames of the same 24 fps video by a factor of 4, 𝑠𝑓=4 (right). This observation implies that it is not necessarily the large motion that degrades reconstruction, but that training all the weights of a much larger number of encoder and decoder blocks makes training unstable.
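To make the subsampling setup concrete, here is a minimal sketch of the frame-subsampling operation described above. The tensor layout (frames first) and the function name are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def subsample_frames(video: np.ndarray, sf: int) -> np.ndarray:
    """Keep every sf-th frame along the time axis.

    video: array of shape (T, H, W, C); sf: frame subsampling factor.
    """
    return video[::sf]

# A 2-second, 24 fps clip (48 frames), shape (T, H, W, C) -- synthetic data.
clip_24fps = np.random.rand(48, 64, 64, 3)

# sf=4 turns the 24 fps clip into an effective 6 fps clip (12 frames),
# which the 4x temporal-compression tokenizer can still reconstruct well.
clip_6fps = subsample_frames(clip_24fps, sf=4)
print(clip_6fps.shape)  # (12, 64, 64, 3)
```

Both inputs cover the same 2-second interval, so the motion magnitude between the first and last frame is identical; only the per-step frame gap differs.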

Panels (left to right): Ground-Truth · MagViTv2-16× (𝑠𝑓=1) · MagViTv2-4× (𝑠𝑓=4)