Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University


  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.
Reconstruction Comparison
  • 4× (Baselines)
  • 8×
  • 16×
  • Overlapping Chunks

Text-to-Video Generation
  • 16× Latent
  • 4× vs. 16× Latent
  • 16× Latent Long Video
  • Overlapping Chunks

Main Paper Figures
  • Figure 2
  • Figure 3
  • Figure 5
  • Figure 9

Reconstruction Comparison (Overlap vs. Non-Overlap)

[16× Temporal Compression]

Similar to MagViTv2, we encode and decode 17 frames (one chunk) at a time. As a result, slight jumps appear in the reconstruction every 17 frames in regions with high-frequency detail (left). Encoding and decoding with overlapping frames mitigates this issue: we encode chunks with an overlap of 4 video frames and, while decoding, blend the overlapped frames across the two adjacent chunks using linear interpolation weights. We compare reconstructions from our method, ProMAG, at 16× temporal compression with and without frame overlapping. Frame overlapping removes the jumps in texture otherwise observed every 17 frames (right).
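To make the blending step concrete, below is a minimal NumPy sketch of how overlapped decoded chunks could be merged with linear interpolation weights. The function name, the (frames, H, W, C) array layout, and the exact ramp shape are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def blend_overlapping_chunks(chunks, chunk_len=17, overlap=4):
    """Blend decoded video chunks that overlap by `overlap` frames.

    `chunks` is a list of arrays of shape (chunk_len, H, W, C); consecutive
    chunks share `overlap` frames. Overlapped frames are combined with
    linear interpolation weights so each blended frame's weights sum to 1.
    NOTE: hypothetical sketch; the paper's implementation may differ.
    """
    stride = chunk_len - overlap
    total = stride * (len(chunks) - 1) + chunk_len
    h, w, c = chunks[0].shape[1:]
    out = np.zeros((total, h, w, c), dtype=np.float32)
    weight = np.zeros((total, 1, 1, 1), dtype=np.float32)

    for i, chunk in enumerate(chunks):
        # Per-chunk weights: linear ramps over the overlapped frames,
        # ones elsewhere. The ramps of adjacent chunks sum to 1.
        w_chunk = np.ones((chunk_len, 1, 1, 1), dtype=np.float32)
        if i > 0:  # ramp up over the overlap with the previous chunk
            w_chunk[:overlap, 0, 0, 0] = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
        if i < len(chunks) - 1:  # ramp down over the overlap with the next chunk
            w_chunk[-overlap:, 0, 0, 0] = np.linspace(1.0, 0.0, overlap + 2)[1:-1]
        start = i * stride
        out[start:start + chunk_len] += w_chunk * chunk
        weight[start:start + chunk_len] += w_chunk

    return out / weight  # normalization is a no-op here but kept for safety

if __name__ == "__main__":
    # Three decoded chunks of 17 frames, overlapping by 4 -> 43 output frames.
    chunks = [np.random.rand(17, 64, 64, 3).astype(np.float32) for _ in range(3)]
    video = blend_overlapping_chunks(chunks, chunk_len=17, overlap=4)
    print(video.shape)  # (43, 64, 64, 3)
```

With an overlap of 4, each shared frame receives complementary weights (e.g. 0.8/0.2, 0.6/0.4, ...) from the two chunks, which smooths the transition that would otherwise appear at each chunk boundary.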

[Five video examples; each shows Ground-Truth, No Overlap, and Overlap = 4 side by side.]