Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University

  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.

Reconstruction Comparison: 4X (Baselines)
Reconstruction Comparison: 8X
Reconstruction Comparison: 16X
Reconstruction Comparison: Overlapping Chunks

Text-to-Video Generation: 16X Latent
Text-to-Video Generation: 4X v/s 16X Latent
Text-to-Video Generation: 16X Latent Long Video
Text-to-Video Generation: Overlapping Chunks

Figure 2 (Main Paper)
Figure 3 (Main Paper)
Figure 5 (Main Paper)
Figure 9 (Main Paper)

Text-to-Video Generation (Overlap v/s Non-Overlap Latent)

ProMAG-16× Latent Space

Similar to MagViTv2, we encode and decode one chunk of 17 frames at a time. As a result, text-to-video results show visible jumps every 17 frames in regions with high-frequency detail (left). We show that training the DiT on overlapping encoded chunks mitigates this issue. Here, we train the text-to-video model (DiT) on encoded frames with an overlap of 4 video frames between adjacent chunks; at generation time, we blend the overlapping frames of each pair of adjacent chunks using linear interpolation weights. We compare videos generated by text-to-video models trained on the latent space of our method, ProMAG, at 16× temporal compression, with and without frame overlapping. Frame overlapping removes the texture jumps otherwise observed every 17 frames (right). Both cases use the same number of latent frames. Without overlapping, the text-to-video model generates 68 frames; with a 4-frame overlap, it generates 56 frames. Both are at 192×352 resolution.
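
The blending step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `blend_chunks`, the (frames, height, width, channels) layout, blending on decoded frames rather than latents, and the exact weight ramp are all assumptions made for illustration. The page only states that overlapped frames are blended with linear interpolation weights.

```python
import numpy as np

def blend_chunks(chunks, overlap=4):
    """Join decoded chunks along time, cross-fading `overlap` shared frames.

    chunks: list of arrays shaped (frames, height, width, channels).
    The layout and the weight ramp are assumptions, not the paper's code.
    """
    chunks = [c.astype(np.float32) for c in chunks]
    if overlap == 0:
        # No shared frames: plain concatenation (the "No Overlap" case).
        return np.concatenate(chunks, axis=0)

    out = chunks[0]
    # Weights for the incoming chunk rise linearly, strictly inside (0, 1).
    w = np.linspace(0.0, 1.0, overlap + 2, dtype=np.float32)[1:-1]
    w = w.reshape((-1,) + (1,) * (out.ndim - 1))  # broadcast over H, W, C

    for nxt in chunks[1:]:
        # Fade out the tail of the running video, fade in the head of `nxt`.
        mixed = (1.0 - w) * out[-overlap:] + w * nxt[:overlap]
        out = np.concatenate([out[:-overlap], mixed, nxt[overlap:]], axis=0)
    return out

# Four 17-frame chunks with constant values 0, 1, 2, 3 (toy data).
toy = [np.full((17, 2, 2, 3), i, dtype=np.float32) for i in range(4)]
no_overlap = blend_chunks(toy, overlap=0)
with_overlap = blend_chunks(toy, overlap=4)
```

Under these assumptions, four 17-frame chunks give 4 × 17 = 68 frames without overlap and 68 − 3 × 4 = 56 frames with a 4-frame overlap, which matches the frame counts reported above.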

16× Latent (No Overlap)

16× Latent (Overlap = 4)

A tranquil tableau of tower.


16× Latent (No Overlap)

16× Latent (Overlap = 4)

A tranquil tableau of a wooden bench in the park.


16× Latent (No Overlap)

16× Latent (Overlap = 4)

A tranquil tableau of a beautiful wrought-iron bench surrounded by blooming flowers.


16× Latent (No Overlap)

16× Latent (Overlap = 4)

A tranquil tableau of a vintage rocking chair was placed on the porch


16× Latent (No Overlap)

16× Latent (Overlap = 4)

A tranquil tableau of the old red barn stood weathered and iconic against the backdrop of the countryside.


16× Latent (No Overlap)

16× Latent (Overlap = 4)

In a still frame, within the desolate desert, an oasis unfolded, characterized by the stoic presence of palm trees and a motionless, glassy pool of water.


16× Latent (No Overlap)

16× Latent (Overlap = 4)

In a still frame, the Parthenon's majestic Doric columns stand in serene solitude atop the Acropolis, framed by the tranquil Athenian landscape.


16× Latent (No Overlap)

16× Latent (Overlap = 4)

A tranquil tableau of a tranquil lakeside cabin nestled among tall pines, its reflection mirrored perfectly in the calm water.


16× Latent (No Overlap)

16× Latent (Overlap = 4)

A tranquil tableau of barn.


16× Latent (No Overlap)

16× Latent (Overlap = 4)

A tranquil tableau of cliff.


16× Latent (No Overlap)

16× Latent (Overlap = 4)

A tranquil tableau of kitchen.