Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University


  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.
Reconstruction Comparison: 4X (Baselines), 8X, 16X, Overlapping Chunks

Text-to-Video Generation: 16X Latent, 4X vs. 16X Latent, 16X Latent Long Video, Overlapping Chunks

Main Paper: Figure 2, Figure 3, Figure 5, Figure 9

Figure 3 (Main Paper)

[Spot Artifacts]

Using GroupNorm for normalization across the layers causes spot-like artifacts at the corner of the reconstruction result when combined with our progressive growing approach. The reconstruction shown is for 8× temporal compression (grown from 4× ProMAG). Removing the mean subtraction from group normalization (Custom Norm) eliminates the spot artifacts.
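To make the difference between the two normalizations concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The function name `group_norm`, the `subtract_mean` flag, the channels-by-positions list layout, and the omission of learnable scale and shift are all illustrative assumptions. It also assumes that once the mean subtraction is removed, each group is scaled by its root mean square; the page does not say how the scale is computed in Custom Norm.

```python
import math


def group_norm(x, num_groups, subtract_mean=True, eps=1e-5):
    """Normalize a feature map given as a list of channels, each a flat list of floats.

    subtract_mean=True  -> standard GroupNorm statistics (no affine parameters).
    subtract_mean=False -> assumed "Custom Norm": keep the mean in place and
                           divide each group by its root mean square.
    """
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    per_group = channels // num_groups

    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        n = len(values)

        # Hypothetical switch: a mean of zero means nothing is subtracted.
        mean = sum(values) / n if subtract_mean else 0.0
        # With mean == 0 this is the mean square, so the scale below is 1 / RMS.
        spread = sum((v - mean) ** 2 for v in values) / n
        scale = 1.0 / math.sqrt(spread + eps)

        for channel in group:
            out.append([(v - mean) * scale for v in channel])
    return out


# Two groups of two channels each; the second group is a constant region.
features = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [10.0, 10.0]]
standard = group_norm(features, num_groups=2)
custom = group_norm(features, num_groups=2, subtract_mean=False)
```

In this toy input the constant group is mapped to all zeros by the standard variant, because the mean is subtracted, while the assumed mean-free variant keeps its magnitude information. This only illustrates what "removing the mean subtraction" changes numerically. It does not explain why the spot artifacts appear.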

Ground-Truth

GroupNorm

Custom Norm