Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University


  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.
Reconstruction Comparison
  • 4X (Baselines)
  • 8X
  • 16X
  • Overlapping Chunks

Text-to-Video Generation
  • 16X Latent
  • 4X vs. 16X Latent
  • 16X Latent Long Video
  • Overlapping Chunks

Figures from the Main Paper
  • Figure 2
  • Figure 3
  • Figure 5
  • Figure 9

Figure 5 (Main Paper)

[High-Resolution Reconstruction]

To fit the tokenizer in memory for high-resolution encoding and decoding, we need to tile the latent spatially. Tiling the latent once before passing it through the decoder causes artifacts in the reconstructed video (middle), whereas Layer-wise Spatial Tiling resolves these artifacts (right). A code sketch of the idea follows the panel labels below.

  • Ground-Truth
  • Spatial Tiling before Decoder
  • Layer-wise Spatial Tiling
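
Below is a minimal PyTorch-style sketch of how layer-wise spatial tiling could work, assuming a decoder expressed as a list of nn.Module layers operating on (B, C, T, H, W) features. The helper names (run_layer_tiled, decode_tiled), the tile size, the overlap, and the averaging-based blending are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch: tile every decoder layer spatially instead of tiling
# the latent once before the decoder. All names and sizes are assumptions.
import torch
import torch.nn as nn


def run_layer_tiled(layer: nn.Module, x: torch.Tensor,
                    tile: int = 64, overlap: int = 8) -> torch.Tensor:
    """Apply one layer over overlapping H/W tiles and average the overlaps."""
    b, c, t, h, w = x.shape
    # Probe a corner crop to learn the layer's spatial scale and output shape
    # (assumes same-padding convs / integer upsampling, so the scale is constant).
    probe = layer(x[..., :min(h, tile), :min(w, tile)])
    scale_h = probe.shape[-2] / min(h, tile)
    scale_w = probe.shape[-1] / min(w, tile)
    out_h, out_w = int(round(h * scale_h)), int(round(w * scale_w))
    out = torch.zeros(b, probe.shape[1], probe.shape[2], out_h, out_w,
                      device=x.device, dtype=probe.dtype)
    weight = torch.zeros(1, 1, 1, out_h, out_w, device=x.device, dtype=probe.dtype)

    stride = tile - overlap
    for y0 in range(0, h, stride):
        for x0 in range(0, w, stride):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            tile_out = layer(x[..., y0:y1, x0:x1])
            oy0, ox0 = int(round(y0 * scale_h)), int(round(x0 * scale_w))
            oy1, ox1 = oy0 + tile_out.shape[-2], ox0 + tile_out.shape[-1]
            # Accumulate and count; overlapping regions are averaged below,
            # which hides the seams between neighbouring tiles.
            out[..., oy0:oy1, ox0:ox1] += tile_out
            weight[..., oy0:oy1, ox0:ox1] += 1.0
    return out / weight.clamp(min=1.0)


def decode_tiled(decoder_layers: nn.ModuleList, latent: torch.Tensor) -> torch.Tensor:
    """Layer-wise spatial tiling: tile inside every layer, never end-to-end."""
    x = latent
    for layer in decoder_layers:
        x = run_layer_tiled(layer, x)
    return x


# Example with a hypothetical decoder (same-padding 3D convs + 2x upsampling):
# decoder = nn.ModuleList([
#     nn.Conv3d(16, 64, 3, padding=1),
#     nn.Upsample(scale_factor=(1, 2, 2)),
#     nn.Conv3d(64, 3, 3, padding=1),
# ])
# video = decode_tiled(decoder, torch.randn(1, 16, 4, 128, 128))
```

Because the overlap is reapplied inside every layer, boundary context is refreshed at each stage of the decoder rather than being fixed once in the latent, which is why the seams produced by naive pre-decoder tiling do not appear.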