Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University

ArXiv Project Gallery

  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.
Reconstruction Comparison: 4X (Baselines), 8X, 16X, Overlapping Chunks

Text-to-Video Generation: 16X Latent, 4X vs. 16X Latent, 16X Latent Long Video, Overlapping Chunks

Main Paper: Figure 2, Figure 3, Figure 5, Figure 9

Text-to-Video Generation (ProMAG-4× vs. ProMAG-16× Latent Space)

We show examples comparing text-to-video generation results from our ProMAG-4× latent space (4× temporal compression) with those from our ProMAG-16× latent space (16× temporal compression). The highly compressed 16× latent space achieves video generation quality similar to that of the base 4× latent space, with a large boost in efficiency. Generated videos for both are at 192×320 resolution and have 68 frames.

4× Latent

16× Latent

A slow cinematic push in on an ostrich standing in a 1980s kitchen.


4× Latent

16× Latent

Entering a Martian cave to reveal an alien colony hidden within, Cinematic FPV.


4× Latent

16× Latent

Yellow mold growing in a petri dish, moody and dim lighting, cool tones, cold color grade, dynamic motion.


4× Latent

16× Latent

Hyperspeed hand held camera. An irregular sphere shape ball dramatically undulates, warps and explodes as it transforms into a completely different man. Surreal.


4× Latent

16× Latent

A young woman with vibrant red hair, adorned with a whimsical leafy crown, gazes off-camera with an expression of soft awe. Her freckled face is bathed in warm, golden sunlight, highlighting her porcelain complexion. The tousled strands of her hair dance in an unseen breeze, adding movement to the frame. A colorful, knitted scarf wraps her neck, contrasting with the serene blue waters visible in the background.


4× Latent

16× Latent

Tranquil lakeside scene during autumn. Wide shot of the entire lake, with colorful trees reflecting in the water. Slowly move the camera over the lake's surface, capturing ducks swimming and leaves floating. Finish with a close-up of a single leaf gently landing on the water.


4× Latent

16× Latent

A person sitting on a bed looking at a spectacular night sky full of galaxies and stars, view from behind, fisheye perspective, vibrant blue and purple colors, dreamlike and magical atmosphere, high resolution, intricate details in the clouds and galaxies, lighting dramatic.