Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University

ArXiv Project Gallery

  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.

Reconstruction Comparison: 4X (Baselines), 8X, 16X, Overlapping Chunks

Text-to-Video Generation: 16X Latent, 4X vs. 16X Latent, 16X Latent Long Video, Overlapping Chunks

Main Paper: Figure 2, Figure 3, Figure 5, Figure 9

Text-to-Video Generation (Long Video)

[ProMAG-16× Latent Space]

We have shown theoretically that, with our highly compact ProMAG-16× latent space (16× temporal compression), we can generate 340 frames with the same token budget required to generate 68 frames with the ProMAG-4× latent space (4× temporal compression). Below are text-to-video generation results for these very long videos. The generated videos are consistent with their text prompts and contain realistic motion. They are at 192×320 resolution and have 340 frames (around 14.1s at 24fps).
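
The frame counts and frame rate quoted above can be sanity-checked with a few lines of arithmetic. The following is only a minimal sketch: the helper name `clip_duration_seconds` and the variable names are illustrative assumptions, and the code does not model the tokenizer or the token budget itself, only the clip lengths implied by the stated numbers.

```python
# Illustrative arithmetic only; names here are hypothetical, not from the paper.

def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Return the playback duration of a clip in seconds."""
    return num_frames / fps

FPS = 24            # frame rate stated in the text
FRAMES_16X = 340    # frames generated with the ProMAG-16x latent space
FRAMES_4X = 68      # frames generated with the ProMAG-4x latent space

duration_16x = clip_duration_seconds(FRAMES_16X, FPS)
duration_4x = clip_duration_seconds(FRAMES_4X, FPS)
frame_ratio = FRAMES_16X / FRAMES_4X

print(f"16x latent: {FRAMES_16X} frames, {duration_16x:.2f} s at {FPS} fps")
print(f"4x latent:  {FRAMES_4X} frames, {duration_4x:.2f} s at {FPS} fps")
print(f"Frames per token budget, 16x relative to 4x: {frame_ratio:.1f}x")
```

Under these numbers, the same token budget covers five times as many frames with the 16× latent space as with the 4× latent space.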

Note: Hover over the text to see the full prompt.

An older man playing piano, lit from the side.

A slow cinematic push in on an ostrich standing in a 1980s kitchen.

An empty warehouse, zoom in into a wonderful jungle that emerges from...


A dynamic motion shot of ethereal underwater caustics dancing across a sandy seabed.

Yellow mold growing in a petri dish, moody and dim lighting, cool tones, cold color grade, dynamic motion.

Hyperspeed hand held camera. An irregular sphere shape ball dramatically undulates, warps and explodes as it...


An ultra-fast first-person POV hyper-lapse rapidly speeding through a forest fire into a snow capped...

Apple transforming into a baseball

Create a high-quality, realistic video portrait of an android posing against a black background...


A person sitting on a bed looking at a spectacular night sky full of galaxies and stars, view from behind...

High angle travelling of the city of Paris, completely submerged underwater, rich marine life.

Space Astronaut floating in space in front of a Magnetic field energy wormhole.


A cinematic time-lapse drone video of a journey from the New York City to the shores of Miami...

Starting from a ground-level view of a road leading towards a tunnel in a graffiti covered...

Visual: A night scene in a city with wet streets reflecting city lights. The camera starts on the...


The camera rotates around a large stack of vintage televisions all showing different programs...

A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi...

Tour of an art gallery with many beautiful works of art in different styles.