Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University

Gallery

  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.

Reconstruction Comparison: 4X (Baselines), 8X, 16X, Overlapping Chunks

Text-to-Video Generation: 16X Latent, 4X vs. 16X Latent, 16X Latent Long Video, Overlapping Chunks

Main Paper: Figure 2, Figure 3, Figure 5, Figure 9

Text-to-Video Generation (ProMAG-16× Latent Space)

We show text-to-video generation results with our ProMAG-16× latent space (16× temporal compression). This highly compressed latent space is well suited to video generation: a DiT trained on it generates videos that follow the prompts accurately and exhibit realistic motion. Generated videos are at 192×320 resolution and have 68 frames.
Note: Our goal is not to compete with state-of-the-art text-to-video generation methods, but only to show that our highly compressed latent space is compatible with text-to-video generation. The text-to-video experiment is relatively small-scale.

Note: Hover over the text to see the full prompt.

Extreme close up of a 24 year old woman eye blinking, standing in Marrakech during magic hour, cinematic...

The camera rotates around a large stack of vintage televisions all showing different programs 1950s sci-fi movies...

A stop motion animation of a flower growing out of the windowsill of a suburban house.

An extreme close-up of an gray-haired man with a beard in his 60s, he is deep in thought pondering the history...


A cat wearing sunglasses and working as a lifeguard at a pool.

A dog wearing virtual reality goggles in sunset, 4k, high resolution.

Beer pouring into glass, low angle video shot.

Campfire at night in a snowy forest with starry sky in the background.


Wood on fire.

A cool, psychedelic art, digital illustration of a sleek and stylish cat proudly perched...

In the aerial view of Santorini, white Cycladic buildings with blue domes are...

A cinematic close-up of a futuristic person clad in a long, dark science fiction coat...


A long exposure photograph capturing the tranquil harbor scene at night...

A woman with a flower headpiece, inspired by vray tracing, features vivid colors and...

A capybara made of pixelated voxels is seated in a lush green field...

An astronaut in a pressure suit is floating weightlessly among a breathtaking...


In a post-apocalyptic world, a German Shepherd dog wearing a bulky spacesuit lies amidst vibrant...

A cute smiling Yorkshire Terrier dressed in a cyberpunk costume sits comfortably on a futuristic chair. The chair...

illustration of a dog floating up into the sky in the flat vector rotoscoped style of "a scanner darkly," digital art.

An award winning photo of a stylishly dressed elderly woman wearing very large glasses in the style of...