Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

Figure 2 (Main Paper)

[Motivation of Progressive Growing]

Directly training MagViTv2 for 16× temporal compression yields poor reconstruction quality on a 24fps video (middle); here 𝑠_𝑓=1 denotes a frame subsampling factor of 1, i.e. no subsampling. In contrast, the 4× temporal compression MagViTv2 still accurately reconstructs the same video after its frames are subsampled by a factor of 4 (𝑠_𝑓=4), which gives a 6fps video (right). The subsampled video has larger motion between consecutive frames, yet it is reconstructed well. This suggests that large motion is not necessarily what degrades reconstruction; rather, training all the weights of a much larger number of encoder and decoder blocks at once makes training unstable.

Panels (left to right): Ground-Truth | MagViTv2-16× (𝑠_𝑓=1) | MagViTv2-4× (𝑠_𝑓=4)
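
The comparison in the caption above can be illustrated with a little arithmetic. The sketch below is not the authors' code: `subsample_frames` and `latent_frames` are hypothetical helpers, and the latent count is modeled as plain integer division, ignoring any special handling of the first frame that a particular tokenizer may apply. Under those assumptions, 16× compression of a 24fps clip and 4× compression of the same clip subsampled with 𝑠_𝑓=4 produce the same number of latent frames.

```python
def subsample_frames(frames, s_f):
    """Keep every s_f-th frame, starting with the first (hypothetical helper)."""
    if s_f < 1:
        raise ValueError("s_f must be a positive integer")
    return frames[::s_f]


def latent_frames(num_frames, temporal_compression):
    """Simplified latent-frame count: assumes plain integer division."""
    return num_frames // temporal_compression


fps = 24
clip = list(range(fps * 2))  # frame indices of a 2-second, 24fps clip

# Setting 1: 16x temporal compression on the full-rate clip (s_f = 1).
direct_latents = latent_frames(len(subsample_frames(clip, 1)), 16)

# Setting 2: 4x temporal compression after subsampling by s_f = 4 (6fps).
subsampled = subsample_frames(clip, 4)
subsampled_latents = latent_frames(len(subsampled), 4)

print(len(clip), len(subsampled), direct_latents, subsampled_latents)
```

In both settings each latent frame spans the same stretch of time in the original video, which is why the caption treats the two reconstructions as comparable.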

Aniruddha Mahapatra¹

Long Mai¹

Yitian Zhang¹,²

David Bourgin¹

Feng Liu¹

¹ Adobe Research ² Northeastern University
