Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces

1 Adobe Research    2 Northeastern University


  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.
Reconstruction Comparison
  • 4× (Baselines)
  • 8×
  • 16×
  • Overlapping Chunks

Text-to-Video Generation
  • 16× Latent
  • 4× vs. 16× Latent
  • 16× Latent Long Video
  • Overlapping Chunks

Main Paper Figures
  • Figure 2
  • Figure 3
  • Figure 5
  • Figure 9

Reconstruction Comparison (Overlap vs. Non-Overlap)

[16× Temporal Compression]

Similar to MagViTv2, we encode and decode 17 frames (one chunk) at a time. As a result, slight jumps appear in the reconstruction every 17 frames in regions with high-frequency detail (left). Encoding and decoding with overlapping frames mitigates this issue: we encode chunks with an overlap of 4 video frames and, while decoding, blend the overlapped frames across the two adjacent chunks using linear interpolation weights. We compare reconstructions from our method, ProMAG, at 16× temporal compression with and without frame overlapping. Frame overlapping removes the jumps in texture otherwise observed every 17 frames (right).
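To make the blending step concrete, below is a minimal NumPy sketch of how overlapped decoded chunks could be merged with linear interpolation weights. The function name, the (frames, H, W, C) array layout, and the exact ramp shape are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def blend_overlapping_chunks(chunks, chunk_len=17, overlap=4):
    """Blend decoded video chunks that overlap by `overlap` frames.

    `chunks` is a list of arrays of shape (chunk_len, H, W, C); consecutive
    chunks share `overlap` frames. Overlapped frames are combined with
    linear interpolation weights so each blended frame's weights sum to 1.
    NOTE: hypothetical sketch; the paper's implementation may differ.
    """
    stride = chunk_len - overlap
    total = stride * (len(chunks) - 1) + chunk_len
    h, w, c = chunks[0].shape[1:]
    out = np.zeros((total, h, w, c), dtype=np.float32)
    weight = np.zeros((total, 1, 1, 1), dtype=np.float32)

    for i, chunk in enumerate(chunks):
        # Per-chunk weights: linear ramps over the overlapped frames,
        # ones elsewhere. The ramps of adjacent chunks sum to 1.
        w_chunk = np.ones((chunk_len, 1, 1, 1), dtype=np.float32)
        if i > 0:  # ramp up over the overlap with the previous chunk
            w_chunk[:overlap, 0, 0, 0] = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
        if i < len(chunks) - 1:  # ramp down over the overlap with the next chunk
            w_chunk[-overlap:, 0, 0, 0] = np.linspace(1.0, 0.0, overlap + 2)[1:-1]
        start = i * stride
        out[start:start + chunk_len] += w_chunk * chunk
        weight[start:start + chunk_len] += w_chunk

    return out / weight  # normalization is a no-op here but kept for safety

if __name__ == "__main__":
    # Three decoded chunks of 17 frames, overlapping by 4 -> 43 output frames.
    chunks = [np.random.rand(17, 64, 64, 3).astype(np.float32) for _ in range(3)]
    video = blend_overlapping_chunks(chunks, chunk_len=17, overlap=4)
    print(video.shape)  # (43, 64, 64, 3)
```

With an overlap of 4, each shared frame receives complementary weights (e.g. 0.8/0.2, 0.6/0.4, ...) from the two chunks, which smooths the transition that would otherwise appear at each chunk boundary.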

[Five video examples; each shows Ground-Truth, No Overlap, and Overlap = 4 side by side.]