We show that directly training MagViTv2 for 16× temporal compression yields poor reconstruction quality on a 24 fps video (middle); 𝑠𝑓=1 denotes a frame subsampling factor of 1, i.e., no subsampling. In contrast, we observed that the 4× temporal compression MagViTv2 still accurately reconstructs a 6 fps video obtained by subsampling the frames of the same 24 fps video by a factor of 4 (𝑠𝑓=4, right). This observation suggests that the worse reconstruction is not necessarily caused by large motion; rather, training all the weights of a much larger number of encoder and decoder blocks makes training unstable.
Panels, left to right: Ground-Truth; MagViTv2-16× (𝑠𝑓=1); MagViTv2-4× (𝑠𝑓=4).
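The comparison above rests on simple arithmetic: a 16× model on the 24 fps video and a 4× model on the 6 fps subsampled video summarize roughly the same stretch of time per latent frame, so the motion each latent must capture is comparable. The following is a minimal sketch of that arithmetic, not the authors' code. The helper names `subsample_frames` and `seconds_per_latent` are hypothetical, frames are represented by their indices only, and the calculation ignores any special handling of the first frame in a causal tokenizer.

```python
from typing import List, Sequence


def subsample_frames(frames: Sequence[int], sf: int) -> List[int]:
    """Keep every sf-th frame (sf = frame subsampling factor)."""
    if sf < 1:
        raise ValueError("sf must be >= 1")
    return list(frames[::sf])


def seconds_per_latent(fps: float, sf: int, temporal_compression: int) -> float:
    """Approximate duration of source video summarized by one latent frame.

    Assumes each latent frame covers `temporal_compression` input frames,
    and that input frames are `sf / fps` seconds apart after subsampling.
    """
    return temporal_compression * sf / fps


# Placeholder 2-second clip at 24 fps, represented by frame indices only.
frames_24fps = list(range(48))
frames_6fps = subsample_frames(frames_24fps, sf=4)

# 16x compression on the original clip vs. 4x compression on the 6 fps clip.
span_16x = seconds_per_latent(fps=24, sf=1, temporal_compression=16)
span_4x = seconds_per_latent(fps=24, sf=4, temporal_compression=4)

print(len(frames_6fps), span_16x, span_4x)
```

Under these assumptions both settings cover the same duration per latent frame, which is consistent with the caption's argument that motion magnitude alone does not explain the quality gap.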