We compare reconstruction results from several baseline methods with our method, ProMAG, at 4× temporal compression. The reconstructions produced by the video tokenizers in OpenSORA and OpenSORA-Plan are blurry and lack detail, especially in regions with high-frequency textures (such as tree leaves or human faces). In contrast, MagViTv2 achieves sharper results than the other baselines. Even after modifying MagViTv2 for efficiency and to enable progressive growing, our model ProMAG achieves reconstruction quality similar to MagViTv2. Finally, ProMAG with a 16-channel latent has the best reconstruction quality in terms of detail preservation and motion quality.
[Video comparison grid, 10 examples. Panels in each example: Ground-Truth, OpenSORA, OpenSORA-Plan, MagViTv2, ProMAG (zdim=8), ProMAG (zdim=16).]