To fit the tokenizer in memory for high-resolution encoding and decoding, we need to tile the latent spatially. Tiling the latent before passing it through the decoder causes artifacts in the reconstructed video (middle). Layer-wise Spatial Tiling resolves these artifacts.
Panels, left to right: Ground-Truth; Spatial Tiling before Decoder; Layer-wise Spatial Tiling.
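The difference between the two tiling strategies can be illustrated with a toy sketch. This is not the tokenizer's actual implementation: it assumes that layer-wise tiling means each layer processes its input tile by tile while reading a small border (halo) of real neighboring values, and it stands in a hypothetical two-layer stack of 3x3 convolutions for the decoder. All function names below are illustrative. Under these assumptions, tiling before the decoder zero-pads at tile boundaries and produces errors near the seam, whereas the layer-wise variant reproduces the untiled output.

```python
import numpy as np

def conv3x3(x, k):
    """'Same' 3x3 cross-correlation with zero padding on a 2D array."""
    h, w = x.shape
    p = np.pad(x, 1)  # zero padding at the borders of whatever array is passed in
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + h, j:j + w]
    return out

def decoder(x, kernels):
    """Hypothetical toy decoder: a stack of 3x3 convolutions with tanh."""
    for k in kernels:
        x = np.tanh(conv3x3(x, k))
    return x

def tile_before_decoder(x, kernels, tile):
    """Split the latent along the width, decode each tile alone, then stitch."""
    parts = [decoder(x[:, s:s + tile], kernels)
             for s in range(0, x.shape[1], tile)]
    return np.concatenate(parts, axis=1)

def conv3x3_tiled(x, k, tile):
    """Apply one conv layer tile by tile, with a 1-column halo of real neighbors."""
    h, w = x.shape
    p = np.pad(x, 1)  # zero padding only at the true image border
    parts = []
    for s in range(0, w, tile):
        e = min(s + tile, w)
        patch = p[:, s:e + 2]  # tile columns plus one halo column on each side
        out = np.zeros((h, e - s))
        for i in range(3):
            for j in range(3):
                out += k[i, j] * patch[i:i + h, j:j + (e - s)]
        parts.append(out)
    return np.concatenate(parts, axis=1)

def layerwise_tiled_decoder(x, kernels, tile):
    """Tile inside every layer instead of once before the decoder."""
    for k in kernels:
        x = np.tanh(conv3x3_tiled(x, k, tile))
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((8, 16))
    kernels = [rng.standard_normal((3, 3)) for _ in range(2)]
    full = decoder(latent, kernels)
    naive = tile_before_decoder(latent, kernels, tile=8)
    layerwise = layerwise_tiled_decoder(latent, kernels, tile=8)
    print("max error, tiling before decoder:", np.abs(naive - full).max())
    print("max error, layer-wise tiling:    ", np.abs(layerwise - full).max())
```

In this toy the full feature map is still held between layers, so it shows only why the seam artifacts disappear, not the memory savings; a real implementation would also need to bound the per-layer working memory.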