Adapting Pre-Trained DiT Models to Vertical Images: A Training-Free Approach


Have you ever tried to apply a pre-trained DiT model to a new image orientation, only to encounter disappointing results? I’m sure I’m not the only one. Recently, I faced a similar challenge when I attempted to use a pre-trained conditional DiT model on vertical images, specifically human portraits.

The model was originally trained on horizontal images at a fixed resolution of 1280×720. But when I applied it to vertical images at 720×1280, I noticed unpleasant artifacts, especially in the bottom region.

The Challenge

The issue lies in how the model encodes spatial positions. Both the absolute positional embedding (APE), computed via per-axis SinCos functions, and the 2D RoPE used in the attention calculation are sensitive to the latent grid size. A vertical input produces a transposed latent (more rows than columns than the model ever saw during training), so the model struggles to adapt.
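To make the sensitivity concrete, here is a minimal numpy sketch of a per-axis SinCos 2D embedding. The grid sizes assume a hypothetical 16×16 patchification (so 1280×720 gives a 45×80 latent grid, and 720×1280 gives 80×45); the function names and dimensions are illustrative, not the model's actual code. Note that in the vertical case the row index runs up to 79, far beyond the maximum of 44 seen in training, which matches the artifacts appearing in the bottom region.

```python
import numpy as np

def sincos_1d(positions, dim):
    # Standard 1D sinusoidal embedding: half sine, half cosine channels.
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = np.outer(positions, omega)            # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(h, w, dim):
    # Per-axis embedding: half the channels encode the row index,
    # the other half the column index, concatenated per patch.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    emb_y = sincos_1d(ys.reshape(-1), dim // 2)    # (h*w, dim/2)
    emb_x = sincos_1d(xs.reshape(-1), dim // 2)
    return np.concatenate([emb_y, emb_x], axis=1)  # (h*w, dim)

# Hypothetical latent grids for 1280x720 vs 720x1280 with 16x patches.
horiz = sincos_2d(45, 80, 64)  # trained orientation: rows 0..44
vert  = sincos_2d(80, 45, 64)  # vertical input: rows now reach 79
```

The token count is identical (3600 patches), but the per-axis position distributions are swapped relative to training, so the embeddings the model receives are out of distribution along the vertical axis.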

The Question

Is there a training-free trick to improve the performance of a pre-trained DiT model on vertical images? I've already tried rotating the APE and RoPE embeddings to simulate a 'horizontal latent' for the vertical input, but it didn't help.
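For reference, the rotation attempt amounts to reindexing the trained positional table so that every position stays inside the range seen during training. Below is a hedged sketch of that idea; the grid sizes and the random stand-in for the trained APE table are assumptions, not the model's actual weights. One plausible reason this fails: although the positions are now in-distribution, the channels the model learned to associate with the horizontal axis end up indexing the vertical axis, so the spatial semantics are scrambled.

```python
import numpy as np

# Hypothetical trained grid: 45 rows x 80 cols (1280x720 with 16x patches).
h, w, dim = 45, 80, 64
rng = np.random.default_rng(0)
pe_horiz = rng.standard_normal((h * w, dim))  # stand-in for the trained APE table

def transpose_pe(pe, h, w):
    # Map patch (i, j) of the vertical w x h latent to the embedding of
    # patch (j, i) in the horizontal h x w grid -- a transpose of the
    # positional table, keeping every index within the trained range.
    return pe.reshape(h, w, -1).transpose(1, 0, 2).reshape(w * h, -1)

pe_vert = transpose_pe(pe_horiz, h, w)  # (80*45, dim) table for vertical input
```

The same reindexing applies to RoPE by swapping which axis each frequency band attends to, with the same caveat about scrambled axis semantics.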

Why This Matters

Adapting pre-trained models to new image orientations can have significant applications in various fields, such as computer vision, robotics, and healthcare. If we can find a way to make it work without requiring additional training data or computational resources, it could be a game-changer.

The Search for a Solution

I’m still searching for a solution to this problem. If you have any experience or insights, I’d love to hear them. Together, let’s explore the possibilities of training-free adaptation for DiT models.


Further reading: Conditional DiT Models
