Abstract

Recent advances in personalized generative models demonstrate impressive results in creating identity-consistent images of the same person under diverse settings. However, most methods can neither control the viewpoint of the generated image nor generate consistent multiple views of the person. To address this problem, we propose PersonalView, a lightweight adaptation method that enables an existing model to acquire multi-view generation capability with as few as 100 training samples. PersonalView consists of two key components. First, we design a conditioning architecture that takes advantage of the in-context learning ability of the pretrained diffusion transformer. Second, we preserve the original generative ability of the pretrained model with a new Semantic Correspondence Alignment Loss. We evaluate the multi-view consistency, text alignment, identity similarity, and visual quality of PersonalView and compare it with recent baselines that are potentially capable of multi-view customization. With only 100 training samples, PersonalView significantly outperforms baselines trained on a large corpus of multi-view data.

Overall Framework of PersonalView

In step 1, we fit an SMPL body mesh to a sample from the personalized generator and render the mesh from multiple viewpoints to obtain depth maps. In step 2, the personalized model, equipped with our control adapter and conditioned on these in-context depth maps, generates the multi-view customization images.
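To make the depth-rendering part of step 1 concrete, the following is a minimal sketch of how multi-view depth maps could be produced from a set of mesh vertices. It is not the PersonalView renderer: it assumes an orthographic camera, splats vertices as points instead of rasterizing triangles, and uses a random ellipsoid as a stand-in for fitted SMPL vertices. The function names (`rotate_y`, `render_depth`, `multiview_depth`) and all parameters are hypothetical.

```python
import numpy as np


def rotate_y(points, angle_deg):
    """Rotate an (N, 3) point array about the vertical (y) axis."""
    t = np.deg2rad(angle_deg)
    c, s = np.cos(t), np.sin(t)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return points @ rot.T


def render_depth(points, size=64, extent=1.0, cam_dist=3.0):
    """Orthographic point-splat depth map; 0 marks empty pixels.

    Assumption: the camera sits at z = +cam_dist and looks down -z,
    so the depth of a point is cam_dist - z.
    """
    depth_vals = cam_dist - points[:, 2]
    # Map x and y in [-extent, extent] to pixel columns and rows.
    cols = np.floor((points[:, 0] + extent) / (2 * extent) * size).astype(int)
    rows = np.floor((extent - points[:, 1]) / (2 * extent) * size).astype(int)
    inside = (cols >= 0) & (cols < size) & (rows >= 0) & (rows < size)
    depth = np.full((size, size), np.inf)
    # Z-buffer step: keep the nearest depth that lands on each pixel.
    np.minimum.at(depth, (rows[inside], cols[inside]), depth_vals[inside])
    depth[np.isinf(depth)] = 0.0
    return depth


def multiview_depth(points, angles=(0, 90, 180, 270), **kwargs):
    """Render one depth map per azimuth angle; returns (V, size, size)."""
    return np.stack([render_depth(rotate_y(points, a), **kwargs) for a in angles])


# Stand-in for fitted SMPL vertices: random points on a body-sized ellipsoid.
rng = np.random.default_rng(0)
vertices = rng.normal(size=(2000, 3))
vertices /= np.linalg.norm(vertices, axis=1, keepdims=True)
vertices *= np.array([0.3, 0.8, 0.2])

depth_maps = multiview_depth(vertices)
print(depth_maps.shape)
```

In a real pipeline the vertices would come from an SMPL fit and a mesh rasterizer would replace the point splatting; the sketch only illustrates the rotate-then-render loop that yields one depth map per target view.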

Qualitative comparison

DiffPortrait3D and Era3D struggle to maintain geometric and visual consistency, especially in full-body and background regions. ViewCrafter models the scene better, but at the expense of geometric consistency in the human subject. BAGEL and Qwen-Image both show suboptimal multi-view control. In contrast, PersonalView achieves superior geometric fidelity and visual coherence across views.