Multi-subject personalized generation presents unique challenges in maintaining identity fidelity and semantic coherence when synthesizing images conditioned on multiple reference subjects. Existing methods often suffer from identity blending and attribute leakage because they inadequately model how different subjects should interact within shared representation spaces. We present MOSAIC, a representation-centric framework that rethinks multi-subject generation through explicit semantic correspondence and orthogonal feature disentanglement. Our key insight is that multi-subject generation requires precise semantic alignment at the representation level: knowing exactly which regions in the generated image should attend to which parts of each reference. To enable this, we introduce SemAlign-MS, a meticulously annotated dataset providing fine-grained semantic correspondences between multiple reference subjects and target images, previously unavailable in this domain. Building on this foundation, we propose a semantic correspondence attention loss that enforces precise point-to-point semantic alignment, ensuring high consistency from each reference to its designated regions. Furthermore, we develop a multi-reference disentanglement loss that pushes different subjects into orthogonal attention subspaces, preventing feature interference while preserving individual identity characteristics. Extensive experiments demonstrate that MOSAIC achieves state-of-the-art performance on multiple benchmarks. Notably, while existing methods typically degrade beyond three subjects, MOSAIC maintains high fidelity with four or more reference subjects, opening new possibilities for complex multi-subject synthesis applications.
MOSAIC introduces two key supervisions: (1) Semantic Correspondence Attention Loss (blue region) enforces precise point-to-point alignment between reference tokens and their corresponding locations in the target latent, ensuring high consistency; (2) Multi-Reference Disentanglement Loss (green region) maximizes the divergence between different references’ attention distributions, pushing each subject into orthogonal representational subspaces.
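The two supervisions above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed shapes, not the paper's exact formulation: we take a row-stochastic attention matrix from reference tokens to target latent locations, a list of annotated point correspondences for the alignment term, and pairwise cosine overlap between per-subject attention distributions as a simple surrogate for the divergence objective.

```python
import numpy as np

def sca_loss(attn, correspondences):
    """Semantic correspondence attention loss (sketch).

    attn: (R, T) array; row r is reference token r's attention over the
          T target latent locations (rows sum to 1).
    correspondences: annotated (ref_token_idx, target_loc_idx) pairs.
    Cross-entropy that concentrates each reference token's attention on
    its annotated target location.
    """
    eps = 1e-8
    return -float(np.mean([np.log(attn[r, t] + eps) for r, t in correspondences]))

def disentanglement_loss(attn, subject_of_token):
    """Multi-reference disentanglement loss (sketch).

    Averages each subject's token-level attention rows into one
    per-subject distribution over target locations, then penalizes
    pairwise cosine overlap so different subjects occupy
    near-orthogonal attention subspaces.
    """
    subjects = sorted(set(subject_of_token))
    dists = []
    for s in subjects:
        rows = attn[[i for i, sid in enumerate(subject_of_token) if sid == s]]
        d = rows.mean(axis=0)
        dists.append(d / (np.linalg.norm(d) + 1e-8))
    loss, pairs = 0.0, 0
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            loss += float(np.dot(dists[i], dists[j]))  # overlap -> penalize
            pairs += 1
    return loss / max(pairs, 1)
```

In this toy form, sharply peaked attention at the annotated locations drives both terms toward zero, while scattered or overlapping attention between subjects inflates them; the actual training objective operates on cross-attention maps inside the diffusion backbone.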
To address the community's lack of multi-reference data with validated semantic correspondences, we developed a systematic five-stage pipeline for generating high-quality multi-reference training data.
We illustrate the progressive improvement of attention alignment and disentanglement with our proposed losses. Attention maps from reference regions (teddy bear’s goggles and Minion’s “001” text) show that: (1) the baseline produces scattered attention with cross-reference interference; (2) $ \mathcal{L}_\text{SCA} $ enhances semantic alignment but remains diffuse; (3) $ \mathcal{L}_\text{SCA} + \mathcal{L}_\text{rd} $ achieves precise alignment and effective disentanglement. Blue and green lines link reference regions to their attention peaks.
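For readers reproducing this kind of visualization, the per-region attention maps and the peaks that the blue and green lines point to can be extracted roughly as follows. Function names and shapes here are illustrative assumptions, not the paper's released code: we assume access to a reference-to-target attention matrix and the token indices covering a reference region (e.g. the goggles or the "001" text).

```python
import numpy as np

def region_attention(attn, token_ids, h, w):
    """Average the attention rows of a reference region's tokens and
    reshape the result to the target latent's spatial grid (h, w)."""
    return attn[token_ids].mean(axis=0).reshape(h, w)

def attention_peak(attn_map):
    """Return the (row, col) spatial location where attention is maximal."""
    return np.unravel_index(np.argmax(attn_map), attn_map.shape)
```

The resulting 2D map can then be upsampled and overlaid on the generated image, with a line drawn from the reference region to `attention_peak`.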