Multi-subject personalized generation presents unique challenges in maintaining identity fidelity and semantic coherence when synthesizing images conditioned on multiple reference subjects. Existing methods often suffer from identity blending and attribute leakage because they inadequately model how different subjects should interact within shared representation spaces. We present MOSAIC, a representation-centric framework that rethinks multi-subject generation through explicit semantic correspondence and orthogonal feature disentanglement. Our key insight is that multi-subject generation requires precise semantic alignment at the representation level: knowing exactly which regions in the generated image should attend to which parts of each reference. To enable this, we introduce SemAlign-MS, a meticulously annotated dataset providing fine-grained semantic correspondences between multiple reference subjects and target images, previously unavailable in this domain. Building on this foundation, we propose a semantic correspondence attention loss that enforces precise point-to-point semantic alignment, ensuring high consistency from each reference to its designated regions. Furthermore, we develop a multi-reference disentanglement loss that pushes different subjects into orthogonal attention subspaces, preventing feature interference while preserving individual identity characteristics. Extensive experiments demonstrate that MOSAIC achieves state-of-the-art performance on multiple benchmarks. Notably, while existing methods typically degrade beyond three subjects, MOSAIC maintains high fidelity with four or more reference subjects, opening new possibilities for complex multi-subject synthesis applications.
MOSAIC introduces two key supervisions: (1) Semantic Correspondence Attention Loss (blue region) enforces precise point-to-point alignment between reference tokens and their corresponding locations in the target latent, ensuring high consistency; (2) Multi-Reference Disentanglement Loss (green region) maximizes the divergence between different references’ attention distributions, pushing each subject into orthogonal representational subspaces.
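The two supervisions above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed shapes, not the paper's exact formulation: we take a row-stochastic attention matrix from reference tokens to target latent locations, a list of annotated point correspondences for the alignment term, and pairwise cosine overlap between per-subject attention distributions as a simple surrogate for the divergence objective.

```python
import numpy as np

def sca_loss(attn, correspondences):
    """Semantic correspondence attention loss (sketch).

    attn: (R, T) array; row r is reference token r's attention over the
          T target latent locations (rows sum to 1).
    correspondences: annotated (ref_token_idx, target_loc_idx) pairs.
    Cross-entropy that concentrates each reference token's attention on
    its annotated target location.
    """
    eps = 1e-8
    return -float(np.mean([np.log(attn[r, t] + eps) for r, t in correspondences]))

def disentanglement_loss(attn, subject_of_token):
    """Multi-reference disentanglement loss (sketch).

    Averages each subject's token-level attention rows into one
    per-subject distribution over target locations, then penalizes
    pairwise cosine overlap so different subjects occupy
    near-orthogonal attention subspaces.
    """
    subjects = sorted(set(subject_of_token))
    dists = []
    for s in subjects:
        rows = attn[[i for i, sid in enumerate(subject_of_token) if sid == s]]
        d = rows.mean(axis=0)
        dists.append(d / (np.linalg.norm(d) + 1e-8))
    loss, pairs = 0.0, 0
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            loss += float(np.dot(dists[i], dists[j]))  # overlap -> penalize
            pairs += 1
    return loss / max(pairs, 1)
```

In this toy form, sharply peaked attention at the annotated locations drives both terms toward zero, while scattered or overlapping attention between subjects inflates them; the actual training objective operates on cross-attention maps inside the diffusion backbone.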
To address the community's lack of multi-reference data with validated semantic correspondences, we developed a systematic five-stage pipeline for generating high-quality multi-reference training data.
We illustrate the progressive improvement of attention alignment and disentanglement with our proposed losses. Attention maps from reference regions (teddy bear’s goggles and Minion’s “001” text) show that: (1) the baseline produces scattered attention with cross-reference interference; (2) $ \mathcal{L}_\text{SCA} $ enhances semantic alignment but remains diffuse; (3) $ \mathcal{L}_\text{SCA} + \mathcal{L}_\text{rd} $ achieves precise alignment and effective disentanglement. Blue and green lines link reference regions to their attention peaks.
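For readers reproducing this kind of visualization, the per-region attention maps and the peaks that the blue and green lines point to can be extracted roughly as follows. Function names and shapes here are illustrative assumptions, not the paper's released code: we assume access to a reference-to-target attention matrix and the token indices covering a reference region (e.g. the goggles or the "001" text).

```python
import numpy as np

def region_attention(attn, token_ids, h, w):
    """Average the attention rows of a reference region's tokens and
    reshape the result to the target latent's spatial grid (h, w)."""
    return attn[token_ids].mean(axis=0).reshape(h, w)

def attention_peak(attn_map):
    """Return the (row, col) spatial location where attention is maximal."""
    return np.unravel_index(np.argmax(attn_map), attn_map.shape)
```

The resulting 2D map can then be upsampled and overlaid on the generated image, with a line drawn from the reference region to `attention_peak`.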