1. Unified Conditioning & Dynamic Modality Adaptation
Unlike prior works that require a separate model per modality, MMFace-DiT adapts to masks or sketches dynamically in a single forward pass. This is driven by a global conditioning vector, \(C_{global}\), built with a novel Modality Embedder \(E_{modality}\) that maps a discrete modality flag to a dense vector:
$$ C_{global} = E_{time}(t) + E_{caption}(c_{pooled}) + E_{modality}(m) $$
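The construction of \(C_{global}\) can be illustrated with a minimal PyTorch sketch. The module names, hidden widths, and the sinusoidal timestep featurization below are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class GlobalConditioning(nn.Module):
    """Sketch of C_global = E_time(t) + E_caption(c_pooled) + E_modality(m)."""
    def __init__(self, dim, num_modalities=2, caption_dim=768):
        super().__init__()
        # Sinusoidal timestep features projected to the model width (E_time).
        self.time_mlp = nn.Sequential(nn.Linear(256, dim), nn.SiLU(), nn.Linear(dim, dim))
        # Pooled caption embedding projected to the model width (E_caption).
        self.caption_proj = nn.Linear(caption_dim, dim)
        # Discrete modality flag (e.g. 0 = mask, 1 = sketch) -> dense vector (E_modality).
        self.modality_emb = nn.Embedding(num_modalities, dim)

    def timestep_features(self, t, n=256):
        half = n // 2
        freqs = torch.exp(-torch.arange(half) * (torch.log(torch.tensor(10000.0)) / half))
        args = t[:, None].float() * freqs[None]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def forward(self, t, c_pooled, m):
        return (self.time_mlp(self.timestep_features(t))
                + self.caption_proj(c_pooled)
                + self.modality_emb(m))

cond = GlobalConditioning(dim=512)
c_global = cond(torch.tensor([10]), torch.randn(1, 768), torch.tensor([1]))
```

Because the three terms are summed into one vector, switching modality only changes one additive embedding, which is what allows a single set of weights to serve all input types.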
2. Adaptive Layer Normalization (AdaLN)
The unified global conditioning vector conditions each transformer block independently. It is transformed into a comprehensive set of modulation parameters \(\{\gamma, \beta, \alpha\}\) for both the attention and MLP components. This enables text, timestep, and the active modality to exert fine-grained, layer-specific control over the entire network.
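A minimal sketch of this per-block modulation, in the style of adaLN-Zero (the zero-initialization and exact parameter layout here are assumptions for illustration):

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Maps C_global to {gamma, beta, alpha} for the attention and MLP
    sub-blocks of one transformer layer."""
    def __init__(self, dim):
        super().__init__()
        # One projection emits all six modulation tensors for this layer.
        self.mod = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.mod[-1].weight)  # zero-init: gates start closed
        nn.init.zeros_(self.mod[-1].bias)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, x, c_global):
        g1, b1, a1, g2, b2, a2 = self.mod(c_global).chunk(6, dim=-1)
        # AdaLN(x) = norm(x) * (1 + gamma) + beta, per sub-block
        x_attn = self.norm(x) * (1 + g1.unsqueeze(1)) + b1.unsqueeze(1)
        x_mlp = self.norm(x) * (1 + g2.unsqueeze(1)) + b2.unsqueeze(1)
        return x_attn, x_mlp, a1, a2  # alphas gate the residual updates

mod = AdaLNModulation(512)
x_attn, x_mlp, a1, a2 = mod(torch.randn(2, 16, 512), torch.randn(2, 512))
```

The zero-initialized projection means every block initially behaves as an identity residual stream, a common stabilization choice for conditioned DiT training.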
3. Dual-Stream Shared RoPE Attention for Deep Fusion
Our transformer processes image tokens (\(T_i\)) and text tokens (\(T_t\)) in parallel streams. To prevent modal dominance, they are continuously fused via a central, shared Multi-Head Attention mechanism. We apply 2D axial RoPE for spatial image patches and 1D sequential RoPE for text tokens:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{\text{RoPE}(Q)\text{RoPE}(K)^T}{\sqrt{d_k}}\right)V $$
This mechanism allows every image patch to bidirectionally attend to every text token, ensuring precise spatial-semantic alignment.
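The joint-attention fusion can be sketched as follows. For brevity this sketch applies 1D RoPE to both streams, whereas the method above uses 2D axial RoPE for image patches; the helper names and shapes are assumptions:

```python
import torch

def rope_1d(x, base=10000.0):
    """Apply 1D rotary position embedding to (batch, seq, dim) queries/keys.
    Channel pairs are rotated by position-dependent angles."""
    b, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None]  # (s, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def shared_attention(q_img, k_img, v_img, q_txt, k_txt, v_txt):
    """One softmax over the concatenated image+text tokens, so every image
    patch attends to every text token and vice versa."""
    q = torch.cat([rope_1d(q_img), rope_1d(q_txt)], dim=1)
    k = torch.cat([rope_1d(k_img), rope_1d(k_txt)], dim=1)
    v = torch.cat([v_img, v_txt], dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    out = attn @ v
    n_img = q_img.shape[1]
    return out[:, :n_img], out[:, n_img:]  # split back into the two streams

out_img, out_txt = shared_attention(
    *(torch.randn(2, 100, 64) for _ in range(3)),   # image tokens T_i
    *(torch.randn(2, 7, 64) for _ in range(3)))     # text tokens T_t
```

Because a single attention map spans both token sets, neither stream can ignore the other, which is the mechanism behind the "deep fusion" claim.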
4. Dynamic Gated Residual Connections
Following the attention and MLP operations, we employ a gated residual connection that modulates each update to the token stream \(T_{in}\). The gating vector \(\alpha\), derived from \(C_{global}\), acts as a dynamic, learned filter that selectively emphasizes or suppresses each branch's contribution. This prevents strong geometric priors (e.g., a dense sketch) from overpowering subtle semantic cues (e.g., text descriptors):
$$ T_{out} = T_{in} + \alpha \odot F(\text{AdaLN}(T_{in}, \gamma, \beta)) $$
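A minimal sketch of the gate, assuming \(\alpha\) is a per-channel vector broadcast over the token sequence:

```python
import torch

def gated_residual(t_in, alpha, branch_out):
    """T_out = T_in + alpha ⊙ F(AdaLN(T_in, gamma, beta)).
    t_in:       (batch, tokens, dim) residual stream
    alpha:      (batch, dim) gate derived from C_global
    branch_out: (batch, tokens, dim) attention or MLP output F(...)"""
    return t_in + alpha.unsqueeze(1) * branch_out

x = torch.randn(2, 16, 64)
alpha = torch.zeros(2, 64)  # fully closed gate: the branch contributes nothing
y = gated_residual(x, alpha, torch.randn(2, 16, 64))
```

With \(\alpha = 0\) the block is exactly the identity, which is how the network can learn to mute an overpowering conditioning branch on a per-layer basis.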
5. VLM-Powered Data Enrichment & Optimization
To overcome the bottleneck of semantically shallow annotations in existing face datasets, we build a robust annotation pipeline on the InternVL3 vision-language model and the Qwen3 LLM, yielding 1M high-quality, descriptive captions. The entire framework operates in the compressed latent space of the 16-channel FLUX VAE and natively supports optimization via both DDPM (Min-SNR) and Rectified Flow Matching (RFM) objectives.
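The RFM objective mentioned above can be sketched as a standard rectified-flow velocity-matching loss (the function signature and dummy model here are illustrative assumptions, not the paper's training code):

```python
import torch

def rfm_loss(model, x0, cond):
    """Rectified flow matching: interpolate linearly between noise and data,
    and train the model to predict the constant velocity (x0 - noise)."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)
    tb = t.view(-1, *([1] * (x0.dim() - 1)))   # broadcast t over latent dims
    x_t = (1 - tb) * noise + tb * x0           # straight path from noise to data
    v_target = x0 - noise                      # constant velocity along the path
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()

x0 = torch.randn(4, 16, 8, 8)                  # e.g., 16-channel VAE latents
dummy = lambda x, t, c: torch.zeros_like(x)    # stand-in for the DiT
loss = rfm_loss(dummy, x0, cond=None)
```

Because the interpolation path is a straight line, the target velocity is timestep-independent, which is the property that makes rectified-flow sampling efficient with few steps.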