1. Unified Conditioning & Dynamic Modality Adaptation
Unlike prior works that require a separate model per modality, MMFace-DiT adapts to masks or sketches dynamically in a single forward pass. This is driven by a global conditioning vector, \(C_{global}\), built with a novel Modality Embedder \(E_{modality}\) that maps a discrete modality flag to a dense vector:
$$ C_{global} = E_{time}(t) + E_{caption}(c_{pooled}) + E_{modality}(m) $$
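The construction of \(C_{global}\) can be illustrated with a minimal PyTorch sketch. The module names, hidden widths, and the sinusoidal timestep featurization below are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class GlobalConditioning(nn.Module):
    """Sketch of C_global = E_time(t) + E_caption(c_pooled) + E_modality(m)."""
    def __init__(self, dim, num_modalities=2, caption_dim=768):
        super().__init__()
        # Sinusoidal timestep features projected to the model width (E_time).
        self.time_mlp = nn.Sequential(nn.Linear(256, dim), nn.SiLU(), nn.Linear(dim, dim))
        # Pooled caption embedding projected to the model width (E_caption).
        self.caption_proj = nn.Linear(caption_dim, dim)
        # Discrete modality flag (e.g. 0 = mask, 1 = sketch) -> dense vector (E_modality).
        self.modality_emb = nn.Embedding(num_modalities, dim)

    def timestep_features(self, t, n=256):
        half = n // 2
        freqs = torch.exp(-torch.arange(half) * (torch.log(torch.tensor(10000.0)) / half))
        args = t[:, None].float() * freqs[None]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def forward(self, t, c_pooled, m):
        return (self.time_mlp(self.timestep_features(t))
                + self.caption_proj(c_pooled)
                + self.modality_emb(m))

cond = GlobalConditioning(dim=512)
c_global = cond(torch.tensor([10]), torch.randn(1, 768), torch.tensor([1]))
```

Because the three terms are summed into one vector, switching modality only changes one additive embedding, which is what allows a single set of weights to serve all input types.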
2. Adaptive Layer Normalization (AdaLN)
The unified global conditioning vector conditions each transformer block independently. It is transformed into a comprehensive set of modulation parameters \(\{\gamma, \beta, \alpha\}\) for both the attention and MLP components. This enables text, timestep, and the active modality to exert fine-grained, layer-specific control over the entire network.
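A minimal sketch of this per-block modulation, in the style of adaLN-Zero (the zero-initialization and exact parameter layout here are assumptions for illustration):

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Maps C_global to {gamma, beta, alpha} for the attention and MLP
    sub-blocks of one transformer layer."""
    def __init__(self, dim):
        super().__init__()
        # One projection emits all six modulation tensors for this layer.
        self.mod = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.mod[-1].weight)  # zero-init: gates start closed
        nn.init.zeros_(self.mod[-1].bias)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, x, c_global):
        g1, b1, a1, g2, b2, a2 = self.mod(c_global).chunk(6, dim=-1)
        # AdaLN(x) = norm(x) * (1 + gamma) + beta, per sub-block
        x_attn = self.norm(x) * (1 + g1.unsqueeze(1)) + b1.unsqueeze(1)
        x_mlp = self.norm(x) * (1 + g2.unsqueeze(1)) + b2.unsqueeze(1)
        return x_attn, x_mlp, a1, a2  # alphas gate the residual updates

mod = AdaLNModulation(512)
x_attn, x_mlp, a1, a2 = mod(torch.randn(2, 16, 512), torch.randn(2, 512))
```

The zero-initialized projection means every block initially behaves as an identity residual stream, a common stabilization choice for conditioned DiT training.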
3. Dual-Stream Shared RoPE Attention for Deep Fusion
Our transformer processes image tokens (\(T_i\)) and text tokens (\(T_t\)) in parallel streams. To prevent modal dominance, they are continuously fused via a central, shared Multi-Head Attention mechanism. We apply 2D axial RoPE for spatial image patches and 1D sequential RoPE for text tokens:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{\text{RoPE}(Q)\text{RoPE}(K)^T}{\sqrt{d_k}}\right)V $$
This mechanism allows every image patch to bidirectionally attend to every text token, ensuring precise spatial-semantic alignment.
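The joint-attention fusion can be sketched as follows. For brevity this sketch applies 1D RoPE to both streams, whereas the method above uses 2D axial RoPE for image patches; the helper names and shapes are assumptions:

```python
import torch

def rope_1d(x, base=10000.0):
    """Apply 1D rotary position embedding to (batch, seq, dim) queries/keys.
    Channel pairs are rotated by position-dependent angles."""
    b, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None]  # (s, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def shared_attention(q_img, k_img, v_img, q_txt, k_txt, v_txt):
    """One softmax over the concatenated image+text tokens, so every image
    patch attends to every text token and vice versa."""
    q = torch.cat([rope_1d(q_img), rope_1d(q_txt)], dim=1)
    k = torch.cat([rope_1d(k_img), rope_1d(k_txt)], dim=1)
    v = torch.cat([v_img, v_txt], dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    out = attn @ v
    n_img = q_img.shape[1]
    return out[:, :n_img], out[:, n_img:]  # split back into the two streams

out_img, out_txt = shared_attention(
    *(torch.randn(2, 100, 64) for _ in range(3)),   # image tokens T_i
    *(torch.randn(2, 7, 64) for _ in range(3)))     # text tokens T_t
```

Because a single attention map spans both token sets, neither stream can ignore the other, which is the mechanism behind the "deep fusion" claim.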
4. Dynamic Gated Residual Connections
Following the attention and MLP operations, we employ a gated residual connection that modulates each update to the token stream \(T_{in}\). The gating vector \(\alpha\), derived from \(C_{global}\), acts as a dynamic, learned filter that selectively emphasizes or suppresses each branch's contribution. This prevents strong geometric priors (e.g., a dense sketch) from overpowering subtle semantic cues (e.g., text descriptors):
$$ T_{out} = T_{in} + \alpha \odot F(\text{AdaLN}(T_{in}, \gamma, \beta)) $$
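A minimal sketch of the gate, assuming \(\alpha\) is a per-channel vector broadcast over the token sequence:

```python
import torch

def gated_residual(t_in, alpha, branch_out):
    """T_out = T_in + alpha ⊙ F(AdaLN(T_in, gamma, beta)).
    t_in:       (batch, tokens, dim) residual stream
    alpha:      (batch, dim) gate derived from C_global
    branch_out: (batch, tokens, dim) attention or MLP output F(...)"""
    return t_in + alpha.unsqueeze(1) * branch_out

x = torch.randn(2, 16, 64)
alpha = torch.zeros(2, 64)  # fully closed gate: the branch contributes nothing
y = gated_residual(x, alpha, torch.randn(2, 16, 64))
```

With \(\alpha = 0\) the block is exactly the identity, which is how the network can learn to mute an overpowering conditioning branch on a per-layer basis.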
5. VLM-Powered Data Enrichment & Optimization
To overcome the bottleneck of semantically shallow annotations in existing face datasets, we build a robust annotation pipeline on the InternVL3 vision-language model and the Qwen3 LLM, yielding 1M high-quality, descriptive captions. The entire framework operates in the compressed latent space of the 16-channel FLUX VAE and natively supports optimization via both DDPM (Min-SNR) and Rectified Flow Matching (RFM) objectives.
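The RFM objective mentioned above can be sketched as a standard rectified-flow velocity-matching loss (the function signature and dummy model here are illustrative assumptions, not the paper's training code):

```python
import torch

def rfm_loss(model, x0, cond):
    """Rectified flow matching: interpolate linearly between noise and data,
    and train the model to predict the constant velocity (x0 - noise)."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)
    tb = t.view(-1, *([1] * (x0.dim() - 1)))   # broadcast t over latent dims
    x_t = (1 - tb) * noise + tb * x0           # straight path from noise to data
    v_target = x0 - noise                      # constant velocity along the path
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()

x0 = torch.randn(4, 16, 8, 8)                  # e.g., 16-channel VAE latents
dummy = lambda x, t, c: torch.zeros_like(x)    # stand-in for the DiT
loss = rfm_loss(dummy, x0, cond=None)
```

Because the interpolation path is a straight line, the target velocity is timestep-independent, which is the property that makes rectified-flow sampling efficient with few steps.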