VoxMorph is a zero-shot framework that overcomes the limitations of prior VIM models. We leverage a dual-encoder architecture to disentangle voice into Prosody (speaking style) and Timbre (vocal identity), then fuse the source speakers' embeddings using spherical linear interpolation (Slerp) to guide a multi-stage synthesis pipeline.
1. Disentangled Vocal Feature Extraction
From a short audio sample (\(\geq\)5s) of each source speaker \(i\), we extract (see the sketch after this list):
- Prosody Embedding (\(\mathbf{e}^{P}_i\)): Captures high-level speaking style (rhythm, pitch).
- Timbre Embedding (\(\mathbf{e}^{T}_i\)): Encodes core biometric identity (vocal tract texture, formants).
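As a minimal sketch of this step, the toy `DualEncoder` below pools simple MLP features over a mel spectrogram; the class, its layer sizes, and the mel-input interface are illustrative assumptions, not the paper's actual encoder architectures.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Toy stand-in for the dual encoder: two independent branches that map
    a mel spectrogram to a prosody vector and a timbre vector. The real
    architectures are not specified in this sketch."""

    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.prosody = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.timbre = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, mel: torch.Tensor):
        # mel: (frames, n_mels) features from a >= 5 s reference sample.
        e_p = self.prosody(mel).mean(dim=0)  # e_i^P: speaking-style vector
        e_t = self.timbre(mel).mean(dim=0)   # e_i^T: vocal-identity vector
        return e_p, e_t
```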
2. Slerp Interpolation
Since the embeddings reside on a high-dimensional hypersphere, we apply Slerp to the prosody and timbre streams independently. Slerp preserves vector magnitude and traverses the geodesic arc between the source embeddings, minimizing audio artifacts:
\[
\mathbf{e}^{X}_{\alpha} = \frac{\sin\!\big((1-\alpha)\,\Omega\big)}{\sin\Omega}\,\mathbf{e}^{X}_{1} + \frac{\sin(\alpha\,\Omega)}{\sin\Omega}\,\mathbf{e}^{X}_{2},
\]
where \(X \in \{P,T\}\), \(\Omega\) is the angle between source embeddings, and \(\alpha \in [0,1]\) is the morphing factor.
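A minimal NumPy sketch of this interpolation is shown below; the `slerp` function name and the epsilon guard for near-parallel sources are illustrative choices, not the paper's implementation.

```python
import numpy as np

def slerp(e1: np.ndarray, e2: np.ndarray, alpha: float, eps: float = 1e-7) -> np.ndarray:
    """Spherical linear interpolation between two source embeddings.

    Applied once per stream: X = P (prosody) and X = T (timbre).
    Assumes the embeddings lie on (or near) a hypersphere of fixed radius.
    """
    # Omega: angle between the source embeddings.
    cos_omega = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < eps:
        # Nearly parallel sources: Slerp degenerates, fall back to lerp.
        return (1.0 - alpha) * e1 + alpha * e2
    # Geodesic arc: sin((1-a)*Omega)/sin(Omega) * e1 + sin(a*Omega)/sin(Omega) * e2
    return (np.sin((1.0 - alpha) * omega) * e1 + np.sin(alpha * omega) * e2) / np.sin(omega)
```

For example, `slerp(e_p_1, e_p_2, 0.5)` yields the midpoint prosody embedding \(\mathbf{e}^{P}_{0.5}\) along the great-circle arc between the two speakers.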
3. Multi-Stage Synthesis
The fused embeddings guide a three-stage pipeline to generate the morphed waveform \(W_\alpha\) (a data-flow sketch follows the list):
- Acoustic Token Generation: An autoregressive LM generates discrete tokens conditioned on \(\mathbf{e}^{P}_\alpha\).
- Mel-Spectrogram Synthesis: A Conditional Flow Matching (CFM) model renders the spectrogram conditioned on \(\mathbf{e}^{T}_\alpha\).
- Waveform Synthesis: A HiFTNet vocoder converts the spectrogram to high-fidelity audio.
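The function below sketches the data flow through the three stages. The `acoustic_lm`, `cfm_model`, and `vocoder` callables and their method signatures are hypothetical stand-ins for the autoregressive LM, the CFM model, and the HiFTNet vocoder; only the conditioning structure follows the description above.

```python
import torch

def synthesize(e_p_a: torch.Tensor, e_t_a: torch.Tensor,
               acoustic_lm, cfm_model, vocoder) -> torch.Tensor:
    """Three-stage synthesis of the morphed waveform W_alpha from the
    fused prosody embedding e_alpha^P and timbre embedding e_alpha^T."""
    # Stage 1: the autoregressive LM generates discrete acoustic tokens,
    # conditioned on the fused prosody embedding.
    tokens = acoustic_lm.generate(prosody=e_p_a)

    # Stage 2: conditional flow matching renders a mel spectrogram from
    # the tokens, conditioned on the fused timbre embedding.
    mel = cfm_model.sample(tokens, timbre=e_t_a)

    # Stage 3: the neural vocoder converts the spectrogram to audio.
    return vocoder(mel)  # W_alpha
```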