VOXMORPH

Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings

ICASSP 2026
University of North Texas
[Figure 1: VoxMorph architecture]
Figure 1: Architectural overview of the VoxMorph framework. The process consists of three core stages: (1) Extraction of disentangled prosody and timbre embeddings; (2) Interpolation via Spherical Linear Interpolation (Slerp); and (3) Synthesis via an autoregressive language model and Conditional Flow Matching (CFM) network.

Abstract

Morphing attacks threaten biometric security by creating synthetic samples that can impersonate multiple individuals. While extensively studied for face recognition, this vulnerability remains largely unexplored for voice biometrics. The only prior work on voice morphing is computationally expensive, does not scale, and is restricted to acoustically similar identity pairs. We propose VoxMorph, a novel zero-shot framework that generates high-fidelity voice morphs from as little as five seconds of audio per subject, without model retraining. Our approach disentangles vocal characteristics into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style and identity. These embeddings are blended via Spherical Linear Interpolation (Slerp) and synthesized through an autoregressive language model (LM) coupled with a Conditional Flow Matching (CFM) network. VoxMorph achieves state-of-the-art results, outperforming existing methods with a \(2.6\times\) improvement in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate against automatic speaker verification (ASV) at a strict 0.01% FAR security threshold.

Methodology

VoxMorph is a zero-shot framework that overcomes the limitations of prior voice identity morphing (VIM) models. We use a dual-encoder architecture to disentangle each voice into Prosody (speaking style) and Timbre (vocal identity) embeddings, fuse the two streams independently with Slerp, and use the fused embeddings to guide a multi-stage synthesis pipeline.

1. Disentangled Vocal Feature Extraction

From a short audio sample (\(\geq\) 5 s) of each subject, we extract two complementary embeddings (a minimal encoder sketch follows the list below):

  • Prosody Embedding (\(\mathbf{e}^{(P)}_i\)): Captures high-level speaking style (rhythm, pitch).
  • Timbre Embedding (\(\mathbf{e}^{(T)}_i\)): Encodes core biometric identity (vocal texture, formants).
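
The released encoders are not reproduced on this page; the sketch below is a minimal, hypothetical GE2E-style LSTM prosody encoder (the architecture favored in Table 3), mapping a log-mel spectrogram to a unit-norm style embedding. All class names and dimensions are illustrative assumptions, and a timbre encoder would expose the same interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyEncoder(nn.Module):
    """Hypothetical GE2E-style LSTM encoder (illustrative, not the released model)."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, emb_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) log-mel spectrogram of a >= 5 s sample
        _, (h, _) = self.lstm(mel)       # final hidden state summarizes the utterance
        e = self.proj(h[-1])             # (batch, emb_dim)
        return F.normalize(e, dim=-1)    # unit norm: embeddings lie on a hypersphere
```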

2. Slerp Interpolation

Since embeddings reside on a high-dimensional hypersphere, we employ Spherical Linear Interpolation (Slerp) to fuse the prosody and timbre representations independently. This preserves vector magnitude and traverses the geodesic arc, minimizing audio artifacts:

$$ \mathbf{e}^{(X)}_\alpha = \frac{\sin((1-\alpha)\Omega)}{\sin(\Omega)}\mathbf{e}^{(X)}_A + \frac{\sin(\alpha\Omega)}{\sin(\Omega)}\mathbf{e}^{(X)}_B $$

where \(X \in \{P, T\}\), \(\alpha \in [0, 1]\) is the morphing coefficient, and \(\Omega\) is the angle between the source embeddings \(\mathbf{e}^{(X)}_A\) and \(\mathbf{e}^{(X)}_B\).
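
The equation translates directly into code. Below is a minimal PyTorch sketch (the function and variable names are ours; the fallback to linear interpolation for near-parallel embeddings is a standard numerical guard, not a detail taken from the paper):

```python
import torch

def slerp(e_a: torch.Tensor, e_b: torch.Tensor, alpha: float,
          eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two 1-D embeddings."""
    # Omega is the angle between the source embeddings (measured on unit vectors).
    cos_omega = torch.dot(e_a / e_a.norm(), e_b / e_b.norm())
    omega = torch.acos(torch.clamp(cos_omega, -1.0 + eps, 1.0 - eps))
    sin_omega = torch.sin(omega)
    if sin_omega < eps:  # nearly parallel: Slerp degenerates to Lerp
        return (1.0 - alpha) * e_a + alpha * e_b
    # Weights from the equation: sin((1-a)*Omega)/sin(Omega) and sin(a*Omega)/sin(Omega).
    return (torch.sin((1.0 - alpha) * omega) / sin_omega) * e_a \
         + (torch.sin(alpha * omega) / sin_omega) * e_b

# Applied independently to each stream, e.g. a symmetric morph at alpha = 0.5:
# e_prosody_morph = slerp(e_prosody_a, e_prosody_b, 0.5)
# e_timbre_morph  = slerp(e_timbre_a, e_timbre_b, 0.5)
```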

3. Multi-Stage Synthesis

The fused embeddings guide a three-stage pipeline to generate the morphed waveform \(W_\alpha\) (a structural sketch follows the list):

  1. Acoustic Token Generation: An autoregressive LM generates discrete acoustic tokens conditioned on \(\mathbf{e}^{(P)}_\alpha\).
  2. Mel-Spectrogram Synthesis: A Conditional Flow Matching (CFM) model renders the mel-spectrogram conditioned on \(\mathbf{e}^{(T)}_\alpha\).
  3. Waveform Synthesis: A HiFTNet vocoder converts the spectrogram to high-fidelity audio.
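
The exact interfaces of the three stage models are not given on this page, so the following is purely a structural sketch of the data flow; `acoustic_lm`, `cfm_decoder`, and `hiftnet_vocoder` are hypothetical callables standing in for the autoregressive LM, the CFM model, and the HiFTNet vocoder.

```python
import torch

def synthesize_morph(text: str,
                     e_prosody_morph: torch.Tensor,
                     e_timbre_morph: torch.Tensor,
                     acoustic_lm,        # stand-in for the autoregressive LM
                     cfm_decoder,        # stand-in for the CFM mel model
                     hiftnet_vocoder):   # stand-in for the HiFTNet vocoder
    """Structural sketch of the three-stage VoxMorph synthesis pipeline."""
    # Stage 1: the autoregressive LM emits discrete acoustic tokens,
    # conditioned on the morphed prosody (speaking-style) embedding.
    tokens = acoustic_lm(text, prosody=e_prosody_morph)

    # Stage 2: Conditional Flow Matching renders a mel-spectrogram from the
    # tokens, conditioned on the morphed timbre (identity) embedding.
    mel = cfm_decoder(tokens, timbre=e_timbre_morph)

    # Stage 3: the vocoder converts the mel-spectrogram into the waveform W_alpha.
    return hiftnet_vocoder(mel)
```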

Audio Demonstrations

Zero-shot morphing results using unseen speakers from LibriSpeech.

Example 1: Female-to-Female Morphing (audio clips: Source A, Morph, Source B)

Example 2: Male-to-Male Morphing (audio clips: Source A, Morph, Source B)

Quantitative Results

Table 1: Comparison with state-of-the-art audio morphing methods (↓: lower is better; ↑: higher is better). VoxMorph-v2 demonstrates superior performance in quality (FAD), intelligibility (WER), and morphing attack success (MMPMR/FMMPMR).

Method               | FAD ↓             | KLD ↓  | WER ↓ | MMPMR (%) ↑           | FMMPMR (%) ↑
                     | vs Real  vs Clone |        |       | @0.01%  @0.1%   @1%   | @0.01%  @0.1%   @1%
---------------------|-------------------|--------|-------|-----------------------|----------------------
MorphFader (Texture) |  8.96     0.25    | 0.4332 | 1.84  |   -       -       -   |   -       -       -
Vevo (Imitation)     |  9.14     0.63    | 0.1899 | 0.54  | 82.40   94.60   98.80 |  9.00   44.00   85.60
ViM (Baseline)       |  7.52     1.52    | 0.3501 | 1.06  |  2.61   29.66   89.38 |  0.00    5.61   52.10
VoxMorph-v1          |  5.03     0.24    | 0.1404 | 0.33  | 78.60   98.40  100.0  | 60.60   96.00   99.80
VoxMorph-v2          |  4.90     0.27    | 0.1385 | 0.19  | 99.80  100.0   100.0  | 67.80   97.20  100.0
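
As a reading aid for the attack metrics: MMPMR counts a morph as a success only when its ASV score against every contributing speaker exceeds the verification threshold fixed at the stated false acceptance rate (FAR), and FMMPMR extends this requirement across multiple probe attempts per speaker. A minimal MMPMR sketch under that reading (the paper's scoring script is not shown here, so the array layout and threshold selection below are assumptions):

```python
import numpy as np

def mmpmr(morph_scores: np.ndarray, threshold: float) -> float:
    """Mated Morph Presentation Match Rate.

    morph_scores: (n_morphs, n_subjects) ASV similarity scores of each morph
    against its contributing speakers. A morph counts as a successful attack
    only if it matches ALL contributing speakers above the threshold.
    """
    return float(np.mean(morph_scores.min(axis=1) > threshold))

# The operating threshold is set from impostor scores at a target FAR, e.g.:
# threshold = np.percentile(impostor_scores, 99.99)  # @ 0.01% FAR
# threshold = np.percentile(impostor_scores, 99.9)   # @ 0.1% FAR
```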

Ablation Study: Interpolation Method

Table 2: Impact of different interpolation strategies; FMMPMR is reported at the 0.01% and 0.1% FAR thresholds. Slerp preserves the geometric structure of the hypersphere, minimizing audio artifacts.

Method                          | FAD ↓ | WER ↓  | KLD ↓  | FMMPMR (%) ↑
                                |       |        |        | @0.01%  @0.1%
--------------------------------|-------|--------|--------|---------------
Linear Averaging                | 5.03  | 0.1917 | 0.1854 | 34.80   83.80
Lerp (Linear Interpolation)     | 4.96  | 0.1928 | 0.1838 | 62.60   96.80
Lerp (Prosody) + Slerp (Timbre) | 5.58  | 0.1971 | 0.1826 | 62.80   96.00
Slerp (Prosody) + Lerp (Timbre) | 5.06  | 0.1926 | 0.1813 | 62.60   97.20
Slerp (VoxMorph)                | 4.90  | 0.1920 | 0.1385 | 67.80   97.20

Ablation Study: Prosody Encoder

Table 3: Analysis of different encoder architectures. The LSTM-based GE2E architecture achieves superior FMMPMR.

Encoder     | MMPMR (%) ↑    | FMMPMR (%) ↑
            | @0.01%  @0.1%  | @0.01%  @0.1%
------------|----------------|---------------
GE2E (Ours) | 78.60   98.40  | 60.60   96.00
ECAPA-TDNN  | 72.20   98.40  | 48.20   93.00
HuBERT      | 75.40   99.00  | 48.40   91.20
Wav2Vec2    | 75.80   98.80  | 47.60   90.60

Additional Results

Table 4: Evaluation of audio morphing using an ECAPA-TDNN ASV system. MMPMR and FMMPMR scores are reported at the 0.01% and 0.1% FAR thresholds.

Method | MMPMR (%) ↑    | FMMPMR (%) ↑
       | @0.01%  @0.1%  | @0.01%  @0.1%
-------|----------------|---------------
ViM    | 50.60   69.00  | 14.80   27.80
Ours   | 62.60   99.80  | 76.60   79.20

Table 5: Analysis of selected best-similarity speaker pairs. MMPMR and FMMPMR scores are reported at the 0.01% and 0.1% FAR thresholds.

Method | MMPMR (%) ↑      | FMMPMR (%) ↑
       | @0.01%   @0.1%   | @0.01%  @0.1%
-------|------------------|----------------
ViM    |  75.00    84.00  | 40.00    53.00
Ours   | 100.00   100.00  | 89.13   100.00


Table 6: Evaluation on the VoxCeleb dataset. MMPMR and FMMPMR scores are reported at the 0.01%, 0.1%, and 1% FAR thresholds.

Method | MMPMR (%) ↑             | FMMPMR (%) ↑
       | @0.01%  @0.1%   @1%     | @0.01%  @0.1%   @1%
-------|-------------------------|----------------------
ViM    |  1.42   29.01   86.41   |  0.00    6.69   49.49
Ours   | 97.16  100.00  100.00   | 59.23   97.57  100.00

Acknowledgments

This research was supported by the University of North Texas. We thank the open-source community for foundational tools.

  • Resemble AI & CosyVoice
  • Llama Community

BibTeX

@inproceedings{krishnamurthy2026voxmorph,
  title={VoxMorph: Scalable Zero-Shot Voice Identity Morphing via Disentangled Embeddings},
  author={Krishnamurthy, Bharath and Rattani, Ajita},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}

Website template adapted from Nerfies.
Copyright © 2026 University of North Texas.