VoxMorph is a zero-shot framework that overcomes the limitations of prior VIM models. We leverage a dual-encoder architecture to disentangle voice into Prosody (speaking style) and Timbre (vocal identity), then fuse the source speakers' embeddings using spherical linear interpolation (Slerp) to guide a multi-stage synthesis pipeline.
1. Disentangled Vocal Feature Extraction
From a short audio sample (\(\geq\)5s) of each source speaker \(i\), we extract (see the sketch after this list):
- Prosody Embedding (\(\mathbf{e}^{P}_i\)): Captures high-level speaking style (rhythm, pitch).
- Timbre Embedding (\(\mathbf{e}^{T}_i\)): Encodes core biometric identity (vocal tract texture, formants).
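As a minimal sketch of this step, the toy `DualEncoder` below pools simple MLP features over a mel spectrogram; the class, its layer sizes, and the mel-input interface are illustrative assumptions, not the paper's actual encoder architectures.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Toy stand-in for the dual encoder: two independent branches that map
    a mel spectrogram to a prosody vector and a timbre vector. The real
    architectures are not specified in this sketch."""

    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.prosody = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.timbre = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, mel: torch.Tensor):
        # mel: (frames, n_mels) features from a >= 5 s reference sample.
        e_p = self.prosody(mel).mean(dim=0)  # e_i^P: speaking-style vector
        e_t = self.timbre(mel).mean(dim=0)   # e_i^T: vocal-identity vector
        return e_p, e_t
```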
2. Slerp Interpolation
Since the embeddings reside on a high-dimensional hypersphere, we apply Slerp to the prosody and timbre streams independently. Slerp preserves vector magnitude and traverses the geodesic arc between the source embeddings, minimizing audio artifacts:
\[
\mathbf{e}^{X}_{\alpha} = \frac{\sin\!\big((1-\alpha)\,\Omega\big)}{\sin\Omega}\,\mathbf{e}^{X}_{1} + \frac{\sin(\alpha\,\Omega)}{\sin\Omega}\,\mathbf{e}^{X}_{2},
\]
where \(X \in \{P,T\}\), \(\Omega\) is the angle between source embeddings, and \(\alpha \in [0,1]\) is the morphing factor.
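A minimal NumPy sketch of this interpolation is shown below; the `slerp` function name and the epsilon guard for near-parallel sources are illustrative choices, not the paper's implementation.

```python
import numpy as np

def slerp(e1: np.ndarray, e2: np.ndarray, alpha: float, eps: float = 1e-7) -> np.ndarray:
    """Spherical linear interpolation between two source embeddings.

    Applied once per stream: X = P (prosody) and X = T (timbre).
    Assumes the embeddings lie on (or near) a hypersphere of fixed radius.
    """
    # Omega: angle between the source embeddings.
    cos_omega = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < eps:
        # Nearly parallel sources: Slerp degenerates, fall back to lerp.
        return (1.0 - alpha) * e1 + alpha * e2
    # Geodesic arc: sin((1-a)*Omega)/sin(Omega) * e1 + sin(a*Omega)/sin(Omega) * e2
    return (np.sin((1.0 - alpha) * omega) * e1 + np.sin(alpha * omega) * e2) / np.sin(omega)
```

For example, `slerp(e_p_1, e_p_2, 0.5)` yields the midpoint prosody embedding \(\mathbf{e}^{P}_{0.5}\) along the great-circle arc between the two speakers.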
3. Multi-Stage Synthesis
The fused embeddings guide a three-stage pipeline to generate the morphed waveform \(W_\alpha\) (a data-flow sketch follows the list):
- Acoustic Token Generation: An autoregressive LM generates discrete tokens conditioned on \(\mathbf{e}^{P}_\alpha\).
- Mel-Spectrogram Synthesis: A Conditional Flow Matching (CFM) model renders the spectrogram conditioned on \(\mathbf{e}^{T}_\alpha\).
- Waveform Synthesis: A HiFTNet vocoder converts the spectrogram to high-fidelity audio.
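The function below sketches the data flow through the three stages. The `acoustic_lm`, `cfm_model`, and `vocoder` callables and their method signatures are hypothetical stand-ins for the autoregressive LM, the CFM model, and the HiFTNet vocoder; only the conditioning structure follows the description above.

```python
import torch

def synthesize(e_p_a: torch.Tensor, e_t_a: torch.Tensor,
               acoustic_lm, cfm_model, vocoder) -> torch.Tensor:
    """Three-stage synthesis of the morphed waveform W_alpha from the
    fused prosody embedding e_alpha^P and timbre embedding e_alpha^T."""
    # Stage 1: the autoregressive LM generates discrete acoustic tokens,
    # conditioned on the fused prosody embedding.
    tokens = acoustic_lm.generate(prosody=e_p_a)

    # Stage 2: conditional flow matching renders a mel spectrogram from
    # the tokens, conditioned on the fused timbre embedding.
    mel = cfm_model.sample(tokens, timbre=e_t_a)

    # Stage 3: the neural vocoder converts the spectrogram to audio.
    return vocoder(mel)  # W_alpha
```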