VC methods are selected based on the following criteria:
The target can be a single speaker from the pool or an average of many speakers.
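The two target options can be illustrated with a minimal sketch. It assumes each pool speaker is represented by a fixed-length vector (for example a speaker embedding); the source does not state the representation, and `pick_target`, `pool`, and the speaker IDs below are hypothetical names used only for illustration.

```python
import random

def pick_target(pool, mode="single", seed=None):
    """Return a target representation from a pool of speakers.

    pool: dict mapping speaker ID -> list of floats (assumed representation).
    mode: "single" picks one speaker at random, "average" takes the
          element-wise mean over all speakers in the pool.
    """
    rng = random.Random(seed)
    speaker_ids = sorted(pool)
    if mode == "single":
        return list(pool[rng.choice(speaker_ids)])
    if mode == "average":
        dim = len(pool[speaker_ids[0]])
        count = len(speaker_ids)
        return [sum(pool[s][d] for s in speaker_ids) / count for d in range(dim)]
    raise ValueError(f"unknown mode: {mode}")

# Toy pool with made-up 2-dimensional vectors.
pool = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0], "spk_c": [2.0, 2.0]}
single_target = pick_target(pool, mode="single", seed=0)
average_target = pick_target(pool, mode="average")
```

Averaging over many speakers yields a target that corresponds to no real individual, whereas the single-speaker option maps the source voice onto one actual pool member.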
All data subsets are derived from the LibriSpeech corpus.
Component | Dataset | Training |
---|---|---|
VoiceMask | None | No training required ($\alpha$ and $\beta$ are selected randomly from a predefined range) |
Disentangled VC | LibriTTS 100h | Content and speaker encoders are trained end-to-end to reconstruct the speech waveform |
VTLN | LibriSpeech 460h | K-means clusters (8 centroids) and transformation parameters are learnt |
Attackers | Anonymized training data to induce "knowledge" of anonymization | |
ASR for evaluation | | End-to-End (CTC + Attention) ASR is trained |
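The VoiceMask row states only that $\alpha$ and $\beta$ are drawn at random from a predefined range. The sketch below shows one way such parameters could drive a frequency warp. The ranges are placeholders, and the bilinear and quadratic warping functions are common choices from the frequency-warping literature that are assumed here, not taken from the source.

```python
import math
import random

# Placeholder ranges (assumption): the source only says "a predefined range".
ALPHA_RANGE = (-0.2, 0.2)
BETA_RANGE = (-1.0, 1.0)

def sample_params(seed=None):
    """Draw (alpha, beta) uniformly from the placeholder ranges."""
    rng = random.Random(seed)
    return rng.uniform(*ALPHA_RANGE), rng.uniform(*BETA_RANGE)

def bilinear(omega, alpha):
    """Bilinear (all-pass) warp of a normalized frequency in [0, pi]."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

def quadratic(omega, beta):
    """Quadratic warp that leaves the endpoints 0 and pi unchanged."""
    x = omega / math.pi
    return omega + beta * (x - x * x)

def warp(omega, alpha, beta):
    """Compose the two warps (assumed order: bilinear first, then quadratic)."""
    return quadratic(bilinear(omega, alpha), beta)

alpha, beta = sample_params(seed=0)
grid = [math.pi * k / 8 for k in range(9)]      # frequencies from 0 to pi
warped = [warp(w, alpha, beta) for w in grid]   # warped frequency axis
```

Because both functions keep 0 and $\pi$ fixed and are monotonic for parameters in these ranges, the warped axis stays ordered; only the positions of the frequencies in between move, and no training data is needed.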
Statistic | Male | Female |
---|---|---|
#Speakers | 13 | 16 |
Genuine trials | 449 | 548 |
Impostor trials | 9,457 | 11,196 |
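As a quick check on how unbalanced the trial lists are, the counts in the table can be summed per gender. This is a small arithmetic sketch; the dictionary layout and variable names are illustrative only.

```python
# Trial counts copied from the table above.
trials = {
    "male": {"genuine": 449, "impostor": 9457},
    "female": {"genuine": 548, "impostor": 11196},
}

summary = {}
for gender, counts in trials.items():
    total = counts["genuine"] + counts["impostor"]
    summary[gender] = {
        "total": total,
        # Share of trials in which both utterances come from the same speaker.
        "genuine_fraction": counts["genuine"] / total,
    }

for gender, stats in summary.items():
    print(f"{gender}: {stats['total']} trials, "
          f"{100 * stats['genuine_fraction']:.2f}% genuine")
```

The totals are 9,906 male and 11,744 female trials, with genuine trials making up just under 5% of each list.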