VC methods are selected based on the following criteria:
The target can be a single speaker from the pool or an average of many speakers.
All the data subsets are derived from the LibriSpeech corpus.
| Component | Dataset | Training |
|---|---|---|
| VoiceMask | None | No training required ($\alpha$ and $\beta$ are selected randomly from a predefined range) |
| Disentangled VC | LibriTTS 100h | (Content, Speaker) encoders are trained end-to-end to reconstruct the speech waveform |
| VTLN | LibriSpeech 460h | K-means clusters (8 centroids) and transformation parameters are learnt |
| Attackers | Anonymized training data to induce "knowledge" of anonymization | |
| ASR for evaluation | | End-to-End (CTC + Attention) ASR is trained |

| | Male | Female |
|---|---|---|
| #Speakers | 13 | 16 |
| Genuine trials | 449 | 548 |
| Impostor trials | 9,457 | 11,196 |
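
The trial counts in the table above imply a strongly imbalanced verification task, which is easy to check with a few lines of arithmetic. The sketch below is only illustrative: the dictionary layout and the names `TRIALS`, `summarize`, and `SUMMARY` are assumptions introduced here, and the numbers are copied from the table.

```python
# Trial counts copied from the table above; all names here are illustrative.
TRIALS = {
    "male": {"speakers": 13, "genuine": 449, "impostor": 9457},
    "female": {"speakers": 16, "genuine": 548, "impostor": 11196},
}


def summarize(counts):
    """Return the total number of trials and the share that are genuine."""
    total = counts["genuine"] + counts["impostor"]
    return {"total": total, "genuine_fraction": counts["genuine"] / total}


# Per-gender summary of the evaluation trial lists.
SUMMARY = {gender: summarize(counts) for gender, counts in TRIALS.items()}

if __name__ == "__main__":
    for gender, stats in SUMMARY.items():
        print(f"{gender}: {stats['total']} trials, "
              f"{stats['genuine_fraction']:.1%} genuine")
```

For both genders, fewer than 5% of the trials are genuine, so threshold-independent metrics are a more natural fit than raw accuracy when comparing systems on these lists.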