AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss - Audio Demo

Kaizhi Qian*, Yang Zhang*, Shiyu Chang, Xuesong Yang, Mark Hasegawa-Johnson

Code

Our code is released here.

Traditional Many-to-Many Conversion

(Section 5.2 in the paper)

Traditional many-to-many conversion performs voice conversion from and to speakers that are present in the training set. Four systems are implemented:

  • AutoVC - the proposed autoencoder-based conversion algorithm
  • AutoVC-one-hot - the proposed autoencoder-based conversion algorithm conditioned on one-hot speaker embeddings
  • StarGAN-VC - a voice conversion system that adopts the StarGAN paradigm.
  • Chou et al. - a voice conversion system that combines an autoencoder with a GAN and a speaker classifier.
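As a rough illustration of the autoencoder-based conversion idea behind AutoVC (not the paper's actual architecture; the dimensions and the randomly initialised weights below are stand-ins for trained networks), the conversion step passes the source speech through a narrow content bottleneck and decodes it conditioned on the target speaker's embedding:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not the paper's actual sizes.
N_MELS, T = 80, 128       # mel-spectrogram bins, frames
BOTTLENECK = 16           # narrow content code: the key to AutoVC
SPK_DIM = 32              # speaker embedding size

def content_encoder(mel, W):
    """Project each frame to a narrow bottleneck that squeezes out speaker info."""
    return np.tanh(mel.T @ W)            # (T, BOTTLENECK)

def decoder(content, spk_emb, V):
    """Reconstruct mel frames from the content code plus a speaker embedding."""
    cond = np.concatenate(
        [content, np.tile(spk_emb, (content.shape[0], 1))], axis=1)
    return (cond @ V).T                  # (N_MELS, T)

# Random weights stand in for trained parameters.
W = rng.normal(size=(N_MELS, BOTTLENECK)) * 0.1
V = rng.normal(size=(BOTTLENECK + SPK_DIM, N_MELS)) * 0.1

source_mel = rng.normal(size=(N_MELS, T))   # source utterance (placeholder)
target_spk = rng.normal(size=SPK_DIM)       # target speaker embedding (placeholder)

# Conversion: encode the source utterance, decode with the target speaker.
converted = decoder(content_encoder(source_mel, W), target_spk, V)
print(converted.shape)   # (80, 128)
```

Because the model only ever minimizes a reconstruction (autoencoder) loss, conversion amounts to swapping the speaker embedding at decode time.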

Below are a few audio demos.

Source Speaker / Speech    Target Speaker / Speech    Conversion
p270 (Male)                p256 (Male)                AutoVC, AutoVC-one-hot, StarGAN-VC, Chou et al.
p270 (Male)                p228 (Female)              AutoVC, AutoVC-one-hot, StarGAN-VC, Chou et al.
p225 (Female)              p256 (Male)                AutoVC, AutoVC-one-hot, StarGAN-VC, Chou et al.
p225 (Female)              p228 (Female)              AutoVC, AutoVC-one-hot, StarGAN-VC, Chou et al.



Zero-Shot Voice Conversion

(Section 5.3 in the paper)

Zero-shot voice conversion performs conversion from and/or to speakers that are unseen during training, given only 20 seconds of audio from each unseen speaker. Only AutoVC is implemented for zero-shot voice conversion.
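The 20 seconds of audio are used only to compute an embedding for the unseen speaker. A minimal sketch of one common recipe, average-pooling frame-level features into a unit-norm utterance embedding (the projection weights here are random stand-ins, and the hop length is an assumption, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(1)

N_MELS, SPK_DIM = 80, 32

def utterance_embedding(mel, W):
    """Frame-wise projection followed by average pooling (d-vector style)."""
    frames = np.tanh(mel.T @ W)          # (T, SPK_DIM) frame embeddings
    emb = frames.mean(axis=0)            # pool over time
    return emb / np.linalg.norm(emb)     # unit-norm speaker embedding

# Random weights stand in for a trained speaker encoder.
W = rng.normal(size=(N_MELS, SPK_DIM)) * 0.1

# ~20 s of audio at an assumed 12.5 ms hop is roughly 1600 mel frames.
mel_20s = rng.normal(size=(N_MELS, 1600))
spk_emb = utterance_embedding(mel_20s, W)
print(spk_emb.shape)   # (32,)
```

Once this embedding is computed, the unseen speaker is handled exactly like a seen one: the decoder is simply conditioned on the new embedding.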

The following table shows conversions to seen speakers.

Target Speakers / Speech: P227 (Seen male), P225 (Seen female)
Source Speakers / Speech: P227 (Seen male), P225 (Seen female), P252 (Unseen male), P261 (Unseen female)

The following table shows conversions to unseen speakers.

Target Speakers / Speech: P252 (Unseen male), P261 (Unseen female)
Source Speakers / Speech: P227 (Seen male), P225 (Seen female), P252 (Unseen male), P261 (Unseen female)