Voice conversion systems convert a spoken utterance from one person into the voice of another. It can be seen as asking a model to repeat a recorded sentence in someone else's voice. This type of technology gets a lot of attention from the entertainment industry because it unlocks interesting possibilities. For instance, it can bring back beloved characters for whom the original actor is no longer available (e.g. young Luke Skywalker in The Mandalorian [1]). It allows characters to have the same voice in all languages. It also makes it possible for players to generate content for video games by recording dialog lines and converting them to the voice of a game character.
Figure 1: Voice conversion changes the voice of a spoken utterance without altering its textual and prosodic content.
Here are some examples of converted voices using our voice conversion system:
A voice conversion system must encode the phonetic content and the prosody of a spoken line while discarding the speaker identity information, i.e., how the person's voice sounds. The phonetic content is the information about which words were said, while the prosody refers to how they were said; some describe it as the melody of the sentence. Once a source speech line is encoded, it is reconstructed in a different voice.
This blog discusses the representation of speech in a voice conversion model. More specifically, we explain the intuition behind soft speech units that improve upon traditional discrete speech units. Soft speech units were introduced in:
B. van Niekerk, M.-A. Carbonneau, J. Zaïdi, M. Baas, H. Seuté and H. Kamper, "A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022
There are several ways to represent phonetic and prosodic information. In this blog, we focus on learned speech units. Speech units are symbols that represent typical speech patterns over a short period of time. In other words, each short speech segment, or frame, is summarized by a symbol from a predefined dictionary. The dictionary of symbols is not specified by a human, but rather discovered from data in a learning process.
Figure 2: Each speech frame is represented by a symbol from a learned dictionary. These symbols are speech units.
Here are the steps to learn a dictionary of speech units:
Collect a large amount of speech data
Split all the data into short frames and encode them in a latent space using a feature extractor (e.g. MFCC, HuBERT or CPC)
Perform clustering on all encoded frames using your favorite algorithm (e.g. k-means)
The centers of the discovered clusters are the speech units.
Figure 3: How to learn a dictionary of discrete speech units.
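The steps above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's pipeline: the hypothetical `extract_features` stands in for a real feature extractor (MFCC, HuBERT or CPC), and a small hand-rolled k-means replaces a production clustering library.

```python
import numpy as np

def extract_features(frames):
    # Stand-in for a real feature extractor (MFCC, HuBERT, CPC, ...):
    # here each frame is already treated as a feature vector.
    return np.asarray(frames, dtype=np.float64)

def learn_speech_units(features, n_units=2, n_iters=50):
    """Lloyd's k-means: the learned cluster centers are the speech units."""
    # Farthest-point initialization keeps this toy example stable.
    units = [features[0]]
    for _ in range(n_units - 1):
        dists = np.min([np.linalg.norm(features - u, axis=1) for u in units], axis=0)
        units.append(features[dists.argmax()])
    units = np.array(units)
    for _ in range(n_iters):
        # Assign every frame to its closest unit, then move each unit
        # to the mean of the frames assigned to it.
        labels = np.linalg.norm(features[:, None] - units[None, :], axis=-1).argmin(axis=1)
        for k in range(n_units):
            if np.any(labels == k):
                units[k] = features[labels == k].mean(axis=0)
    return units

# Toy "corpus": 200 frames drawn around two acoustic patterns.
rng = np.random.default_rng(1)
frames = np.concatenate([rng.normal(0.0, 0.1, size=(100, 8)),
                         rng.normal(5.0, 0.1, size=(100, 8))])
units = learn_speech_units(extract_features(frames), n_units=2)
print(units.shape)  # (2, 8)
```

With real speech, the corpus would contain millions of frames and the dictionary on the order of a hundred units rather than two.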
The process is similar when encoding a speech line: first split the line into short frames, extract features, and assign each frame to the closest cluster center. This means that a raw audio signal is translated into a series of cluster IDs. This process tends to discard speaker identity information while preserving phonetic and prosodic information.
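Encoding then reduces to a nearest-centroid lookup. A minimal numpy sketch, where the `units` array stands for the learned cluster centers:

```python
import numpy as np

def encode_discrete(features, units):
    """Map each frame's features to the ID of the closest speech unit."""
    dists = np.linalg.norm(features[:, None] - units[None, :], axis=-1)
    return dists.argmin(axis=1)  # one cluster ID per frame

# Toy dictionary of 3 units in a 2-D feature space.
units = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
features = np.array([[0.1, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 0.1]])
print(encode_discrete(features, units))  # [0 1 2 0]
```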
Soft Speech Units
However, by hard-assigning every bit of speech to a cluster center, some phonetic information is also lost: it is impossible to tell how close the frame fell to other clusters, which can lead to mispronunciations. Take the word fin, for example. Ambiguous frames in the fricative /f/ may be assigned to incorrect nearby units, resulting in the mispronunciation thin.
This is what motivated us to propose a new way of encoding speech units that models assignment ambiguity. We call them soft speech units, as opposed to the hard speech unit assignment scheme we just described. By modeling ambiguity in discrete unit assignments, we retain more content information and, as a result, correct mispronunciations like fin/thin. This idea is inspired by soft-assignment in computer vision, which has been shown to improve performance on classification tasks [2].
It is tempting to directly use the output of the feature extractor, without clustering, as speech units. However, previous work [3, 4] and our experiments show that these representations contain large amounts of speaker information, rendering them unsuitable for voice conversion. Instead, we train a classifier that predicts which discrete unit a frame will be assigned to. Intuitively, the classifier hesitates between possible label candidates when the assignment is ambiguous.
Figure 4: How to obtain a soft speech unit encoder by training a discrete unit classifier.
The soft unit encoder is trained with supervision. First, a data set is created by extracting discrete speech units (as described earlier) from speech recordings. The discrete unit IDs are the target labels associated with the speech frames. A unit classifier is then trained to predict the label of each frame. Typically, the classifier is a simple neural network placed on top of the feature extractor. The unit classifier produces a distribution over all discrete speech units. Inside the classifier, this distribution is parameterized by the vectors that will be used as soft speech units.
The training procedure is illustrated in Figures 4 and 5. Given a speech frame and its discrete unit label, the classifier's weights are updated as follows. First, the feature extractor processes the raw speech frame. Then, a linear layer projects its output to produce a soft speech unit. The soft unit parameterizes a distribution over the dictionary of discrete speech units: we measure the cosine similarity between the soft unit and each discrete unit in the dictionary, where the discrete units are represented by learnable embeddings. A soft-max operation applied over the similarity measures yields a multinomial distribution. This allows us to compute the cross-entropy between the output distribution and a one-hot encoding of the discrete speech unit label, as is often done when training a classifier. Finally, we minimize the average cross-entropy to update the entire network, including the feature extractor. In the final voice conversion system, only the soft unit encoder is kept.
Figure 5: Training procedure for the unit classifier.
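The forward pass of the unit classifier can be sketched in numpy. Shapes and names here are illustrative only; in the actual system this head sits on top of HuBERT, the projection and embeddings are learnable, and everything is trained end-to-end by backpropagation rather than computed with fixed random weights.

```python
import numpy as np

def soft_unit_head(feature, proj, embeddings):
    """Map one frame feature to a soft unit and a distribution over
    the dictionary of discrete units."""
    soft_unit = proj @ feature  # linear projection of the extractor output
    # Cosine similarity between the soft unit and each discrete-unit embedding.
    sims = embeddings @ soft_unit / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(soft_unit) + 1e-8)
    # Soft-max turns the similarities into a multinomial distribution.
    exp = np.exp(sims - sims.max())
    probs = exp / exp.sum()
    return soft_unit, probs

def cross_entropy(probs, target_id):
    """Training loss against the one-hot discrete-unit label."""
    return -np.log(probs[target_id] + 1e-12)

rng = np.random.default_rng(0)
feature = rng.normal(size=16)           # output of the feature extractor
proj = rng.normal(size=(8, 16))         # learnable linear layer
embeddings = rng.normal(size=(100, 8))  # learnable discrete-unit embeddings
soft_unit, probs = soft_unit_head(feature, proj, embeddings)
loss = cross_entropy(probs, target_id=42)
print(probs.shape, round(float(probs.sum()), 6))
```

At conversion time only `soft_unit` is used; the distribution and loss exist purely to shape it during training.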
Building a Voice Conversion System using Soft Speech Units
Here we describe how we built our best-performing voice conversion system in the paper. After obtaining the soft speech units from raw source speech, we produce the converted speech file using models typically found in text-to-speech (TTS) systems.
Figure 6: Overview of a complete voice conversion system using soft speech units.
We use HuBERT [5] as a feature extractor. HuBERT is a transformer-based self-supervised model that was proposed as a general-purpose speech feature extractor. We use a pretrained model provided by the original authors: HuBERT-Base pretrained on LibriSpeech-960 [6].
Discrete Unit Discovery
To learn discrete speech units, we apply k-means clustering to intermediate representations from HuBERT. We use the seventh transformer layer because the resulting acoustic units perform well on phone discrimination tests [3, 4, 5]. We discover 100 clusters and estimate their means on a subset of 100 speakers from the LibriSpeech data set.
Soft Unit Encoder
We extract the discrete speech units of all speech samples in LibriSpeech. These discrete speech units serve as targets when training the unit classifier. The unit classifier is a simple linear layer added on top of the feature extractor. We fine-tune the whole model, including HuBERT on LibriSpeech-960 to predict the corresponding discrete speech units using the procedure described above.
The soft speech units are fed to a model largely inspired by text-to-speech systems. This model is trained only on the target speaker. In the paper our TTS-like model is broken down into two parts: acoustic model and vocoder.
We adapt the acoustic model for voice conversion by changing its input to speech units rather than graphemes or phonemes. It translates these speech units into a spectrogram. Then, the vocoder converts the predicted spectrogram into audio samples.
Our acoustic model is based on Tacotron 2 [7] and consists of an encoder and an autoregressive decoder. The attention module of the Tacotron 2 model is not required since speech units and spectrogram frames are time-aligned. For the vocoder, we use HiFi-GAN [8] in its original form, which is designed for spectrogram inputs.
In the paper we used an acoustic model and a vocoder as described above, but we recommend using only a vocoder trained directly on speech units.
Experiments and Results
We conducted comparative experiments in which we replaced the discrete speech units used in state-of-the-art models with our proposed soft speech units. We used CPC and HuBERT pre-trained models as feature extractors and measured intelligibility, speaker similarity and naturalness for discrete and soft units as well as for raw features (i.e. features not translated into speech units). Additionally, for reference, we measured the performance of two baseline models as well as the ground truth recordings.
Follow this link to listen to more examples from the experiments: https://ubisoft-laforge.github.io/speech/soft-vc/
To assess the intelligibility of the converted speech, we measure word error rate (WER) and phoneme error rate (PER) using an automatic speech recognition system. Lower error rates correspond to more intelligible speech since it shows that the original words are still recognizable after conversion.
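Concretely, WER is the word-level edit distance between the ASR transcript of the converted speech and the reference text, normalized by the reference length. A minimal sketch (real evaluations would use a toolkit, but the metric itself is just Levenshtein distance over words):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution (fin -> thin) in a four-word reference.
print(word_error_rate("the fin is sharp", "the thin is sharp"))  # 0.25
```

PER is computed identically, only over phoneme sequences instead of words.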
We measure speaker-similarity using a trained speaker-verification system. We report equal-error rate (EER), which approaches 50% when the verification system cannot distinguish between converted and genuine target-speaker utterances (indicating high speaker similarity).
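The EER is the operating point where the false-acceptance and false-rejection rates of the verification system cross. A simple numpy sketch over verification scores (higher score meaning more likely the target speaker; the score values below are made up for illustration):

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep thresholds; return the rate where FAR and FRR are closest."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_far, best_frr, best_gap = 1.0, 0.0, float("inf")
    for t in thresholds:
        far = float(np.mean(impostor_scores >= t))  # impostors accepted
        frr = float(np.mean(genuine_scores < t))    # genuine rejected
        if abs(far - frr) < best_gap:
            best_far, best_frr, best_gap = far, frr, abs(far - frr)
    return (best_far + best_frr) / 2

genuine = np.array([0.9, 0.8, 0.7, 0.6])    # real target-speaker trials
impostor = np.array([0.4, 0.3, 0.65, 0.1])  # converted / other-speaker trials
print(equal_error_rate(genuine, impostor))  # 0.25
```

In our setting the "impostor" trials are converted utterances, so an EER near 0.5 means the verifier cannot tell converted speech from the real target speaker.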
For naturalness, we conduct a subjective evaluation based on mean opinion scores (MOS). Evaluators rate the naturalness of the speech samples on a five-point scale (from 1=bad to 5=excellent). We report MOS and their 95% confidence intervals.
Our results indicate that replacing discrete speech units with soft speech units increases both intelligibility and naturalness. The HuBERT-based model even obtained speech recognition error rates comparable to real recorded speech. This comes at the expense of leaking some speaker identity information from the source speaker into the speech units, which translates into a lower EER from the speaker verification system. However, EERs in this range correspond to differences that are difficult for humans to perceive.
Table 1: Experimental results from our comparative study
Figure 7 shows where the pronunciation improvements come from in the HuBERT-based model. It compiles the error rate per phoneme. Most of the improvements come from the consonants, particularly from the affricate /ʧ/ (__ch__in), the fricative /ʒ/ (__j__oke), and the velar stops /k/ (__k__id) and /ɡ/ (__g__o).
Figure 7: Breakdown of PER (%) per phoneme for HuBERT-Discrete and HuBERT-Soft.
To summarize, the results show that soft assignments capture more linguistic content, improving intelligibility compared to discrete units. This comes at the price of identity leakage from the source speaker, which might not be perceptible.
We proposed soft speech units to improve unsupervised voice conversion. We showed that soft units are a good middle ground between discrete and continuous features: they accurately represent linguistic content while still discarding speaker information. Our evaluations showed that soft units improve intelligibility and naturalness. Future work will investigate soft speech units for any-to-any voice conversion.
[1] "Making of Season 2 Finale," Disney Gallery: The Mandalorian, 2021. https://dmedmedia.disney.com/disney-plus/disney-gallery-the-mandalorian/making-of-the-season-2-finale
[2] J. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek, "Visual Word Ambiguity," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[3] B. van Niekerk, L. Nortje, M. Baas, and H. Kamper, "Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing," INTERSPEECH, 2021.
[4] S.-w. Yang et al., "SUPERB: Speech Processing Universal Performance Benchmark," INTERSPEECH, 2021.
[5] W.-N. Hsu et al., "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
[6] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR Corpus Based on Public Domain Audio Books," ICASSP, 2015.
[7] J. Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," ICASSP, 2018.
[8] J. Kong et al., "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," NeurIPS, 2020.