We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based text-to-speech (TTS) model using minimal untranscribed data. To achieve this, we use the self-supervised unit representation as a pseudo transcript and integrate a unit encoder into the pre-trained TTS model. We train the unit encoder to provide speech content to the diffusion-based decoder and then fine-tune the decoder to adapt to the reference speaker using a single <unit, speech> pair. UnitSpeech performs speech synthesis tasks such as TTS and voice conversion (VC) in a personalized manner without requiring model re-training for each task. UnitSpeech achieves results comparable or superior to previous baselines on personalized TTS and any-to-any VC tasks. Our model also adapts well to real-world data and to other tasks that take a unit sequence as input.
Real-world Data
Adaptive Text-to-Speech
Transcript: This audio was generated by UnitSpeech for Barack Obama.
Transcript: We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes the pre-trained diffusion-based text-to-speech (TTS) model using minimal untranscribed data.
Note: the source transcript below was not used to generate the samples.
Transcript: Einstein's theory of relativity is e equals m c squared.
Note: the source transcript below was not used to generate the samples.
Transcript: We are taking steps to make today's commencement feel as authentic as possible.
Note: the source transcript below was not used to generate the samples.
Transcript: And always have the courage to be yourself. Most importantly, you have to do what you love.
Transcript: As a matter of fact, the drawn curtain disclosed nothing but three or four suits of clothes hanging from a line of pegs.
Audio samples (columns): Reference, GT, GT Mel+HiFi-GAN, UnitSpeech, Guided-TTS 2, Guided-TTS 2 (zero-shot), YourTTS. Sampling rates (rows): 22,050 Hz, 16,000 Hz.
Transcript: There was a savory stew, smoking hot, a dish of blue peas, a bowl of sweet milk of a delicate blue tint and a blue pudding with blue plums in it.
Audio samples (columns): Reference, GT, GT Mel+HiFi-GAN, UnitSpeech, Guided-TTS 2, Guided-TTS 2 (zero-shot), YourTTS. Sampling rates (rows): 22,050 Hz, 16,000 Hz.
Transcript: Nevertheless, when the end of the summer came and the only opening facing her was the teaching of children at Miss Smith’s experiment in the Alabama swamps, it must be frankly confessed that Miss Taylor was disappointed.
Audio samples (columns): Reference, GT, GT Mel+HiFi-GAN, UnitSpeech, Guided-TTS 2, Guided-TTS 2 (zero-shot), YourTTS. Sampling rates (rows): 22,050 Hz, 16,000 Hz.
Voice Conversion (any-to-any)
Note: the source transcript below was not used to generate the samples.
Transcript: She wandered in the land of clouds thro' valleys dark, listning Dolors and lamentations: waiting oft beside the dewy grave She stood in silence, listning to the voices of the ground, Till to her own grave plot she came, and there she sat down. And heard this voice of sorrow breathed from the hollow pit.
Audio samples (columns): Reference, Source, UnitSpeech, DiffVC, BNE-PPG-VC, YourTTS. Sampling rates (rows): 22,050 Hz, 16,000 Hz.
Note: the source transcript below was not used to generate the samples.
Transcript: Instead of shoes, the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid.
Audio samples (columns): Reference, Source, UnitSpeech, DiffVC, BNE-PPG-VC, YourTTS. Sampling rates (rows): 22,050 Hz, 16,000 Hz.
Note: the source transcript below was not used to generate the samples.
Transcript: Hans pointed with his finger at a dark mass six hundred yards away, rising and falling alternately with heavy plunges.
Audio samples (columns): Reference, Source, UnitSpeech, DiffVC, BNE-PPG-VC, YourTTS. Sampling rates (rows): 22,050 Hz, 16,000 Hz.
Other Unit-based Task
Adaptive Speech Synthesis Module for Speech-to-Unit Translation (S2UT)
For this experiment, we trained UnitSpeech using 1000 clustered units commonly used in speech-to-unit translation.
Existing speech-to-unit models use unit-HiFi-GAN trained on a single speaker (LJSpeech) as a speech synthesis module when generating the target language’s speech.
By replacing the unit-HiFi-GAN with UnitSpeech and combining it with a pre-trained speech-to-unit model, we show that speech-to-speech translation can be personalized.
We conducted an experiment using the CoVoST 2 dataset.
Due to imperfections in the pre-trained speech-to-unit model, UnitSpeech may generate incorrectly translated speech.
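As a rough illustration of what a "clustered unit" sequence looks like, the sketch below assumes that units are obtained by assigning each frame-level feature vector to its nearest cluster centroid. This is a common way to derive discrete units, but the page does not spell out the procedure. The function names, the 2-D toy features, and the three toy centroids (standing in for the 1000 clusters) are hypothetical and are not taken from the UnitSpeech code.

```python
from typing import List, Sequence


def nearest_unit(frame: Sequence[float],
                 centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame`
    (squared Euclidean distance)."""
    def sq_dist(centroid: Sequence[float]) -> float:
        return sum((f - c) ** 2 for f, c in zip(frame, centroid))

    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def frames_to_units(frames: Sequence[Sequence[float]],
                    centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map every frame to a discrete unit ID (its nearest centroid index)."""
    return [nearest_unit(frame, centroids) for frame in frames]


# Toy data (assumption): 3 centroids in 2-D instead of 1000 real clusters.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 1.2)]

units = frames_to_units(frames, centroids)
print(units)  # [0, 1, 1, 2]
```

In a setup like the one described above, a sequence of such unit IDs (predicted by the speech-to-unit model) would be the input handed to the synthesis module, whether that is unit-HiFi-GAN or UnitSpeech.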
Source text: Algunos de los Oficiales actuales pertenecen a esas pasadas generaciones.
Reference translation: Some of the current Officials belong to these past generations.
Audio samples (columns): Source (ES), S2UT + UnitSpeech (EN), S2UT + unit-HiFi-GAN (EN). Sampling rates (rows): 22,050 Hz, 16,000 Hz.
Source text: Se usan complejos modelos de software, como el modelo de clima global.
Reference translation: Complex software models such as the global climate model are used.
Audio samples (columns): Source (ES), S2UT + UnitSpeech (EN), S2UT + unit-HiFi-GAN (EN). Sampling rates (rows): 22,050 Hz, 16,000 Hz.
Source text: Habita en Guinea Ecuatorial, Camerún, República Centroafricana y Gabón.
Reference translation: Living in Equatorial Guinea, Cameroon, Central African Republic and Gabon.
Audio samples (columns): Source (ES), S2UT + UnitSpeech (EN), S2UT + unit-HiFi-GAN (EN). Sampling rates (rows): 22,050 Hz, 16,000 Hz.
Citation
@misc{kim2023unitspeech,
title={UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data},
author={Heeseung Kim and Sungwon Kim and Jiheum Yeom and Sungroh Yoon},
year={2023},
eprint={2306.16083},
archivePrefix={arXiv},
primaryClass={cs.SD}
}