UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data (INTERSPEECH 2023)

Paper
Code

Authors

Heeseung Kim gmltmd789@snu.ac.kr
Sungwon Kim ksw0306@snu.ac.kr
Jiheum Yeom quilava1234@snu.ac.kr
Sungroh Yoon (Corresponding author) sryoon@snu.ac.kr

Abstract

We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based text-to-speech (TTS) model using minimal untranscribed data. To achieve this, we use the self-supervised unit representation as a pseudo transcript and integrate the unit encoder into the pre-trained TTS model. We train the unit encoder to provide speech content to the diffusion-based decoder and then fine-tune the decoder for speaker adaptation to the reference speaker using a single <unit, speech> pair. UnitSpeech performs speech synthesis tasks such as TTS and voice conversion (VC) in a personalized manner without requiring model re-training for each task. UnitSpeech achieves results comparable or superior to previous baselines on personalized TTS and any-to-any VC tasks. Our model also adapts well to real-world data and to other tasks that use a unit sequence as input.
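
The idea of using discrete units as a pseudo transcript can be illustrated with a small sketch. This is not the authors' implementation: it only shows one common way to turn frame-level features into unit IDs, namely nearest-centroid assignment against a pre-fitted k-means codebook. The function name `quantize_to_units`, the toy feature matrix (standing in for the outputs of a self-supervised speech model), and the toy centroids are all assumptions made for illustration.

```python
import numpy as np


def quantize_to_units(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame to the index of its nearest centroid.

    features:  (T, D) array, one D-dimensional feature vector per frame.
    centroids: (K, D) array, a hypothetical pre-fitted k-means codebook.
    Returns a (T,) array of integer unit IDs in [0, K).
    """
    # Broadcast to (T, K, D), then reduce to squared Euclidean distances (T, K).
    diffs = features[:, None, :] - centroids[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=1)


# Toy data: 3 centroids and 4 frames in a 2-dimensional feature space.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [3.8, 4.1], [0.2, 0.0]])

units = quantize_to_units(features, centroids)
print(units.tolist())  # unit ID per frame
```

The resulting integer sequence plays the role that text plays in ordinary TTS training: it carries content without needing a human transcript.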

Real-world Data

Adaptive Text-to-Speech

Transcript: This audio was generated by UnitSpeech for Barack Obama.

Reference	UnitSpeech
Barack Obama (01:07 ~ 01:17)

Transcript: We use ten second untranscribed speech from Sonny’s interview.

Reference	UnitSpeech
Heung-min Son (00:04 ~ 00:14)

Transcript: We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes the pre-trained diffusion-based text-to-speech (TTS) model using minimal untranscribed data.

Reference	UnitSpeech
Gollum (00:30 ~ 00:40)

Voice Conversion (any-to-any)

The source transcript is shown for reference only; we did not use it to generate the samples below.

Transcript: Einstein's theory of relativity is e equals m c squared.

Reference	Source	UnitSpeech
Steve Jobs (00:55 ~ 01:05)	Tom Hiddleston (00:34 ~ 00:41)

The source transcript is shown for reference only; we did not use it to generate the samples below.

Transcript: We are taking steps to make today's commencement feel as authentic as possible.

Reference	Source	UnitSpeech
Emma Watson (03:30 ~ 03:40)	Conan O'Brien (03:08 ~ 03:13)

The source transcript is shown for reference only; we did not use it to generate the samples below.

Transcript: And always have the courage to be yourself. Most importantly, you have to do what you love.

Reference	Source	UnitSpeech
Marge Simpson (00:42 ~ 00:52)	Donald Trump (19:44 ~ 19:54.3)

LibriTTS Dataset

Adaptive Text-to-Speech

Transcript: As a matter of fact, the drawn curtain disclosed nothing but three or four suits of clothes hanging from a line of pegs.

Sampling Rate	Reference	GT	GT Mel+HiFi-GAN	UnitSpeech	Guided-TTS 2	Guided-TTS 2 (zero-shot)	YourTTS
22,050Hz
16,000Hz

Transcript: There was a savory stew, smoking hot, a dish of blue peas, a bowl of sweet milk of a delicate blue tint and a blue pudding with blue plums in it.

Sampling Rate	Reference	GT	GT Mel+HiFi-GAN	UnitSpeech	Guided-TTS 2	Guided-TTS 2 (zero-shot)	YourTTS
22,050Hz
16,000Hz

Transcript: Nevertheless, when the end of the summer came and the only opening facing her was the teaching of children at Miss Smith’s experiment in the Alabama swamps, it must be frankly confessed that Miss Taylor was disappointed.

Sampling Rate	Reference	GT	GT Mel+HiFi-GAN	UnitSpeech	Guided-TTS 2	Guided-TTS 2 (zero-shot)	YourTTS
22,050Hz
16,000Hz

Voice Conversion (any-to-any)

The source transcript is shown for reference only; we did not use it to generate the samples below.

Transcript: She wandered in the land of clouds thro' valleys dark, listning Dolors and lamentations: waiting oft beside the dewy grave She stood in silence, listning to the voices of the ground, Till to her own grave plot she came, and there she sat down. And heard this voice of sorrow breathed from the hollow pit.

Sampling Rate	Reference	Source	UnitSpeech	DiffVC	BNE-PPG-VC	YourTTS
22,050Hz
16,000Hz

The source transcript is shown for reference only; we did not use it to generate the samples below.

Transcript: Instead of shoes, the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid.

Sampling Rate	Reference	Source	UnitSpeech	DiffVC	BNE-PPG-VC	YourTTS
22,050Hz
16,000Hz

The source transcript is shown for reference only; we did not use it to generate the samples below.

Transcript: Hans pointed with his finger at a dark mass six hundred yards away, rising and falling alternately with heavy plunges.

Sampling Rate	Reference	Source	UnitSpeech	DiffVC	BNE-PPG-VC	YourTTS
22,050Hz
16,000Hz

Other Unit-based Task

Adaptive Speech Synthesis Module for Speech-to-Unit Translation (S2UT)

For this experiment, we trained UnitSpeech using 1000 clustered units commonly used in speech-to-unit translation.
Existing speech-to-unit models use a unit-HiFi-GAN trained on a single speaker (LJSpeech) as the speech synthesis module when generating speech in the target language.
By replacing the unit-HiFi-GAN with UnitSpeech and combining it with a pre-trained speech-to-unit model, we show that speech-to-speech translation can be personalized.
We conducted an experiment using the CoVoST 2 dataset.
Due to imperfections in the pre-trained speech-to-unit model, UnitSpeech may generate speech that contains translation errors.
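
The module swap described above can be sketched as a two-stage pipeline in which only the unit-to-speech stage changes. Everything in this sketch is hypothetical: `stub_speech_to_unit`, `single_speaker_vocoder`, `make_adapted_synthesizer`, and the `Speech` placeholder are illustrative stand-ins, not the actual S2UT model, unit-HiFi-GAN, or UnitSpeech interfaces, and no audio is produced.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

NUM_UNITS = 1000  # size of the unit vocabulary used in this experiment


@dataclass(frozen=True)
class Speech:
    """Placeholder for synthesized audio: records the voice and the units rendered."""
    voice: str
    units: tuple


UnitSynthesizer = Callable[[Sequence[int]], Speech]


def stub_speech_to_unit(source_audio: Sequence[float]) -> List[int]:
    """Hypothetical stand-in for a pre-trained speech-to-unit translation model."""
    # Deterministically map each input sample to a unit ID (illustration only).
    return [int(round(abs(x) * 100)) % NUM_UNITS for x in source_audio]


def single_speaker_vocoder(units: Sequence[int]) -> Speech:
    """Stand-in for a synthesis module that always speaks in one fixed voice."""
    return Speech("single-speaker", tuple(units))


def make_adapted_synthesizer(speaker: str) -> UnitSynthesizer:
    """Stand-in for a synthesis module adapted to a reference speaker."""
    def synthesize(units: Sequence[int]) -> Speech:
        return Speech(speaker, tuple(units))
    return synthesize


def translate(source_audio: Sequence[float],
              speech_to_unit: Callable[[Sequence[float]], List[int]],
              synthesizer: UnitSynthesizer) -> Speech:
    """Run speech-to-unit translation, then render the units with any synthesizer."""
    units = speech_to_unit(source_audio)
    if any(not 0 <= u < NUM_UNITS for u in units):
        raise ValueError("unit ID outside the expected vocabulary")
    return synthesizer(units)


source = [0.12, 0.5, 0.07]
baseline = translate(source, stub_speech_to_unit, single_speaker_vocoder)
personalized = translate(source, stub_speech_to_unit,
                         make_adapted_synthesizer("reference speaker"))
```

Because both synthesizers consume the same unit sequence, the translated content is identical in the two outputs and only the voice differs. Any translation error made by the first stage is carried through unchanged by either synthesizer.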


Source text: Algunos de los Oficiales actuales pertenecen a esas pasadas generaciones.
Reference translation: Some of the current Officials belong to these past generations.

Sampling Rate	Source (ES)	S2UT + UnitSpeech (EN)	S2UT + unit-HiFi-GAN (EN)
22,050Hz
16,000Hz


Source text: Se usan complejos modelos de software, como el modelo de clima global.
Reference translation: Complex software models such as the global climate model are used.

Sampling Rate	Source (ES)	S2UT + UnitSpeech (EN)	S2UT + unit-HiFi-GAN (EN)
22,050Hz
16,000Hz


Source text: Habita en Guinea Ecuatorial, Camerún, República Centroafricana y Gabón.
Reference translation: Living in Equatorial Guinea, Cameroon, Central African Republic and Gabon.

Sampling Rate	Source (ES)	S2UT + UnitSpeech (EN)	S2UT + unit-HiFi-GAN (EN)
22,050Hz
16,000Hz

Citation

@misc{kim2023unitspeech,
      title={UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data}, 
      author={Heeseung Kim and Sungwon Kim and Jiheum Yeom and Sungroh Yoon},
      year={2023},
      eprint={2306.16083},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}