UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data (INTERSPEECH 2023)

Authors

Heeseung Kim, Sungwon Kim, Jiheum Yeom, Sungroh Yoon

Abstract

We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based text-to-speech (TTS) model using minimal untranscribed data. To achieve this, we use the self-supervised unit representation as a pseudo transcript and integrate a unit encoder into the pre-trained TTS model. We train the unit encoder to provide speech content to the diffusion-based decoder, and then fine-tune the decoder to adapt to the reference speaker using a single <unit, speech> pair. UnitSpeech performs speech synthesis tasks such as TTS and voice conversion (VC) in a personalized manner without requiring the model to be re-trained for each task. On personalized TTS and any-to-any VC, UnitSpeech achieves results comparable or superior to previous baselines. Our model also adapts well to real-world data and to other tasks that take a unit sequence as input.
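
To make the idea of a unit sequence acting as a pseudo transcript more concrete, the sketch below shows one common preprocessing step in unit-based pipelines: collapsing a frame-level sequence of discrete unit IDs into de-duplicated units plus per-unit durations. This is an illustrative assumption, not the authors' code; the function names `squeeze_units` and `expand_units` and the unit IDs are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def squeeze_units(frame_units: List[int]) -> Tuple[List[int], List[int]]:
    """Collapse runs of identical frame-level units.

    Returns the de-duplicated unit sequence and, for each unit,
    the number of consecutive frames it covered (its duration).
    """
    units: List[int] = []
    durations: List[int] = []
    for unit, run in groupby(frame_units):
        units.append(unit)
        durations.append(sum(1 for _ in run))
    return units, durations


def expand_units(units: List[int], durations: List[int]) -> List[int]:
    """Inverse of squeeze_units: repeat each unit for its duration."""
    if len(units) != len(durations):
        raise ValueError("units and durations must have the same length")
    frames: List[int] = []
    for unit, duration in zip(units, durations):
        frames.extend([unit] * duration)
    return frames


if __name__ == "__main__":
    # Hypothetical frame-level unit IDs (made up for illustration).
    frames = [5, 5, 5, 12, 12, 7, 5, 5]
    units, durations = squeeze_units(frames)
    print(units, durations)  # [5, 12, 7, 5] [3, 2, 1, 2]
    assert expand_units(units, durations) == frames
```

The squeezed units play the role that phonemes play in a text transcript, while the durations record how long each unit lasts, so no text is needed to describe the content of the reference speech.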

Real-world Data

Adaptive Text-to-Speech

Transcript: This audio was generated by UnitSpeech for Barack Obama.

Reference UnitSpeech
Barack Obama (01:07 ~ 01:17)

Transcript: We use ten second untranscribed speech from Sonny’s interview.

Reference UnitSpeech
Heung-min Son (00:04 ~ 00:14)

Transcript: We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes the pre-trained diffusion-based text-to-speech (TTS) model using minimal untranscribed data.

Reference UnitSpeech
Gollum (00:30 ~ 00:40)

Voice Conversion (any-to-any)

Source transcript (not used to generate the samples below): Einstein's theory of relativity is e equals m c squared.
Reference | Source | UnitSpeech
Steve Jobs (00:55 ~ 01:05) | Tom Hiddleston (00:34 ~ 00:41)
Source transcript (not used to generate the samples below): We are taking steps to make today's commencement feel as authentic as possible.
Reference | Source | UnitSpeech
Emma Watson (03:30 ~ 03:40) | Conan O'Brien (03:08 ~ 03:13)
Source transcript (not used to generate the samples below): And always have the courage to be yourself. Most importantly, you have to do what you love.
Reference | Source | UnitSpeech
Marge Simpson (00:42 ~ 00:52) | Donald Trump (19:44 ~ 19:54.3)

LibriTTS Dataset

Adaptive Text-to-Speech

Transcript: As a matter of fact, the drawn curtain disclosed nothing but three or four suits of clothes hanging from a line of pegs.

Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | UnitSpeech | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS
22,050 Hz
16,000 Hz

Transcript: There was a savory stew, smoking hot, a dish of blue peas, a bowl of sweet milk of a delicate blue tint and a blue pudding with blue plums in it.

Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | UnitSpeech | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS
22,050 Hz
16,000 Hz

Transcript: Nevertheless, when the end of the summer came and the only opening facing her was the teaching of children at Miss Smith’s experiment in the Alabama swamps, it must be frankly confessed that Miss Taylor was disappointed.

Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | UnitSpeech | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS
22,050 Hz
16,000 Hz

Voice Conversion (any-to-any)

Source transcript (not used to generate the samples below): She wandered in the land of clouds thro' valleys dark, listning Dolors and lamentations: waiting oft beside the dewy grave She stood in silence, listning to the voices of the ground, Till to her own grave plot she came, and there she sat down. And heard this voice of sorrow breathed from the hollow pit.
Sampling Rate | Reference | Source | UnitSpeech | DiffVC | BNE-PPG-VC | YourTTS
22,050 Hz
16,000 Hz
Source transcript (not used to generate the samples below): Instead of shoes, the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid.
Sampling Rate | Reference | Source | UnitSpeech | DiffVC | BNE-PPG-VC | YourTTS
22,050 Hz
16,000 Hz
Source transcript (not used to generate the samples below): Hans pointed with his finger at a dark mass six hundred yards away, rising and falling alternately with heavy plunges.
Sampling Rate | Reference | Source | UnitSpeech | DiffVC | BNE-PPG-VC | YourTTS
22,050 Hz
16,000 Hz

Other Unit-based Task

Adaptive Speech Synthesis Module for Speech-to-Unit Translation (S2UT)

Source text: Algunos de los Oficiales actuales pertenecen a esas pasadas generaciones.
Reference translation: Some of the current Officials belong to these past generations.
Sampling Rate | Source (ES) | S2UT + UnitSpeech (EN) | S2UT + unit-HiFi-GAN (EN)
22,050 Hz
16,000 Hz
Source text: Se usan complejos modelos de software, como el modelo de clima global.
Reference translation: Complex software models such as the global climate model are used.
Sampling Rate | Source (ES) | S2UT + UnitSpeech (EN) | S2UT + unit-HiFi-GAN (EN)
22,050 Hz
16,000 Hz
Source text: Habita en Guinea Ecuatorial, Camerún, República Centroafricana y Gabón.
Reference translation: Living in Equatorial Guinea, Cameroon, Central African Republic and Gabon.
Sampling Rate | Source (ES) | S2UT + UnitSpeech (EN) | S2UT + unit-HiFi-GAN (EN)
22,050 Hz
16,000 Hz

Citation

@misc{kim2023unitspeech,
      title={UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data}, 
      author={Heeseung Kim and Sungwon Kim and Jiheum Yeom and Sungroh Yoon},
      year={2023},
      eprint={2306.16083},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}