Is Speaking Style Transfer Working?



It’s frustrating when text-to-speech sounds off.

It would be amazing if your text-to-speech voice could speak in a variety of speaking styles.

Support for multiple speaking styles is often a requirement. For example, a text-to-speech voice should be able to convey a serious tone, while for entertaining content it should signal lightheartedness.

In this use case, the two required speaking styles are near opposites. To develop a voice model that can properly express both, you normally need to record speech data for each speaking style.

But wouldn’t it be amazing if you could train a model on existing data instead? Parties interested in creating text-to-speech models are often looking for a way to use default speaking style modes or SSML tags so that their model can express various speaking styles within an application.

To realize this, we need cross-speaker style transfer. This means capturing a style from one voice, or a group of voices, and transferring it to a new text-to-speech voice model.
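
One way to picture the idea is to treat "who is speaking" and "how it is spoken" as two separate vectors that are combined before synthesis. The toy sketch below only illustrates that separation. It is not any vendor's method, the numbers are made up, and the names (mean_vector, conditioning_vector, serious_clips, target_speaker) are hypothetical. In a real system these vectors would be learned by neural encoders rather than written by hand.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def conditioning_vector(speaker_embedding, style_embedding):
    """Concatenate speaker identity and speaking style into one model input."""
    return list(speaker_embedding) + list(style_embedding)


# Assumption: style embeddings already extracted from "serious" clips
# recorded by OTHER voices (values are invented for illustration).
serious_clips = [[1.0, 0.0], [0.5, 0.5]]

# Assumption: speaker embedding of the new voice, which has no
# serious-style recordings of its own.
target_speaker = [0.25, 0.5, 0.75]

# The style is captured from the source voices...
serious_style = mean_vector(serious_clips)

# ...and paired with the new voice's identity.
conditioning = conditioning_vector(target_speaker, serious_style)
print(conditioning)
```

The point of keeping the two vectors apart is that the style part can come from any speaker, while the identity part always comes from the target voice.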

Companies that provide text-to-speech solutions are investing a lot of research time and effort in this subject, so that new text-to-speech voices can express various speaking styles without needing more speech data.

Cross-speaker style transfer is crucial for applying multi-style, expressive speech synthesis at scale.

With style transfer, the target speakers do not have to be experts at expressing every style, and you do not have to collect recordings of them in each style for model training.

However, the performance of existing style transfer methods still falls well short of real-world needs.

SSML and other interface features that change prosody are often a great help when creating text-to-speech audio, but they don’t always work. They are often not enough to get the delivery exactly the way you want.
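
For readers who have not seen SSML prosody controls, here is a minimal sketch of what such markup can look like, built with Python's standard library. It uses only the standard SSML elements prosody and emphasis. The helper name build_ssml and its parameters are hypothetical, and which attribute values a given text-to-speech engine actually honors is an assumption you would need to check against that engine's documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, rate="medium", pitch="medium", emphasize=None):
    """Wrap text in SSML <speak>/<prosody>, optionally stressing one phrase."""
    speak = ET.Element("speak", {
        "version": "1.1",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})

    if emphasize and emphasize in text:
        # Split around the first occurrence of the phrase to be stressed.
        before, _, after = text.partition(emphasize)
        prosody.text = before
        emphasis = ET.SubElement(prosody, "emphasis", {"level": "strong"})
        emphasis.text = emphasize
        emphasis.tail = after
    else:
        prosody.text = text

    return ET.tostring(speak, encoding="unicode")


# A slower, lower delivery for a serious message.
serious = build_ssml("We take this very seriously.",
                     rate="slow", pitch="low", emphasize="very")

# A faster, higher delivery for lighthearted content.
playful = build_ssml("Guess what happened next!", rate="fast", pitch="high")

print(serious)
print(playful)
```

Markup like this adjusts speed, pitch, and stress, but it does not change the underlying speaking style of the voice, which is why it often falls short on its own.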

To get the right emphasis and pronunciation, you can use a reference audio clip of a speaker delivering the sentence the way you want it, and have your text-to-speech model follow that delivery in its output.

As a way to repair delivery, it’s a great solution that gives you more flexibility.