Speech synthesis is the artificial conversion of text into speech. It is needed in applications where speech is the most convenient modality of interaction with the user, for instance in hands-free user interfaces, voice-driven services, and applications aimed at blind users. Voice conversion, in turn, aims to make speech spoken by one speaker sound like the speech of a specific target speaker while preserving the content of the utterance.
Corpus-based speech synthesis
At TUT, speech synthesis research focuses on corpus-based methods, i.e. methods that use real recorded speech data as a basis. The two most widely studied corpus-based synthesis approaches are unit selection and hidden Markov model (HMM) based synthesis. In concatenative unit selection, synthetic speech is formed by copying and pasting speech segments from a speech database. In HMM-based synthesis, we employ HMMs, which have traditionally been used in speech recognition, to learn statistical models for the speech features of a speech database (e.g. spectrum, pitch, and phone durations). The resulting models are then used in synthesis to generate artificial speech parameterizations.
- Demo: Prediction of voice aperiodicity in HMM-based speech synthesis
- Demo: Parameterization of vocal fry in HMM-based speech synthesis
- Demo: Effect of training database size in HMM-based speech synthesis
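The "copying and pasting" step of unit selection can be illustrated with a small search over candidate segments. The sketch below is not the system used at TUT. It assumes a toy database in which every unit has a single pitch value and one-dimensional spectral values at its edges, and it assumes the common formulation in which a target cost and a join cost are minimized with dynamic programming. The names Unit, target_cost, join_cost, and select_units are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Unit:
    """Hypothetical database unit: one recorded segment with toy features."""
    name: str          # identifier of the recorded segment
    pitch: float       # average pitch of the segment in Hz
    start_spec: float  # toy 1-D spectral value at the segment start
    end_spec: float    # toy 1-D spectral value at the segment end


def target_cost(unit: Unit, wanted_pitch: float) -> float:
    """Distance between the candidate and the requested pitch (assumed cost)."""
    return abs(unit.pitch - wanted_pitch)


def join_cost(left: Unit, right: Unit) -> float:
    """Spectral mismatch at the concatenation point (assumed cost)."""
    return abs(left.end_spec - right.start_spec)


def select_units(candidates: Sequence[Sequence[Unit]],
                 wanted_pitches: Sequence[float]) -> List[Unit]:
    """Pick one unit per position so that total target + join cost is minimal."""
    if not candidates or len(candidates) != len(wanted_pitches):
        raise ValueError("need one candidate list per target position")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")

    # best[i][j]: cheapest cost of any path ending in candidate j at position i
    best = [[target_cost(u, wanted_pitches[0]) for u in candidates[0]]]
    # back[i][j]: index of the predecessor on that cheapest path
    back = [[-1] * len(candidates[0])]

    for i in range(1, len(candidates)):
        row, ptr = [], []
        for unit in candidates[i]:
            costs = [best[i - 1][k] + join_cost(prev, unit)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(unit, wanted_pitches[i]))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)

    # Trace the cheapest path back from the last position to the first.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path


if __name__ == "__main__":
    db = [
        [Unit("a1", 100.0, 0.0, 1.0), Unit("a2", 120.0, 0.0, 5.0)],
        [Unit("b1", 110.0, 5.0, 0.0), Unit("b2", 110.0, 1.0, 0.0)],
    ]
    print([u.name for u in select_units(db, [100.0, 110.0])])
```

In a real system the costs would be computed from richer spectral, prosodic, and contextual features, and the candidate lists would come from a large recorded corpus. The search structure is the part this sketch is meant to show.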
Hybrid-form synthesis combining unit selection and HMM-based synthesis
Both unit selection and HMM-based synthesis have their benefits and challenges. The most recent approach, HMM-based unit selection, aims to combine the strengths of the two: the smooth overall quality of HMM-based synthesis and the high segmental quality of unit selection. At TUT, we have made progress on HMM-based unit selection by employing multiform synthesis, in which poor-quality units are replaced with segments generated by the underlying HMM-based approach.
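The replacement step of multiform synthesis can be sketched as follows. This is only an illustration under stated assumptions: the text above does not say how poor-quality units are detected, so the per-segment quality score, the threshold, and the names Segment, multiform, and generate_from_hmm are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Segment:
    """Hypothetical synthesis segment."""
    label: str      # phone or unit label
    source: str     # "unit" for recorded audio, "hmm" for model-generated
    quality: float  # assumed quality score in [0, 1]; higher is better


def multiform(selected: List[Segment],
              generate_from_hmm: Callable[[str], Segment],
              threshold: float = 0.5) -> List[Segment]:
    """Keep good recorded units and replace poor ones with HMM-generated ones."""
    return [seg if seg.quality >= threshold else generate_from_hmm(seg.label)
            for seg in selected]


if __name__ == "__main__":
    chosen = [
        Segment("h", "unit", 0.9),
        Segment("e", "unit", 0.2),  # poor unit, will be replaced
        Segment("j", "unit", 0.7),
    ]

    def fake_hmm(label: str) -> Segment:
        # Stand-in for parameter generation from trained HMMs.
        return Segment(label, "hmm", 0.6)

    print([s.source for s in multiform(chosen, fake_hmm)])
```

The result is a sequence that mixes recorded and model-generated segments, which is the sense in which the output is "multiform".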
Voice conversion
For a human, it is easy to distinguish the speaker identity from the lexical content of speech: we can readily tell both who is speaking and what is being said. The aim of voice conversion is to convert the speech spoken by one speaker so that it sounds like the speech of a specific target speaker, while maintaining the content of the utterance. It can be used, for example, to create a variety of synthesis voices for text-to-speech (TTS) systems and to generate several voices from a single speaker for dubbing purposes. In addition, knowledge of how to separate speaker identity from speech content can lead to improved results in speaker and speech recognition.