Evaluating the Impact of Prosody Feature Normalization on the Controllability of Pitch in Speech Synthesis

Recent neural text-to-speech (TTS) models can synthesize highly natural speech using deep learning techniques. In practical applications, it can be desirable to have explicit control over the prosody (speech rate, fundamental frequency, and energy) of the synthesized speech. Such controllability can be achieved by adding prosody prediction modules, whose main purpose is to estimate plausible prosody features for each phoneme in the text input. This explicit modeling also allows prosody features to be changed at inference time, which in turn adjusts the prosody of the synthesized audio. In this paper, we evaluate to what extent deliberate manipulation of such prosody features is reflected in the resulting speech audio. We focus particularly on changing the pitch (i.e., fundamental frequency) while applying different normalization strategies.
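
As an illustration of why the normalization strategy matters for pitch control, the following minimal sketch shifts per-phoneme pitch values by a number of semitones when the model works on normalized features. It assumes z-score normalization with a known mean and standard deviation in Hz; that choice, the function names, and all numeric values are hypothetical and are not taken from the paper.

```python
import numpy as np

def normalize(f0_hz, mean, std):
    """Map F0 values in Hz to z-scores (assumed normalization strategy)."""
    return (f0_hz - mean) / std

def denormalize(z, mean, std):
    """Map z-scores back to F0 values in Hz."""
    return z * std + mean

def shift_normalized_pitch(z, semitones, mean, std):
    """Shift normalized pitch features by a musical interval.

    A shift of n semitones multiplies the frequency by 2 ** (n / 12).
    This factor applies in Hz, not to z-scores, so the features are
    denormalized, scaled, and normalized again.
    """
    f0_hz = denormalize(z, mean, std)
    shifted_hz = f0_hz * 2.0 ** (semitones / 12.0)
    return normalize(shifted_hz, mean, std)

# Hypothetical per-phoneme F0 predictions and normalization statistics.
f0_hz = np.array([110.0, 123.5, 130.8, 98.0])
mean, std = 120.0, 20.0

z = normalize(f0_hz, mean, std)
z_up = shift_normalized_pitch(z, semitones=12, mean=mean, std=std)

# One octave up should double every frequency after denormalization.
f0_up_hz = denormalize(z_up, mean, std)
```

Multiplying the z-scores themselves by the same factor would generally not produce the intended interval, because the offset introduced by the mean is scaled as well.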

Metadata
Author:	Judith Bauer, Frank Zalkow, Meinard Müller, Christian Dittmar
URN:	urn:nbn:de:bvb:898-opus4-70976
DOI:	https://doi.org/10.35096/othr/pub-7097
ISBN:	978-3-95908-325-6
Parent Title (German):	Elektronische Sprachsignalverarbeitung 2024, Tagungsband der 35. Konferenz, Regensburg, 6.-8. März 2024
Publisher:	TUDpress
Place of publication:	Dresden
Editor:	Timo Baumann
Document Type:	conference proceeding (article)
Language:	English
Year of first Publication:	2024
Publishing Institution:	Ostbayerische Technische Hochschule Regensburg
Release Date:	2024/03/08
First Page:	188
Last Page:	195
Other series:	Studientexte zur Sprachkommunikation ; 107
Institutes:	Fakultät Informatik und Mathematik
Research focus:	Information and Communication
Licence (German):	No license - German copyright law applies: § 53 UrhG

Open Access