Reichel, Uwe D. (18 March 2010): Datenbasierte und linguistisch interpretierbare Intonationsmodellierung. Dissertation, Ludwig-Maximilians-Universität München

Abstract

In this thesis, a data-driven and linguistically interpretable intonation model for the automatic analysis and synthesis of fundamental frequency (F0) contours was developed.

The intonation model: The model can be characterised as parametric, contour-based, and superpositional. F0 contours are treated as a superposition of global and local components. These components are anchored in a hierarchical prosodic structure defined by global and local segments, which correspond roughly to intonation phrases and accent groups respectively. The F0 contours are stylised as follows: within each global segment, a linear F0 base contour is fitted. After subtraction of this global baseline, a third-order polynomial is fitted to the F0 residual within each local segment. Subsequently, a symbolic description of the intonation inventory in the form of global and local contour classes is derived by clustering the polynomial coefficients. On the phonetic level, linear regression models adjust these abstract units to the respective prosodic context. In its parametric and contour-based description, the model stands in the tradition of Fujisaki (1987), Möhler (1998b) and Taylor (2000); in its use of superposition, it stands in the tradition of Fujisaki (1987). As in Möhler and Conkie (1998), the stylisation parameters are clustered.

The approach chosen here benefits intonation research in the following respects:

(1) The requirements for data preprocessing are comparatively low. F0 stylisation was carried out in F0 sections at syllable nuclei, rendering an exact syllable segmentation unnecessary. The extraction of the prosodic structure is restricted to prosodic phrase boundaries guided by signal pauses, punctuation and part-of-speech information. Pitch accent localisation and classification are not needed. As a result, the preprocessing steps can be fully automated with acceptable quality, so that no manual data preparation by experts is required. This property allows for fast adaptation of the model to new speech data and avoids inconsistencies caused by incomplete inter-labeller agreement. Because the prosodic structure is partly defined on the basis of text, automatic preprocessing includes a signal-text alignment, which is needed for subsequent linguistic interpretation.

(2) In contrast to the more complex stylisation functions of the models mentioned above, the polynomial stylisation chosen in this study guarantees an analytic approximation and thus a biunique (one-to-one) relation between the F0 contour to be modelled and its abstraction. This property is essential for partitioning the F0 stylisations into intonation classes based on their contour similarity, as well as for later linguistic interpretation. At the same time, the chosen polynomial order is capable of capturing F0-coded prominence and boundary behaviour.

Linguistic interpretation: The linguistic interpretability of local contour classes was examined for the concepts of significance, informational novelty, and utterance finality. The approach chosen here can be described as follows: first, automatic linguistic corpus analyses generate hypotheses about possible relations between contour classes and linguistic concepts. These hypotheses are subsequently tested in perception experiments. By these means, a systematic linguistic anchoring of the model was achieved in the form of a decision tree that predicts the linguistically appropriate contour class. The adequacy of its predictions was confirmed by a further perception test.

Conclusion: It has been shown that it is possible to build a perceptually acceptable and linguistically interpretable representation of intonation in a purely data-driven manner. This bottom-up approach guarantees consistency and easy adaptability of the model to new data. Because it is both close to the signal and linguistically anchored, the model covers the entire chain from text to signal and can therefore be used for intonation analysis and generation on a linguistic as well as on a phonetic-acoustic level. It is suitable for use in speech technology applications as well as in basic phonetic research for the automatic analysis of raw speech data.
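
The two-step stylisation described in the abstract (linear baseline per global segment, third-order polynomial on the residual per local segment) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names `stylise` and `resynthesise`, the representation of local segments as index pairs, the normalisation of time to [-1, 1] within each local segment, and the synthetic test contour are all assumptions made for illustration. The sketch also ignores the restriction to F0 sections at syllable nuclei.

```python
import numpy as np


def stylise(t, f0, local_bounds):
    """Stylise one global segment (hypothetical helper, not from the thesis).

    t, f0        -- 1-D arrays of time stamps and F0 values
    local_bounds -- list of (start, end) sample index pairs, end exclusive
    Returns the two baseline coefficients and one row of four
    polynomial coefficients per local segment.
    """
    # Global component: least-squares line over the whole global segment.
    base_coef = np.polyfit(t, f0, 1)
    residual = f0 - np.polyval(base_coef, t)

    local_coefs = []
    for start, end in local_bounds:
        # Assumption: time is normalised to [-1, 1] inside each local
        # segment so that coefficients of segments with different
        # durations are comparable.
        tn = np.linspace(-1.0, 1.0, end - start)
        # Local component: third-order polynomial on the F0 residual.
        local_coefs.append(np.polyfit(tn, residual[start:end], 3))
    return base_coef, np.array(local_coefs)


def resynthesise(t, base_coef, local_coefs, local_bounds):
    """Rebuild an F0 contour by superposing baseline and local contours."""
    f0 = np.polyval(base_coef, t)
    for coef, (start, end) in zip(local_coefs, local_bounds):
        tn = np.linspace(-1.0, 1.0, end - start)
        f0[start:end] += np.polyval(coef, tn)
    return f0


# Synthetic example: a declining baseline with two accent-like bumps.
t = np.linspace(0.0, 2.0, 200)
f0 = 200.0 - 20.0 * t
f0 += 15.0 * np.exp(-((t - 0.5) ** 2) / 0.02)
f0 += 10.0 * np.exp(-((t - 1.5) ** 2) / 0.02)
bounds = [(0, 100), (100, 200)]

base_coef, local_coefs = stylise(t, f0, bounds)
approx = resynthesise(t, base_coef, local_coefs, bounds)
rmse = float(np.sqrt(np.mean((approx - f0) ** 2)))
```

Each row of `local_coefs` is the kind of coefficient vector that a clustering step could group into local contour classes; the clustering method itself is not specified in the abstract and is therefore left out here. Because both fits are least-squares problems with a unique solution (given enough distinct sample points), each contour section maps to exactly one coefficient vector, which is one way to read the one-to-one relation mentioned under (2).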
