Discovering Structural Similarities in Narrative Texts using Event Alignment Algorithms

Reiter, Nils

Citation of documents: Please do not cite the URL displayed in your browser's address bar; instead, use the DOI, URN or persistent URL below, whose long-term accessibility we can guarantee.

DOI: 10.11588/heidok.00017042
URN: urn:nbn:de:bsz:16-heidok-170421
URL: http://www.ub.uni-heidelberg.de/archiv/17042

Abstract

This thesis addresses the discovery of structural similarities across narrative texts. We describe a method based on event alignments that are created automatically on automatically preprocessed texts, which opens up a path to large-scale empirical research on structural similarities across texts.

Structural similarities are of interest in many areas of the humanities and social sciences. We focus on folkloristics and ritual research as application scenarios. Folkloristics studies folktales, i.e., tales that have been passed down orally for a long time. Similarities across different folktales have been observed both at the level of individual events (being abandoned in the woods) or participants (the gingerbread house) and at the structural level: events do not happen at random, but in a certain order. Rituals are an omnipresent part of human behavior and are studied in ethnology, the social sciences and history. Similarities across types of rituals have been observed and have sparked a discussion about the structural principles that govern how individual ritual elements are combined into rituals.

As descriptions of rituals feature many uncommon language constructions, we also discuss methods of domain adaptation for adapting existing NLP components to the domain of rituals. We mainly use supervised methods and employ retraining as a means of adaptation, which presupposes annotating small amounts of domain data. We discuss the following linguistic levels: part-of-speech tagging, chunking, dependency parsing, word sense disambiguation, semantic role labeling and coreference resolution. On all levels, we have achieved improvements. We also describe how these annotation levels are brought together in a single, integrated discourse representation that serves as the basis for further experiments.

To discover structural similarities, we employ three different alignment algorithms and use them to align semantically similar events. Sequence alignment (Needleman-Wunsch) is a classic algorithm with limited capabilities. For comparison, we use a graph-based event alignment system that was developed for newspaper texts. As a third algorithm, we employ Bayesian model merging, which induces a hidden Markov model from which we extract an alignment.

We evaluate the algorithms in two experiments. In the first, we evaluate against a gold standard of aligned descriptions of rituals. Bayesian model merging achieves the best results, as measured by the BLANC metric. Because creating an event alignment gold standard is difficult, the second experiment is based on cluster induction. Although this is not a strict evaluation of structural similarities, it gives some insight into the behavior of the algorithms. We induce a document similarity measure from the generated alignments and use it to cluster the documents. The clustering is then compared against a gold standard classification of documents from both scenarios. In this experiment, the lemma alignment baseline achieves the best numerical performance on folktales (although, as it aligns lemmas instead of event representations, its expressiveness is limited), followed by predicate alignment, Bayesian model merging and Needleman-Wunsch. On descriptions of rituals, the predicate alignment algorithm outperforms all shallow and more specialized baselines. Shallow measures of semantic similarity between texts outperform the alignment-based algorithms on folktales, but they do not allow similarities to be localized exactly.

Finally, we present a graph-based algorithm that ranks events according to their participation in structurally similar regions across documents. This allows us to direct humanities researchers to interesting cases that are worth manual inspection.

Because the accessibility of results to humanities researchers is of utmost importance in digital humanities scenarios, we close the thesis with a showcase scenario in which we analyze descriptions of rituals using the alignment, clustering and event ranking algorithms described before. In this showcase, we show how results can be visualized and interpreted by researchers of rituals.
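
For readers unfamiliar with the sequence alignment mentioned in the abstract, the following is a minimal sketch of Needleman-Wunsch applied to two event sequences. It is not the thesis's implementation: the event labels, the `exact_match` similarity function, the gap penalty of -1.0 and the `alignment_similarity` measure are all illustrative assumptions. In particular, the thesis aligns semantically similar events, whereas this sketch only compares labels for equality.

```python
from typing import Callable, List, Optional, Tuple

Event = str  # assumption: an event is reduced to a plain label
Pair = Tuple[Optional[Event], Optional[Event]]  # None marks a gap


def exact_match(x: Event, y: Event) -> float:
    """Hypothetical similarity function: reward identical labels only."""
    return 1.0 if x == y else -1.0


def needleman_wunsch(
    a: List[Event],
    b: List[Event],
    sim: Callable[[Event, Event], float] = exact_match,
    gap: float = -1.0,
) -> Tuple[float, List[Pair]]:
    """Globally align two event sequences; return (score, aligned pairs)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # align both
                score[i - 1][j] + gap,  # a[i-1] stays unaligned
                score[i][j - 1] + gap,  # b[j-1] stays unaligned
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + sim(a[i - 1], b[j - 1])):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


def alignment_similarity(a: List[Event], b: List[Event]) -> float:
    """Assumed document similarity: share of identically aligned events."""
    if not a and not b:
        return 1.0
    _, pairs = needleman_wunsch(a, b)
    matched = sum(1 for x, y in pairs if x is not None and x == y)
    return matched / max(len(a), len(b))


# Two invented event sequences, loosely in the spirit of a folktale.
tale_a = ["abandon", "walk", "find_house", "eat", "capture", "escape"]
tale_b = ["abandon", "find_house", "capture", "trick", "escape"]

total_score, aligned = needleman_wunsch(tale_a, tale_b)
for left, right in aligned:
    print(f"{left or '-':<12}{right or '-'}")
```

A pairwise measure such as `alignment_similarity` could, under these assumptions, be fed into any standard clustering procedure, which is the general idea behind the second experiment. Because the algorithm preserves sequence order and cannot align crossing events, it also illustrates why the abstract describes its capabilities as limited.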

Document type:	Dissertation
Supervisor:	Frank, Prof. Dr. Anette
Date of thesis defense:	27 November 2013
Date Deposited:	25 Jun 2014 09:48
Date:	2014
Faculties / Institutes:	Neuphilologische Fakultät > Institut für Computerlinguistik
DDC-classification:	004 Data processing Computer science
Controlled Keywords:	Computerlinguistik, Narrativität, Digital Humanities