From Complex Sentences to a Formal Semantic Representation using Syntactic Text Simplification and Open Information Extraction

  • Sentences that present a complex linguistic structure act as a major stumbling block for Natural Language Processing (NLP) applications whose predictive quality deteriorates with sentence length and complexity. The task of Text Simplification (TS) may remedy this situation. It aims to modify sentences in order to make them easier to process, using a set of rewriting operations, such as reordering, deletion or splitting. These transformations are executed with the objective of converting the input into a simplified output, while preserving its main idea and keeping it grammatically sound. State-of-the-art syntactic TS approaches suffer from two major drawbacks: first, they follow a very conservative approach in that they tend to retain the input rather than transforming it, and second, they ignore the cohesive nature of texts, where context spread across clauses or sentences is needed to infer the true meaning of a statement. To address these problems, we present a discourse-aware TS framework that is able to split and rephrase complexSentences that present a complex linguistic structure act as a major stumbling block for Natural Language Processing (NLP) applications whose predictive quality deteriorates with sentence length and complexity. The task of Text Simplification (TS) may remedy this situation. It aims to modify sentences in order to make them easier to process, using a set of rewriting operations, such as reordering, deletion or splitting. These transformations are executed with the objective of converting the input into a simplified output, while preserving its main idea and keeping it grammatically sound. State-of-the-art syntactic TS approaches suffer from two major drawbacks: first, they follow a very conservative approach in that they tend to retain the input rather than transforming it, and second, they ignore the cohesive nature of texts, where context spread across clauses or sentences is needed to infer the true meaning of a statement. To address these problems, we present a discourse-aware TS framework that is able to split and rephrase complex English sentences within the semantic context in which they occur. By generating a fine-grained output with a simple canonical structure that is easy to analyze by downstream applications, we tackle the first issue. For this purpose, we decompose a source sentence into smaller units by using a linguistically grounded transformation stage. The result is a set of selfcontained propositions, with each of them presenting a minimal semantic unit. To address the second concern, we suggest not only to split the input into isolated sentences, but to also incorporate the semantic context in the form of hierarchical structures and semantic relationships between the split propositions. In that way, we generate a semantic hierarchy of minimal propositions that benefits downstream Open Information Extraction (IE) tasks. To function well, the TS approach that we propose requires syntactically well-formed input sentences. It targets generalpurpose texts in English, such as newswire or Wikipedia articles, which commonly contain a high proportion of complex assertions. In a second step, we present a method that allows state-of-the-art Open IE systems to leverage the semantic hierarchy of simplified sentences created by our discourseaware TS approach in constructing a lightweight semantic representation of complex assertions in the form of semantically typed predicate-argument structures. In that way, important contextual information of the extracted relations is preserved that allows for a proper interpretation of the output. Thus, we address the problem of extracting incomplete, uninformative or incoherent relational tuples that is commonly to be observed in existing Open IE approaches. Moreover, assuming that shorter sentences with a more regular structure are easier to process, the extraction of relational tuples is facilitated, leading to a higher coverage and accuracy of the extracted relations when operating on the simplified sentences. Aside from taking advantage of the semantic hierarchy of minimal propositions in existing Open IE Abstract approaches, we also develop an Open IE reference system, Graphene. It implements a relation extraction pattern upon the simplified sentences. The framework we propose is evaluated within our reference TS implementation DisSim. In a comparative analysis, we demonstrate that our approach outperforms the state of the art in structural TS both in an automatic and a manual analysis. It obtains the highest score on three simplification datasets from two different domains with regard to SAMSA (0.67, 0.57, 0.54), a recently proposed metric targeted at automatically measuring the syntactic complexity of sentences which highly correlates with human judgments on structural simplicity and grammaticality. These findings are supported by the ratings from the human evaluation, which indicate that our baseline implementation DisSim returns fine-grained simplified sentences that achieve a high level of syntactic correctness and largely preserve the meaning of the input. Furthermore, a comparative analysis with the annotations contained in the RST Discourse Treebank (RST-DT) reveals that we are able to capture the contextual hierarchy between the split sentences with a precision of approximately 90% and reach an average precision of almost 70% for the classification of the rhetorical relations that hold between them. Finally, an extrinsic evaluation shows that when applying our TS framework as a pre-processing step, the performance of state-ofthe-art Open IE systems can be improved by up to 32% in precision and 30% in recall of the extracted relational tuples. Accordingly, we can conclude that our proposed discourse-aware TS approach succeeds in transforming sentences that present a complex linguistic structure into a sequence of simplified sentences that are to a large extent grammatically correct, represent atomic semantic units and preserve the meaning of the input. Moreover, the evaluation provides sufficient evidence that our framework is able to establish a semantic hierarchy between the split sentences, generating a fine-grained representation of complex assertions in the form of hierarchically ordered and semantically interconnected propositions. Finally, we demonstrate that state-of-the-art Open IE systems benefit from using our TS approach as a pre-processing step by increasing both the accuracy and coverage of the extracted relational tuples for the majority of the Open IE approaches under consideration. In addition, we outline that the semantic hierarchy of simplified sentences can be leveraged to enrich the output of existing Open IE systems with additional meta information, thus transforming the shallow semantic representation of state-of-the-art approaches into a canonical context-preserving representation of relational tuples.show moreshow less

Download full text files

Export metadata

Additional Services

Share in Twitter Search Google Scholar
Metadaten
Author:Christina Niklaus
URN:urn:nbn:de:bvb:739-opus4-10540
Advisor:Siegfried Handschuh, Ralf Schenkel
Document Type:Doctoral Thesis
Language:English
Year of Completion:2022
Date of Publication (online):2022/03/29
Date of first Publication:2022/03/29
Publishing Institution:Universität Passau
Granting Institution:Universität Passau, Fakultät für Informatik und Mathematik
Date of final exam:2022/03/25
Release Date:2022/03/29
Tag:Complex Sentences; Open Information Extraction; Semantic Representation; Syntactic Simplification; Text Simplification
Page Number:xxi, 301 Seiten
Institutes:Fakultät für Informatik und Mathematik
Dewey Decimal Classification:0 Informatik, Informationswissenschaft, allgemeine Werke / 00 Informatik, Wissen, Systeme / 000 Informatik, Informationswissenschaft, allgemeine Werke
open_access (DINI-Set):open_access
Licence (German):License LogoCreative Commons - CC BY - Namensnennung 4.0 International