Multi-word tokenization for natural language processing

Michelbacher, Lukas

Please use this identifier to cite or link to this resource: http://dx.doi.org/10.18419/opus-3208

Author(s):	Michelbacher, Lukas
Title:	Multi-word tokenization for natural language processing
Other titles:	Mehrworttokenisierung für maschinelle Sprachverarbeitung
Date of publication:	2013
Document type:	Dissertation
URI:	http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-87466 http://elib.uni-stuttgart.de/handle/11682/3225 http://dx.doi.org/10.18419/opus-3208
Abstract:	Sophisticated natural language processing (NLP) applications are entering everyday life in the form of translation services, electronic personal assistants and open-domain question answering systems. As voice-operated applications like these become commonplace, users increasingly expect to communicate with them in unrestricted natural language, just as in a normal conversation. One obstacle that keeps computers from understanding unrestricted natural language is collocations: combinations of multiple words that have idiosyncratic properties, for example, red tape, kick the bucket or there's no use crying over spilled milk. Automatic processing of collocations is nontrivial because these properties cannot be predicted from the properties of the individual words. This thesis addresses multi-word units (MWUs), collocations that appear in the form of complex noun phrases. Complex noun phrases are important for NLP because they denote real-world entities and concepts and are often used for specialized vocabulary such as scientific or legal terms. Virtually every NLP system uses tokenization, the partitioning of textual input into meaningful units, or tokens, as part of preprocessing. Traditionally, tokenization does not deal with MWUs, which leads to early errors that propagate through subsequent NLP tasks and lower the quality of NLP applications. The central idea of this thesis is multi-word tokenization (MWT): MWU-aware tokenization as a preprocessing step for NLP systems. The goal of this thesis is to drive research towards NLP applications that understand unrestricted natural language. Our main contributions cover two aspects of MWT. First, we conducted fundamental research into asymmetric association, the phenomenon that lexical association from one component of an MWU to another can be stronger in one direction than in the other.
This property has not been investigated deeply in the literature. We positioned asymmetric association in the broader context of different types of word association and collected human syntagmatic associations using a novel experimental setup. We measured asymmetric association in human syntagmatic production and showed that it is indicative of MWUs. Furthermore, we created corpus-based asymmetric association measures and showed that asymmetry in word combinations can be predicted automatically with high accuracy using these measures. Second, we present an implementation of MWT in which we cast MWU recognition as a classification problem. We built an MWU classifier whose features address properties of MWUs. In particular, we targeted semantic non-compositionality, a phenomenon of unpredictable meaning shifts that occurs in many MWUs. To detect meaning shifts, we used features of contextual similarity based on distributional semantics. We found that context features significantly improve MWU classification accuracy but that some aspects of how such features work are unreliable. Additionally, we integrated MWT into an information retrieval system and showed that incorporating MWU information improves retrieval performance.
Abstract (translated from German):	Sophisticated applications of natural language processing (NLP) are entering everyday life in the form of automatic translation systems, general question answering systems and electronic personal assistants. As voice-operated applications become established, users increasingly expect to operate them in unrestricted natural language, that is, to converse with them quite normally.
One obstacle that makes it hard for computers to understand unrestricted natural language is collocations, combinations of several words with special properties, for example toller Hecht (literally "great pike", said of an impressive fellow), den Löffel abgeben (literally "to hand in the spoon", meaning to die) or wo gehobelt wird, da fallen Späne (literally "where wood is planed, shavings fall"). The automatic processing of collocations is a nontrivial problem because their special properties cannot be predicted from the properties of their components. This thesis deals with multi-word units (MWUs), collocations that occur as complex noun phrases. Complex noun phrases are of particular importance for NLP because they denote objects and concepts of the real world and frequently occur in technical vocabulary, for example in scientific or legal terms. Almost every NLP system relies on the preprocessing step of tokenization, the partitioning of textual data into meaning-bearing units called tokens. Tokenization usually does not handle multi-word units, which leads to early errors, error propagation and poorer quality in NLP applications. In this thesis we propose multi-word tokenization (MWT), tokenization that recognizes multi-word units. The goal of our work is to advance research that enables applications to understand unrestricted natural language. The main contributions cover two areas that are relevant to MWT. First, we present fundamental research on asymmetric association, the phenomenon that lexical association between the components of MWUs can differ in strength depending on direction. This property has not yet been treated in depth in the literature.
On the one hand, we situate asymmetric association in a broader context of different types of word association; on the other hand, we measured human syntagmatic associations in an experiment newly developed for this purpose. We show that asymmetric association is an indicator that a phrase is an MWU. In addition, we developed corpus-based association measures and showed that asymmetry in word pairs can be predicted automatically and with high accuracy. Second, we present an MWT implementation in which MWU recognition is defined as a classification problem. To this end, we developed a classifier whose features are tailored to MWU properties. In particular, we target non-compositionality, the phenomenon of unpredictable meaning shifts that occurs in many MWUs. To detect meaning shifts, we use features of contextual similarity that build on distributional semantics. We show that these features substantially improve MWU classification, although aspects of how they work are unreliable. Furthermore, we integrated MWT into an information retrieval system and showed that including MWU information improves the system's performance.
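As a rough illustration of two ideas named in the abstract, the sketch below computes a directional association score for a word pair and merges known MWUs into single tokens. It is not the method of the thesis: the abstract does not specify its association measures, so plain conditional probabilities P(w2 | w1) and P(w1 | w2) from bigram counts are assumed here, and the thesis recognizes MWUs with a classifier, whereas this sketch assumes a given MWU lexicon. The function names `directional_association` and `merge_mwus` and the toy corpus are hypothetical.

```python
from collections import Counter


def directional_association(tokens, w1, w2):
    """Return (P(w2 | w1), P(w1 | w2)) for the adjacent pair 'w1 w2'.

    Assumption: conditional probabilities estimated from raw counts stand in
    for the (unspecified) asymmetric association measures of the thesis.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair = bigrams[(w1, w2)]
    forward = pair / unigrams[w1] if unigrams[w1] else 0.0
    backward = pair / unigrams[w2] if unigrams[w2] else 0.0
    return forward, backward


def merge_mwus(tokens, mwus):
    """Greedy longest-match merge of known MWUs into single tokens.

    Assumption: `mwus` is a set of word tuples supplied by the caller.
    """
    max_len = max((len(m) for m in mwus), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwus:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No MWU starts here: keep the single word as its own token.
            out.append(tokens[i])
            i += 1
    return out


corpus = "red tape slows the red car and the red door".split()
forward, backward = directional_association(corpus, "red", "tape")
merged = merge_mwus(corpus, {("red", "tape")})
```

In this toy corpus "tape" is always preceded by "red", while "red" is followed by "tape" only one time in three, so the backward score exceeds the forward score. A real setup would need a large corpus and smoothing before such a difference says anything about MWU status.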
Appears in collections:	05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Files in this resource:

File	Description	Size	Format
dissertation_michelbacher_druckversion.pdf		973.8 kB	Adobe PDF

All resources in this repository are protected by copyright.

Universität Stuttgart

OPUS - Online Publikationen der Universität Stuttgart