Nedelchev, Rostislav: Automatic Evaluation of Dialogue Systems Using Neural Network Methods. - Bonn, 2023. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-70983
@phdthesis{handle:20.500.11811/10873,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-70983},
author = {Rostislav Nedelchev},
title = {Automatic Evaluation of Dialogue Systems Using Neural Network Methods},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2023,
month = jun,

note = {We usually interact with computers through specialized tools that are far less universal than the language humans use. This has motivated researchers for several decades to develop algorithms that allow computer systems to be operated using natural language, an effort that has become especially prominent with the recent rise of voice assistants such as Apple Siri and Amazon Alexa. However, the research and development of such systems is expensive in terms of human labor. These costs are especially pronounced in evaluation, since dialogue systems are very often assessed by human annotators as a final stage, after development has already consumed considerable resources.
The focus of this thesis is to support the assessment of dialogue systems by creating automatic tools that assist humans. Human conversations involve many intricacies that make it difficult to develop an algorithm that could evaluate them reliably yet informatively. To put the challenge into context, consider the Turing test, an examination method in artificial intelligence (AI) for ascertaining whether a computer is capable of thinking like a human being. One of its key components is the ability to decide whether a conversation is natural. There are various criteria according to which a dialogue can be evaluated, and hence various problems it can suffer from. In this work, we aim to detect these problems.
In order to emulate human-like intelligence, we build on techniques from Natural Language Processing (NLP), machine learning, and deep learning (ML, DL). Since our goal is to reduce human effort in the evaluation of dialogues, we focus on methods that achieve this without the need for additional annotated data:
1. We apply approaches from various problem domains. The thesis makes use of out-of-distribution (OoD) and anomaly detection approaches to treat low-quality or problematic dialogue utterances as "unusual."
2. Although they have been researched for decades, Language Models (LMs) became popular only in recent years. In our work, we show that they, too, can be used to evaluate dialogue quality (a minimal sketch of this idea follows the record below).
3. Natural Language Processing as a field aims to teach computers various human-like language skills, e.g., recognizing whether two sentences are similar in meaning or whether a piece of text carries positive or negative sentiment. We show that these skills can serve as indirect indicators of conversation quality.
4. In addition, we show that dialogue systems can be evaluated not by means of a reference, but by "opinion." In other words, instead of asking a system to generate a solution to a problem, one can ask it to evaluate a reference solution and, based on that, develop an understanding of the system's abilities.
None of the approaches proposed in this thesis require supervision for dialogue evaluation. They deliver insights from various perspectives that could complement each other in an overall framework for assessing conversation quality.},

url = {https://hdl.handle.net/20.500.11811/10873}
}
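To make the second contribution above concrete, the following is a minimal sketch of how a pretrained language model can score dialogue quality without annotated data: the candidate response is rated by its perplexity under the model, conditioned on the dialogue context. This is an illustration of the general idea, not the method developed in the thesis; the choice of GPT-2 via the Hugging Face transformers library and the context-masking scheme are assumptions made for the example.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def response_perplexity(context: str, response: str) -> float:
    # Concatenate the dialogue context and the candidate response.
    ctx = tokenizer(context, return_tensors="pt").input_ids
    resp = tokenizer(" " + response, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx, resp], dim=1)
    # Mask the context tokens with -100 so the cross-entropy loss
    # is computed only over the response tokens.
    labels = input_ids.clone()
    labels[:, : ctx.size(1)] = -100
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss
    return torch.exp(loss).item()  # lower = more plausible response

# A coherent reply should score lower (better) than an incoherent one.
ctx = "How was your weekend?"
print(response_perplexity(ctx, "It was great, I went hiking."))
print(response_perplexity(ctx, "Purple seventeen dishwasher."))

Such a perplexity score requires no human labels, which is why LM-based signals fit the unsupervised evaluation setting described in the abstract; absolute values depend heavily on the model and tokenization, so scores are best used to compare responses rather than as standalone quality judgments.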

The following license files are associated with this item:

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)