Please use this identifier to cite or link to this item: http://dx.doi.org/10.18419/opus-12052
Author(s): Diestelkämper, Ralf
Titel: Explaining existing and missing results over nested data in big data analytics systems
Issue date: 2021
Document type: Dissertation
Pages: 258
URI: http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-120696
http://elib.uni-stuttgart.de/handle/11682/12069
http://dx.doi.org/10.18419/opus-12052
Abstract: Debugging analytical queries in big data analytics systems, such as Apache Spark, Flink, or Hive, is a tedious process, especially when large datasets with nested data are involved. To ease this debugging process, we present novel approaches for obtaining explanations for existing and missing data in a query result. These explanations describe why data are present in or absent from the result. They build on a formal data and execution model that faithfully captures the execution semantics of big data analytics systems, so that the explanations are practically meaningful.

Our first contribution is a novel, distributed, and scalable algorithm that matches tree patterns on nested data in big data analytics systems. It enables us to precisely address and query nested data values and arbitrary combinations of them, and we leverage it to request explanations for queries over large, nested data. The algorithm matches a pattern onto the data in two steps: it first computes matches on the schema and then applies these matches to the data values (sketched below). It thereby avoids the complex global state that prevents other state-of-the-art algorithms from scaling horizontally to large compute clusters and dataset sizes.

In addition to the tree-pattern matching algorithm, we leverage provenance to find the explanations. Provenance describes the origin and derivation of the result data. To provide explanations for existing data, we introduce the novel structural provenance, which traces structural manipulations, in addition to data dependencies, through the query pipelines. It provides more comprehensive explanations than existing approaches because it distinguishes between accessed and manipulated data at the granularity of individual nested attributes (illustrated below). We define formal capture rules for structural provenance that extend our execution model. Capturing structural provenance according to these rules imposes a high runtime overhead; thus, we contribute the Pebble algorithm, which implements an optimized, lightweight structural provenance that scales to large, nested datasets. Pebble's explanations enable novel use cases beyond debugging, such as finding data-usage patterns or fine-grained auditing.

Furthermore, we contribute a novel approach to query-based explanations for missing data in a query result. Query-based explanations pinpoint the operators in the query that prevent expected data from appearing in the result; such data is called missing data or a missing answer. Our approach is the first to support nested data and to consider operators that modify the schema and structure of the data, such as the nesting or projection operator, as potential causes of missing answers. Additionally, it accounts for mistakenly referenced attributes in the query. Hence, our explanations apply to a wider range of datasets and to novel error scenarios than existing, provenance-based solutions. Our approach extends these solutions with reparameterizations, which describe parameter modifications in query operators. We formally introduce reparameterizations based on our execution model and derive a formal definition of our query-based explanations.
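To make the two-step matching concrete, the following minimal, self-contained Scala sketch (illustrative only; all names are ours, not the dissertation's implementation) matches a pattern against the schema first and only then touches the data values, so each record can be processed independently:

// Step 1 matches the pattern against the schema once, yielding attribute
// paths; step 2 applies those paths independently to every record, so no
// global state is shared across the cluster.
sealed trait Schema
case class Leaf(name: String) extends Schema
case class Node(name: String, children: List[Schema]) extends Schema

// A simplified pattern: a tree of attribute labels (child edges only).
case class Pattern(label: String, children: List[Pattern] = Nil)

// Step 1: root-to-leaf schema paths on which the pattern matches.
def matchSchema(s: Schema, p: Pattern,
                prefix: List[String] = Nil): List[List[String]] =
  (s, p) match {
    case (Leaf(n), Pattern(l, Nil)) if n == l => List(prefix :+ n)
    case (Node(n, cs), Pattern(l, ps)) if n == l =>
      ps.flatMap(cp => cs.flatMap(cc => matchSchema(cc, cp, prefix :+ n)))
    case _ => Nil
  }

// Step 2: apply a schema path to one nested record (records as maps).
type Value = Any
def extract(v: Value, path: List[String]): Option[Value] = path match {
  case Nil => Some(v)
  case k :: rest => v match {
    case m: Map[_, _] =>
      m.asInstanceOf[Map[String, Value]].get(k).flatMap(extract(_, rest))
    case _ => None
  }
}

For example, matching the pattern Pattern("order", List(Pattern("item", List(Pattern("price"))))) against the schema Node("order", List(Node("item", List(Leaf("price"), Leaf("name"))))) yields the single path order.item.price in step 1; step 2 then extracts 9.99 from the record Map("order" -> Map("item" -> Map("price" -> 9.99, "name" -> "pen"))).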
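How structural provenance distinguishes accessed from manipulated attributes can be pictured with per-attribute annotations that accumulate along the pipeline. The following Scala fragment is a loose illustration under our own simplifying assumptions (attributes identified by their paths, operators by string labels); it is not Pebble's actual representation:

// Each nested attribute (identified by its path) records which operators
// merely read its value and which ones restructured it.
case class Annotation(accessed: Set[String] = Set.empty,
                      manipulated: Set[String] = Set.empty)

type Prov = Map[List[String], Annotation]

def recordAccess(p: Prov, path: List[String], op: String): Prov = {
  val a = p.getOrElse(path, Annotation())
  p.updated(path, a.copy(accessed = a.accessed + op))
}

def recordManipulation(p: Prov, path: List[String], op: String): Prov = {
  val a = p.getOrElse(path, Annotation())
  p.updated(path, a.copy(manipulated = a.manipulated + op))
}

// A filter on order.item.price only reads the attribute, whereas a
// subsequent flatten of order.item restructures it:
val prov = recordManipulation(
  recordAccess(Map.empty, List("order", "item", "price"), "filter#1"),
  List("order", "item"), "flatten#2")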
To efficiently compute these explanations over large, nested datasets, we propose a novel heuristic algorithm called Breadcrumb. It applies two unique techniques: (i) it reasons about multiple schema alternatives to account for mistakenly referenced attributes, and (ii) it re-validates each intermediate result to check whether its data can still contribute to the missing answer, which is necessary to provide correct explanations over nested data (a sketch of the re-validation idea follows). We implement the tree-pattern matching algorithm, Pebble, and Breadcrumb in Apache Spark to show that each algorithm scales with increasing dataset sizes. To this end, we run the algorithms on at least two nested real-world workloads of up to 500 GB. We illustrate that tree patterns simplify query pipelines, and we show that Pebble and Breadcrumb provide more comprehensive explanations than other state-of-the-art solutions, which enables novel use cases.
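As a loose illustration of technique (ii), the re-validation idea can be sketched in Scala under heavy simplifications of our own (a linear pipeline over records modeled as maps; all names are hypothetical): after each operator, check whether any record that could still yield the missing answer survives, and blame the first operator after which none does. Technique (i), reasoning over schema alternatives, is omitted for brevity:

type Record = Map[String, Any]
case class Op(name: String, run: List[Record] => List[Record])

// canContribute decides whether a record can still yield the missing answer.
def pinpoint(input: List[Record], ops: List[Op],
             canContribute: Record => Boolean): Option[String] = {
  val init: (List[Record], Option[String]) = (input, None)
  val (_, blamed) = ops.foldLeft(init) { case ((data, found), op) =>
    val out = op.run(data)
    // Re-validation: candidates existed before this operator but none
    // survive it, so this operator is named in the explanation.
    val lostHere = found.isEmpty &&
      data.exists(canContribute) && !out.exists(canContribute)
    (out, if (lostHere) Some(op.name) else found)
  }
  blamed
}

For instance, with the input List(Map("id" -> 1, "price" -> 5.0), Map("id" -> 2, "price" -> 15.0)) and the pipeline List(Op("filter(price > 10)", _.filter(_("price").asInstanceOf[Double] > 10)), Op("project(id)", _.map(r => Map("id" -> r("id"))))), asking why id 1 is missing (canContribute = _.get("id").contains(1)) blames the filter; a reparameterization would then suggest relaxing its threshold.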
Appears in collections: 05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Files in this item:
thesis_ralf_diestelkaemper.pdf (1.9 MB, Adobe PDF)


All items in this repository are protected by copyright.