Web table integration and profiling for knowledge base augmentation


Lehmberg, Oliver


[img]
Vorschau
PDF
thesis.pdf - Veröffentlichte Version

Download (23MB)

URL: https://madoc.bib.uni-mannheim.de/52346
URN: urn:nbn:de:bsz:180-madoc-523468
Dokumenttyp: Dissertation
Erscheinungsjahr: 2019
Ort der Veröffentlichung: Mannheim
Hochschule: Universität Mannheim
Gutachter: Bizer, Christian
Datum der mündl. Prüfung: 26 September 2019
Sprache der Veröffentlichung: Englisch
Einrichtung: Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Information Systems V: Web-based Systems (Bizer 2012-)
Fachgebiet: 004 Informatik
Freie Schlagwörter (Englisch): web tables , data integration , knowledge base augmentation , data profiling
Abstract: HTML tables on web pages ("web tables") have been used successfully as a data source for several applications. They can be extracted from web pages on a large-scale, resulting in corpora of millions of web tables. But, until today only little is known about the general distribution of topics and specific types of data that are contained in the tables that can be found on the Web. But this knowledge is essential to understanding the potential application areas and topical coverage of web tables as a data source. Such knowledge can be obtained through the integration of web tables with a knowledge base, which enables the semantic interpretation of their content and allows for their topical profiling. In turn, the knowledge base can be augmented by adding new statements from the web tables. This is challenging, because the data volume and variety are much larger than in traditional data integration scenarios, in which only a small number of data sources is integrated. The contributions of this thesis are methods for the integration of web tables with a knowledge base and the profiling of large-scale web table corpora through the application of these methods. For this profiling, two corpora of 147 million and 233 million web tables, respectively, are created and made publicly available. These corpora are two of only three that are openly available for research on web tables. Their data profile reveals that most web tables have only very few rows, with a median of 6 rows per web table, and between 35% and 52% of all columns contain non-textual values, such as numbers or dates. These two characteristics have been mostly ignored in the literature about web tables and are addressed by the methods presented in this thesis. The first method, T2K Match, is an algorithm for semantic table interpretation that annotates web tables with classes, properties, and entities from a knowledge base. Other than most algorithms for these tasks, it is not limited to the annotation of columns that contain the names of entities. Its application to a large-scale web table corpus results in the most fine-grained topical data profile of web tables at the time of writing, but also reveals that small web tables cannot be processed with high quality. For such small web tables, a method that stitches them into larger tables is presented and shown to drastically improve the quality of the results. The data profile further shows that the majority of the columns in the web tables, where classes and entities can be recognised, have no corresponding properties in the knowledge base. This makes them candidates for new properties that can be added to the knowledge base. The current methods for this task, however, suffer from the oversimplified assumption that web tables only contain binary relations. This results in the extraction of incomplete relations from the web tables as new properties and makes their correct interpretation impossible. To increase the completeness, a method is presented that generates additional data from the context of the web tables and synthesizes n-ary relations from all web tables of a web site. The application of this method to the second large-scale web table corpus shows that web tables contain a large number of n-ary relations. This means that the data contained in web tables is of higher complexity than previously assumed.




Dieser Eintrag ist Teil der Universitätsbibliographie.

Das Dokument wird vom Publikationsserver der Universitätsbibliothek Mannheim bereitgestellt.




Metadaten-Export


Zitation


+ Suche Autoren in

+ Download-Statistik

Downloads im letzten Jahr

Detaillierte Angaben



Sie haben einen Fehler gefunden? Teilen Sie uns Ihren Korrekturwunsch bitte hier mit: E-Mail


Actions (login required)

Eintrag anzeigen Eintrag anzeigen