Cleaning the Europarl Corpus for Linguistic Applications
Abstract
We discovered several recurring errors in the current version of the Europarl Corpus originating both from the web site of the European Parliament and the corpus compilation based thereon. The most frequent error was incompletely extracted metadata leaving non-textual fragments within the textual parts of the corpus files. This is, on average, the case for every second speaker change. We not only cleaned the Europarl Corpus by correcting several kinds of errors, but also aligned the speakers’ contributions of all available languages and compiled everything into a new XML-structured corpus. This facilitates a more sophisticated selection of data, e.g. querying the corpus for speeches by speakers of a particular political group or in particular language combinations.
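As a rough illustration of the kind of selection the abstract mentions, the following sketch filters speeches by political group and by available language combination. The element and attribute names (`speech`, `text`, `group`, `lang`) and the sample data are assumptions made for this example only; they are not the schema of the cleaned corpus, which is documented in the paper itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: element names, attribute names, and contents are
# illustrative only and do not reproduce the actual corpus schema.
SAMPLE = """
<corpus>
  <speech id="1" speaker="Speaker A" group="PSE">
    <text lang="en">First contribution.</text>
    <text lang="de">Erster Beitrag.</text>
  </speech>
  <speech id="2" speaker="Speaker B" group="PPE-DE">
    <text lang="en">Second contribution.</text>
  </speech>
  <speech id="3" speaker="Speaker C" group="PSE">
    <text lang="en">Third contribution.</text>
    <text lang="fr">Autre contribution.</text>
  </speech>
</corpus>
"""


def speeches_by_group(root, group, languages=()):
    """Return ids of speeches by `group` that have text in every requested language."""
    hits = []
    for speech in root.iter("speech"):
        if speech.get("group") != group:
            continue
        # Collect the languages in which this speech is available.
        available = {t.get("lang") for t in speech.findall("text")}
        if set(languages) <= available:
            hits.append(speech.get("id"))
    return hits


root = ET.fromstring(SAMPLE)
print(speeches_by_group(root, "PSE"))
print(speeches_by_group(root, "PSE", languages=("en", "de")))
```

The first call selects by political group alone; the second additionally requires that the speech be available in both English and German, mirroring a query for a particular language combination.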
Publication type
ConferencePaper
Authors
Graën, Johannes
Batinic, Dolores
Volk, Martin
Publication date
2014
Department
Institute / Institution
Published in
Proceedings of the 12th edition of the KONVENS conference
First page
222
Last page
227
URN
urn:nbn:de:gbv:hil2-opus-2857
HilPub Permalink
Files: p040.pdf (157.17 KB)
Main Conference Proceedings of the 12th Konvens 2014