Addressing Cha(lle)nges in Long-Term Archiving of Large Corpora

  • This paper addresses long-term archival of large corpora. It focuses on three aspects specific to language resources: (1) the removal of resources for legal reasons, (2) the versioning of (unchanged) objects in constantly growing resources, especially where objects can belong to multiple releases as well as to different collections, and (3) the conversion of data to new formats for digital preservation. The paper explains why language resources may have to be changed and why formats may need to be converted. As a solution, it proposes the use of an intermediate proxy object called a signpost. The approach is illustrated using the corpora of the Leibniz Institute for the German Language in Mannheim, namely the German Reference Corpus (DeReKo) and the Archive for Spoken German (AGD).

Metadata
Author:Denis Arnold, Bernhard Fisseni, Paweł Kamocki, Oliver Schonefeld, Marc Kupietz, Thomas Schmidt
URN:urn:nbn:de:bsz:mh39-98129
URL:http://corpora.ids-mannheim.de/cmlc-2020.html
ISBN:979-10-95546-61-0
Parent Title (English):Proceedings of the LREC 2020 Workshop, Language Resources and Evaluation Conference, 11–16 May 2020, 8th Workshop on Challenges in the Management of Large Corpora (CMLC-8)
Publisher:European Language Resources Association
Place of publication:Paris
Editor:Piotr Bański, Adrien Barbaresi, Simon Clematide, Marc Kupietz, Harald Lüngen, Ines Pisetta
Document Type:Conference Proceeding
Language:English
Year of first Publication:2020
Date of Publication (online):2020/05/12
Publicationstate:Published version
Reviewstate:Peer-Review
Tag:format migration; legal issues; long-term archival; metadata
GND Keyword:Dateiformat; Korpus <Linguistik>; Langzeitarchivierung; Nutzungsrecht
First Page:1
Last Page:9
DDC classes:400 Language / 400 Language, Linguistics
Open Access?:yes
Leibniz-Classification:Language, Linguistics
Linguistics-Classification:Computational linguistics
Linguistics-Classification:Corpus linguistics
Program areas:S2: Research Coordination and Infrastructures
Licence (English):Creative Commons - Attribution-NonCommercial 4.0 International