gms | German Medical Science

64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie

08. - 11.09.2019, Dortmund

Don’t lose sight of the forest for the trees: Building a German medical text corpus for word embeddings using webcrawling

Meeting Abstract

  • Johanna Fiebeck - Medizinische Hochschule Hannover, Zentrum für Informationsmanagement, Hannover, Germany
  • Hinrich Boy Winther - Institut für Diagnostische und Interventionelle Radiologie, Imaging Unit im Clinical Research Center, Hannover, Germany
  • Frank Wacker - Institut für Diagnostische und Interventionelle Radiologie, Imaging Unit im Clinical Research Center, Hannover, Germany
  • Svetlana Gerbel - Medizinische Hochschule Hannover, Zentrum für Informationsmanagement, Hannover, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 64. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e.V. (GMDS). Dortmund, 08.-11.09.2019. Düsseldorf: German Medical Science GMS Publishing House; 2019. DocAbstr. 246

doi: 10.3205/19gmds101, urn:nbn:de:0183-19gmds1012

Published: September 6, 2019

© 2019 Fiebeck et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Semantic analysis of texts has grown more popular in recent years. Unsurprisingly, more scientists are concentrating on medical texts for information extraction or data structuring tasks, since most medical information is recorded as text in electronic health records (EHRs). In Germany, several approaches have been developed recently, most of which are either rule-based or depend on internal text corpora. Since there is no publicly available generic text corpus for medical information in Germany, it is necessary to develop one's own embedding models and corpora [1], [2].

Goal: We aimed to create a medical text corpus that could be used for word embedding modelling and semantic text analysis in the healthcare context. The corpus should be generated as automatically as possible and should contain common medical terms and their definitions. The idea was to build a self-generating dictionary of synonyms for the radiology department that supports searching for terms in radiological EHRs, specifically the so-called Weber fracture [3].

Methods:

  • Webscraping: We built a Python-based webcrawler and created dictionaries from common German web lexica for medical terms. The lexica used were https://flexikon.doccheck.com/de/Spezial:Mainpage, https://www.med-kolleg.de/medizin-lexikon and https://www.wikipedia.org/, concentrating on a few fixed categories. The webcrawler searched for lexicon-specific tags within the HTML content. The tags found were stored in a crawling list, from which the URL of the next webpage to scrape was generated, and so on.
  • Article clearing process: Scraped articles were stored in a Python dictionary. After scraping was completed, articles with no or invalid content were filtered out.
  • Model training with Word2vec: Every lexicon was processed into a distinct corpus. Corpus pre-processing was carried out with the built-in functions of the Python package gensim. The word embedding models were trained with Word2Vec using several training parameter settings. The quality of the word embeddings was tested with most-similar requests for predefined terms and their combinations, e.g. “Radiologie”, “Fraktur”, “Sprunggelenk”.
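
The crawling and clearing steps above can be illustrated with a minimal sketch. It is not the authors' crawler: the class and function names, the `lexikon.example` host, the in-memory `PAGES` stand-in for HTTP requests and the `min_words` threshold are all assumptions made for illustration, and plain `<a>` links take the place of the lexicon-specific tags mentioned above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ArticleParser(HTMLParser):
    """Collects text fragments and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: plain <a href> links stand in for lexicon-specific tags.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML or None (hypothetical hook)."""
    crawling_list = deque([start_url])  # URLs still to be scraped
    seen = {start_url}
    articles = {}                       # url -> extracted text
    host = urlparse(start_url).netloc
    while crawling_list and len(articles) < max_pages:
        url = crawling_list.popleft()
        html = fetch(url)
        if html is None:
            articles[url] = ""          # kept for now, removed when clearing
            continue
        parser = ArticleParser()
        parser.feed(html)
        articles[url] = " ".join(parser.text_parts)
        for href in parser.links:
            next_url = urljoin(url, href)
            # Stay on the same lexicon and never queue a URL twice.
            if urlparse(next_url).netloc == host and next_url not in seen:
                seen.add(next_url)
                crawling_list.append(next_url)
    return articles


def clear_articles(articles, min_words=5):
    """Drop articles with no content or too little text (threshold is assumed)."""
    return {url: text for url, text in articles.items()
            if len(text.split()) >= min_words}


# Toy pages replace real HTTP requests so the sketch runs offline.
PAGES = {
    "https://lexikon.example/Fraktur":
        '<p>Eine Fraktur ist eine Unterbrechung der Kontinuität eines Knochens.</p>'
        '<a href="/Ruptur">Ruptur</a><a href="/Leer">Leer</a>',
    "https://lexikon.example/Ruptur":
        '<p>Eine Ruptur ist ein Riss in einem Gewebe oder Organ.</p>'
        '<a href="/Fraktur">Fraktur</a>',
    "https://lexikon.example/Leer": "<p></p>",
}

articles = crawl("https://lexikon.example/Fraktur", PAGES.get)
cleared = clear_articles(articles)
```

In a real run, `fetch` would wrap an HTTP client and the parser would be adapted to each lexicon's markup; the `seen` set is what keeps the crawling list from revisiting pages that link to each other.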

Results: Overall, 46,304 articles were scraped, of which 1,641 Wikipedia articles were filtered out afterwards. The vocabularies of the resulting corpora contained about 8 million terms each.

When comparing the models and the corpora, the results differed considerably both between the trained models and between the corpora. For example, “radiologie” was strongly associated with “nuklearmedizin” to some degree in all dictionaries, but “fraktur” was associated with “ruptur” only once. Other associations varied between the models, such as “proximale” or “kompression” or, more interestingly, “duverney” and “maisonneuve”, which are the names of well-known fractures.

Discussion and future prospects: The introduced approach for building a German medical text corpus for word embeddings is promising and supplements text mining approaches in EHRs. Nevertheless, the processing and model training of the corpora need to be improved. The quality of the word embeddings might be improved by merging all three corpora into a single one and by adapting the pre-processing pipeline, for example stemming, to the German language, since the gensim function is built only for English texts. Additionally, scraping Wikipedia articles with fixed categories is not feasible for this kind of task, since the Wikipedia category structure is quite complicated to handle automatically.
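
One possible shape of the proposed improvements is sketched below: a tokenizer that keeps German umlauts and "ß", a helper that merges several lexicon dictionaries into one training corpus, and a thin wrapper around gensim's `Word2Vec`. The helper names, the toy lexicon entries and the default training parameters are assumptions for illustration only (the abstract does not report the parameters used), the wrapper assumes the gensim 4.x API, and German stemming is deliberately left out.

```python
import re

# Runs of letters, including German umlauts and eszett.
TOKEN_RE = re.compile(r"[a-zäöüß]+")


def tokenize_german(text, min_len=2):
    """Lowercase the text and split it into alphabetic tokens."""
    return [tok for tok in TOKEN_RE.findall(text.lower()) if len(tok) >= min_len]


def merge_corpora(*corpora):
    """Merge several {title: text} dictionaries into one list of token lists."""
    sentences = []
    for corpus in corpora:
        for text in corpus.values():
            tokens = tokenize_german(text)
            if tokens:  # skip articles that yield no tokens
                sentences.append(tokens)
    return sentences


def train_embeddings(sentences, **params):
    """Train Word2Vec on token lists (requires gensim; 4.x API assumed)."""
    # Imported lazily so the helpers above work without gensim installed.
    from gensim.models import Word2Vec

    # Illustrative defaults, not the settings used in the abstract.
    settings = dict(vector_size=100, window=5, min_count=1, sg=1,
                    epochs=20, workers=1)
    settings.update(params)
    return Word2Vec(sentences=sentences, **settings)


# Toy entries standing in for two scraped lexica (invented for this sketch).
lexicon_a = {"Fraktur": "Eine Fraktur ist ein Knochenbruch."}
lexicon_b = {
    "Fraktur": "Bruch eines Knochens.",
    "Sprunggelenk": "Gelenk zwischen Unterschenkel und Fuß.",
}

sentences = merge_corpora(lexicon_a, lexicon_b)
```

With gensim installed, `model = train_embeddings(sentences)` followed by `model.wv.most_similar(positive=["fraktur"], topn=5)` would issue a most-similar request of the kind described in the Methods; on a toy corpus of this size the neighbours carry no meaning.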

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1. Krebs J, Krug M, Fette G, Dietrich G, Ertl M, Güder G, Puppe F, Kaspar M. Identifying Heart Failure Patients by Medical Text Classification. Stud Health Technol Inform. 2019;258:251. DOI: 10.3233/978-1-61499-959-1-251
2. Hahn U, Matthies F, Lohr C, Löffler M. 3000PA-Towards a National Reference Corpus of German Clinical Language. Stud Health Technol Inform. 2018;247:26–30.
3. Fiebeck J, Laser H, Winther HB, Gerbel S. Leaving no stone unturned: Using machine learning based approaches for information extraction from full texts of a research data warehouse. In: Auer S, Vidal ME, editors. Data integration in the life sciences. Cham: Springer; 2019. p. 50–58. (Lecture Notes in Computer Science; 11371). DOI: 10.1007/978-3-030-06016-9_5