Named entity extraction tools designed for recognizing named entities in texts written in standard language (e.g., news stories or legal texts) have been shown to be inadequate for user-generated textual content (e.g., tweets, forum posts). In this work, we propose a supervised approach to named entity recognition and classification for Croatian tweets. We compare two sequence labelling models: a hidden Markov model (HMM) and conditional random fields (CRF). Our experiments reveal that CRF is the best model for the task, achieving a very good performance of over 87% micro-averaged F1 score. We analyse the contributions of different feature groups and influence of the training set size on the performance of the CRF model.
Zusätzliche Informationen:
Online-Ressource
Das Dokument wird vom Publikationsserver der Universitätsbibliothek Mannheim bereitgestellt.
Dieser Datensatz wurde nicht während einer Tätigkeit an der Universität Mannheim veröffentlicht, dies ist eine Externe Publikation.