Publication details
Možnosti a meze korpusového výzkumu proprií
Title in English | Possibilities and limitations of corpus research on proper names |
---|---|
Authors | |
Year of publication | 2024 |
MU Faculty or unit | |
Citation | |
Description | In this presentation we will show the possibilities and limits of research on proper names, based on our experience with Czech language corpora and taking into account the state of morphological tagging used in the Czech environment. We will clarify how the individual steps of automatic morphological analysis affect the lemmatization and tagging of proper names. We will touch upon the problem of tokenization and multi-word proper names, the problem of extending the morphological dictionary with proper names, the peculiarities of the inflection of proper names, and their homonymy with appellatives in relation to tagging and disambiguation. We will point out cases in which it is not appropriate to rely on morphological tagging in research, and we will use concrete examples to show how incorrect morphological tagging distorts research data. We will outline ways to avoid bias in the analyzed data. We will also show, on concrete examples of onomastic research, the possibilities of using different computational tools. We will present the differences in data extraction from the CNK (Czech National Corpus), Sketch Engine, and Aranea corpora, and show how more complex CQL queries can be used for data classification. Finally, we will introduce the lesser-known tagging categories in the Aranea corpora and show their effective use in onomastic data mining. |