Publication information
Slavonic Corpus for Stylometry Research
Authors | |
---|---|
Year of publication | 2015 |
Type | Article in proceedings |
Conference | Proceedings of Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2015. |
MU Faculty or unit | |
Citation | |
www | |
Field | Informatics |
Keywords | stylometry; slavonic corpus; web structure detection; corpora building |
Description | Stylometry techniques such as authorship recognition, machine translation detection and pedophile identification are used daily in applications for the most widely used languages. Under-represented languages, however, lack data sources usable for stylometry research. In this paper, we propose an algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt crawling and data-cleaning techniques to the needs of stylometry and add a heuristic layer to detect and extract meta-information. The system was used on Czech and Slovak web domains to build a Slavonic corpus for stylometry research. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones. |
Related projects: |