Slavonic Corpus for Stylometry Research

Publication information

Authors	ŠVEC Ján, RYGL Jan
Year of publication	2015
Type	Article in proceedings
Conference	Proceedings of Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2015.
MU Faculty or unit	Faculty of Informatics
Field	Informatics
Keywords	stylometry; slavonic corpus; web structure detection; corpora building
Description	Stylometry techniques such as authorship recognition, machine translation detection and pedophile identification are used daily in applications for the most widely used languages, but under-represented languages lack data sources usable for stylometry research. In this paper, we propose an algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt crawling and data-cleaning techniques to the needs of stylometry and add a heuristic layer to detect and extract meta-information. The system was used on Czech and Slovak web domains to build a Slavonic corpus for stylometry research. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.
Related projects:	LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure
