Building Corpora for Stylometric Research

Publication details

Authors	ŠVEC Ján, RYGL Jan
Year of publication	2016
Type	Article in Proceedings
Conference	Text, Speech, and Dialogue - 19th International Conference
MU Faculty or unit	Faculty of Informatics
Citation
Doi	http://dx.doi.org/10.1007/978-3-319-45510-5_3
Field	Informatics
Keywords	corpus; stylometry; authorship; crawler
Description	Authorship recognition, machine translation detection, pedophile identification, and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, however, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts valuable meta-information. The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.
