Publication details

Building Corpora for Stylometric Research

Authors

ŠVEC Ján, RYGL Jan

Year of publication 2016
Type Article in Proceedings
Conference Text, Speech, and Dialogue - 19th International Conference
MU Faculty or unit

Faculty of Informatics

Citation
Doi http://dx.doi.org/10.1007/978-3-319-45510-5_3
Field Informatics
Keywords corpus; stylometry; authorship; crawler
Description Authorship recognition, machine translation detection, pedophile identification, and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, on the other hand, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm to build corpora containing the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We modify data-cleaning techniques for the purposes of stylometry and add a heuristic layer to detect and extract valuable meta-information. The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones.
