Publication details
Building Corpora for Stylometric Research
Authors | |
---|---|
Year of publication | 2016 |
Type | Article in Proceedings |
Conference | Text, Speech, and Dialogue - 19th International Conference |
MU Faculty or unit | |
Citation | |
DOI | http://dx.doi.org/10.1007/978-3-319-45510-5_3 |
Field | Informatics |
Keywords | corpus; stylometry; authorship; crawler |
Description | Authorship recognition, machine translation detection, pedophile identification, and other stylometry techniques are used daily in applications for the most widely used languages. Under-represented languages, on the other hand, lack data sources usable for stylometry research. In this paper, we propose a novel algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts valuable meta-information. The system was evaluated on Czech and Slovak web domains. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones. |