Publication details

Web corpora for under-resourced languages

Authors

SUCHOMEL Vít, JAKUBÍČEK Miloš, MATUŠKA Ondřej

Year of publication 2023
Type Article in Proceedings
Conference Corpus Linguistics (CL2023), 2023
MU Faculty or unit

Faculty of Informatics

Citation
The Twelfth International Corpus Linguistics Conference 2023 - Book of Abstracts
Keywords under-resourced languages; web corpora; lexicography
Description Abstract. Text corpora provide essential information for studying natural language phenomena, for language learning and lexicography, and for building language models. In the case of under-resourced languages, one has to make an extra effort to find as many texts as possible in order to build corpora large enough for the intended purposes. In this paper, we share our experience of compiling web corpora in 25 under-resourced languages. We discuss the challenges we faced in the process and the opportunities of providing resources in languages for which demand has recently increased. Since 2017, we have crawled the top-level domains of countries where under-resourced languages are spoken. To compile a corpus, text was extracted from web pages, then boilerplate, duplicates and non-text were removed. For European languages, we have, for example, over 140 million Irish tokens, 500 million Albanian tokens and 90 million Maltese tokens. Another set of corpora in Asian languages was built for a lexicography project to deliver 45,000-headword dictionaries based on 120 million Lao tokens and 230 million Tagalog tokens (and other collections). In that work, we found how many tokens are enough for a mid-sized bilingual dictionary. We have also experimented with African languages native to Ethiopia and Nigeria, obtaining, for example, over 30 million Amharic tokens. We have to deal with issues typical of languages less represented on the web, such as a low genre and topic variety of sources and the fact that smartphones arguably lead to communication through visual media rather than text in some countries. Nevertheless, building corpora in under-resourced languages provides many opportunities: There is no language research without a corpus. Language data is necessary for smartphone applications such as predictive writing. Finally, a modern dictionary enables the preservation of endangered languages and the standardisation of languages that are spoken but not much used for written communication.
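
The cleaning steps named in the abstract (text extraction, then removal of boilerplate, duplicates and non-text) can be illustrated with a minimal Python sketch. This is not the authors' pipeline or tooling, which the abstract does not describe. The function names, the thresholds `min_words` and `min_alpha_ratio`, and the exact-match hashing used for deduplication are all assumptions made for illustration. The word-count heuristic also assumes whitespace-separated words, which does not hold for scripts such as Lao.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text from HTML, skipping <script> and <style> content."""

    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "td"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")  # block boundary becomes a paragraph break

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_paragraphs(html):
    """Step 1: extract text from a page and split it into paragraphs."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    paragraphs = (re.sub(r"\s+", " ", p).strip() for p in text.split("\n"))
    return [p for p in paragraphs if p]


def looks_like_text(paragraph, min_words=5, min_alpha_ratio=0.6):
    """Steps 2 and 3 (hypothetical heuristic): reject short boilerplate
    such as menus, and paragraphs that are mostly digits or symbols."""
    if len(paragraph.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in paragraph)
    return alpha / len(paragraph) >= min_alpha_ratio


def build_corpus(pages):
    """Run all steps over an iterable of HTML strings, dropping
    paragraphs whose lowercased text has already been seen."""
    seen = set()
    corpus = []
    for html in pages:
        for paragraph in extract_paragraphs(html):
            if not looks_like_text(paragraph):
                continue
            digest = hashlib.sha1(paragraph.lower().encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate of an earlier paragraph
            seen.add(digest)
            corpus.append(paragraph)
    return corpus
```

Exact-match hashing only catches verbatim repeats. Near-duplicate detection and language identification, which a real web-corpus pipeline would likely need, are left out of this sketch.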
