Web corpora for under-resourced languages

Suchomel, Vít; Jakubíček, Miloš; Matuška, Ondřej

Publication details

Authors	SUCHOMEL Vít JAKUBÍČEK Miloš MATUŠKA Ondřej
Year of publication	2023
Type	Article in Proceedings
Conference	Corpus Linguistics (CL2023), 2023
MU Faculty or unit	Faculty of Informatics
Citation
web	The Twelfth International Corpus Linguistics Conference 2023 - Book of Abstracts
Keywords	under-resourced languages; web corpora; lexicography
Description	Abstract. Text corpora provide essential information for studying natural language phenomena, for language learning and lexicography, and for building language models. For under-resourced languages, extra effort is needed to find as many texts as possible so that the resulting corpora are large enough for their intended purposes. In this paper, we share our experience with compiling corpora from the web in 25 under-resourced languages. We discuss the challenges we faced in the process and the opportunities of providing resources in languages for which demand has recently increased. Since 2017, we have crawled the top-level domains of countries where under-resourced languages are spoken. To compile a corpus, text was extracted from web pages, and then boilerplate, duplicates and non-text were removed. For European languages, we have, for example, over 140 million Irish tokens, 500 million Albanian tokens and 90 million Maltese tokens. Another set of corpora in Asian languages was built for a lexicography project to deliver 45,000-headword dictionaries based on 120 million Lao tokens and 230 million Tagalog tokens (and other collections). In that work, we found how many tokens are enough for a mid-sized bilingual dictionary. We have also experimented with African languages native to Ethiopia and Nigeria, obtaining, for example, over 30 million Amharic tokens. Languages less represented on the web bring issues such as a low genre and topic variety of sources, and smartphones arguably lead to communication through visual media rather than text in some countries. Nevertheless, building corpora in under-resourced languages provides many opportunities. There is no language research without a corpus. Language data is necessary for smartphone applications such as predictive writing. Finally, a modern dictionary enables the preservation of endangered languages and the standardisation of languages that are spoken but little used for written communication.
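
The cleaning steps named in the abstract (removing boilerplate, duplicates and non-text from extracted page text) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the thresholds `MIN_WORDS` and `MIN_ALPHA_RATIO`, the simple heuristics and the whitespace token count are all assumptions made for illustration, and whitespace tokenization in particular would not suit scripts written without spaces between words.

```python
import hashlib
import re

# Both thresholds are illustrative assumptions, not values from the paper.
MIN_WORDS = 5          # paragraphs with fewer words are treated as boilerplate
MIN_ALPHA_RATIO = 0.6  # paragraphs with a lower share of letters count as non-text


def normalize(paragraph: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def is_boilerplate(paragraph: str) -> bool:
    """Crude stand-in for a boilerplate classifier: flag very short paragraphs."""
    return len(paragraph.split()) < MIN_WORDS


def is_non_text(paragraph: str) -> bool:
    """Flag paragraphs made mostly of digits, symbols or punctuation."""
    chars = [c for c in paragraph if not c.isspace()]
    if not chars:
        return True
    alpha = sum(c.isalpha() for c in chars)
    return alpha / len(chars) < MIN_ALPHA_RATIO


def clean_paragraphs(paragraphs):
    """Drop boilerplate, non-text and exact duplicates.

    Returns the kept paragraphs and a whitespace-based token count.
    """
    seen = set()
    kept = []
    for p in paragraphs:
        if is_boilerplate(p) or is_non_text(p):
            continue
        digest = hashlib.sha1(normalize(p).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(p)
    tokens = sum(len(p.split()) for p in kept)
    return kept, tokens
```

Calling `clean_paragraphs` on the paragraphs extracted from a page returns the surviving paragraphs together with their token count. A production pipeline would more likely rely on a trained boilerplate classifier, language identification and near-duplicate detection than on the exact-hash and length heuristics assumed here.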
