Publication details
Web corpora for under-resourced languages
| Authors | |
|---|---|
| Year of publication | 2023 |
| Type | Article in Proceedings |
| Conference | Corpus Linguistics (CL2023), 2023 |
| MU Faculty or unit | |
| Citation | |
| web | The Twelfth International Corpus Linguistics Conference 2023 - Book of Abstracts |
| Keywords | under-resourced languages; web corpora; lexicography |
| Description | Abstract. Text corpora provide essential information for studying natural language phenomena, language learning, lexicography, and for building language models. In the case of under-resourced languages, one has to make an extra effort to find as many texts as possible to build corpora large enough for the intended purposes. In this paper, we share our experience of compiling corpora from the web in 25 under-resourced languages. We discuss the challenges we faced in the process and the opportunities of providing resources in languages for which demand has recently increased. Since 2017, we have crawled top-level domains of countries where under-resourced languages are spoken. To compile a corpus, text was extracted from web pages, then boilerplate, duplicates and non-text were removed. As for European languages, we have e.g. over 140 million Irish tokens, 500 million Albanian tokens and 90 million Maltese tokens. Another set of corpora in Asian languages was built for a lexicography project to deliver 45,000-headword dictionaries based on 120 million Lao tokens and 230 million Tagalog tokens (and other collections). In that work, we found how many tokens are enough for a mid-sized bilingual dictionary. We have also experimented with African languages native to Ethiopia and Nigeria, getting e.g. over 30 million Amharic tokens. Although we have to deal with issues of languages less represented on the web, such as a low genre and topic variety of sources and the fact that smartphones arguably lead to communication through visual media rather than text in some countries, building corpora in under-resourced languages provides many opportunities: There is no language research without a corpus. Language data is necessary for smartphone applications such as predictive writing. Finally, a modern dictionary enables the preservation of endangered languages and the standardisation of languages that are spoken but little used in written communication. |
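
The abstract describes the cleaning stage only in outline (extract text, then remove boilerplate, duplicates and non-text). As a rough illustration of the last two steps, the following Python sketch filters paragraphs that do not look like running text and drops exact duplicates. It is not the authors' pipeline: the function names, the thresholds (`min_words`, `min_alpha_ratio`) and the hash-based duplicate check are assumptions made for this example, and real boilerplate removal needs page-structure information that is left out here.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def looks_like_text(paragraph: str, min_words: int = 5,
                    min_alpha_ratio: float = 0.6) -> bool:
    """Crude non-text filter (hypothetical thresholds, not from the paper)."""
    if len(paragraph.split()) < min_words:
        return False  # very short fragments are often menus or labels
    alpha = sum(ch.isalpha() for ch in paragraph)
    non_space = sum(not ch.isspace() for ch in paragraph)
    return non_space > 0 and alpha / non_space >= min_alpha_ratio


def clean_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph that looks like text."""
    seen = set()
    kept = []
    for p in paragraphs:
        if not looks_like_text(p):
            continue
        digest = hashlib.sha1(normalize(p).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(p)
    return kept


sample = [
    "Home | About",
    "Tá an aimsir go hálainn inniu i mBaile Átha Cliath.",
    "Tá an  aimsir go hálainn inniu i mBaile Átha Cliath.",
    "12345 67890 11111 22222 33333 44444",
]
cleaned = clean_paragraphs(sample)
```

Exact hashing only catches identical paragraphs; near-duplicates (for example, pages that differ in a date or a counter) would need a fuzzier comparison, which this sketch does not attempt.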