Better Low-Resource Machine Translation with Smaller Vocabularies

Signoroni, Edoardo; Rychlý, Pavel

Publication details

Authors	SIGNORONI Edoardo; RYCHLÝ Pavel
Year of publication	2024
Type	Article in Proceedings
Conference	Text, Speech, and Dialogue
MU Faculty or unit	Faculty of Informatics
Citation
web	https://link.springer.com/chapter/10.1007/978-3-031-70563-2_15
Doi	http://dx.doi.org/10.1007/978-3-031-70563-2_15
Keywords	Low-resource; Neural Machine Translation; Tokenization
Attached files	tsd1253.pdf
Description	Data scarcity is still a major challenge in machine translation. The performance of state-of-the-art deep learning architectures, such as the Transformer, on under-resourced languages is well below their performance on high-resource languages. This precludes access to information for millions of speakers across the globe. Previous research has shown that the Transformer is highly sensitive to hyperparameters in low-resource conditions. One such hyperparameter is the size of the model's subword vocabulary. In this paper, we show that using smaller vocabularies, as small as 1k tokens, instead of the default value of 32k, is preferable in a diverse array of low-resource conditions. We experiment with different vocabulary sizes on English-Akkadian, Lower Sorbian-German, and English-Manipuri to obtain models that are faster to train, smaller, and better performing than the default setting. These models achieve improvements of up to 322% in ChrF score, while being up to 66% smaller and up to 17% faster to train.
Related projects:	LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities; Using artificial intelligence techniques for data processing, complex analysis and visualization of large-scale data
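
The description's claim that smaller vocabularies give smaller models can be illustrated with a rough parameter count. The sketch below is not taken from the paper: it assumes a single embedding table of shape vocabulary size by model dimension, a hypothetical model dimension of 512, and reads "32k" and "1k" as 32,000 and 1,000 tokens. It counts only embedding parameters, so the percentage it prints is not comparable to the 66% whole-model reduction reported above.

```python
# Rough illustration only: how vocabulary size drives embedding-table size.
# Assumptions (not from the paper): d_model = 512, one embedding matrix,
# "32k" = 32_000 tokens and "1k" = 1_000 tokens.


def embedding_params(vocab_size: int, d_model: int = 512) -> int:
    """Number of parameters in one vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


def relative_reduction(baseline: int, reduced: int) -> float:
    """Percentage by which `reduced` is smaller than `baseline`."""
    return 100.0 * (baseline - reduced) / baseline


default_vocab = 32_000  # assumed reading of the "32k" default
small_vocab = 1_000     # assumed reading of the "1k" setting

base = embedding_params(default_vocab)
small = embedding_params(small_vocab)

print(f"Embedding parameters at 32k vocabulary: {base:,}")
print(f"Embedding parameters at 1k vocabulary:  {small:,}")
print(f"Embedding table reduction: {relative_reduction(base, small):.3f}%")
```

The rest of the network (attention and feed-forward layers) does not depend on vocabulary size, which is why the overall model shrinks by less than the embedding table alone.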