Publication information
SlamaTrain – Representative Training Dataset for Slavonic Large Language Models
Authors | |
---|---|
Year of publication | 2024 |
Type | Article in proceedings |
Conference | Recent Advances in Slavonic Natural Language Processing, RASLAN 2024 |
Faculty / MU workplace | |
Citation | |
www | Conference proceedings |
Keywords | Slama models, LLM, large language models, training, dataset |
Attached files | |
Description | The Slama project focuses on building a series of foundational language models for Slavonic languages. Even though the latest development yields a number of new large pre-trained and fine-tuned models, the main data source came from English-written websites. Therefore, the majority of the training data used for language model development consists of the English language. Multilingual language models like Llama, GPT-4o, mT5, etc. are also predominantly (around 80%) trained on the English language, even though they capture the structure of dozens of languages. In this paper, we detail the process of acquiring one of the largest training datasets for Czech, Slovak and other Slavonic languages. We started with huge multi-lingual datasets, extracted the mono-lingual data and joined them with other sources. The combined mono-lingual datasets were then cleaned, deduplicated and filtered for adult content. As a result, we have obtained 71 billion tokens for the Czech and Slovak languages suitable for training the Slama language models. |
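The abstract describes a pipeline of extracting mono-lingual data, cleaning, deduplicating and filtering it. A minimal illustrative sketch of such a pipeline is shown below; all function names, the language-tag scheme, and the keyword blocklist are hypothetical assumptions for illustration, not the paper's actual implementation (which operates at far larger scale).

```python
import hashlib

# Hypothetical placeholder blocklist for adult-content filtering.
ADULT_BLOCKLIST = {"badword1", "badword2"}

def is_target_language(doc, langs=frozenset({"cs", "sk"})):
    """Keep only documents tagged with a target language (assumed tag field)."""
    return doc.get("lang") in langs

def clean(text):
    """Minimal normalization: collapse runs of whitespace."""
    return " ".join(text.split())

def contains_adult_content(text):
    """Naive keyword-based filter standing in for the paper's filtering step."""
    return bool(set(text.lower().split()) & ADULT_BLOCKLIST)

def build_dataset(docs):
    """Extract mono-lingual texts, clean, exact-deduplicate, and filter."""
    seen, out = set(), []
    for doc in docs:
        if not is_target_language(doc):
            continue
        text = clean(doc["text"])
        digest = hashlib.sha256(text.encode()).hexdigest()  # exact dedup key
        if digest in seen or contains_adult_content(text):
            continue
        seen.add(digest)
        out.append(text)
    return out

docs = [
    {"lang": "cs", "text": "Ahoj  světe"},
    {"lang": "cs", "text": "Ahoj světe"},   # duplicate after cleaning
    {"lang": "en", "text": "Hello world"},  # wrong language, dropped
]
print(build_dataset(docs))  # → ['Ahoj světe']
```

At web scale, exact hashing would typically be complemented by near-duplicate detection (e.g. MinHash), but the sketch conveys the sequence of steps named in the abstract.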