Character-based Language Model

Publication details

Authors	BAISA Vít
Year of publication	2014
Type	Article in proceedings
Conference	Eighth Workshop on Recent Advances in Slavonic Natural Language Processing
MU Faculty or unit	Faculty of Informatics
Citation
Web	https://nlp.fi.muni.cz/raslan/2014/6.pdf
Field	Linguistics
Keywords	language model; suffix array; LCP; trie; character-based; random text generator; corpus
Description	Language modelling and other natural language processing tasks are usually based on words. I present here a more general yet simpler approach to language modelling that uses much smaller units of text data: a character-based language model (CBLM). In this paper I describe the underlying data structure of the model and evaluate the model using standard measures (entropy, perplexity). As a proof of concept and an extrinsic evaluation, I also present a random sentence generator based on this model.
Related projects:	LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure
