Words’ Burstiness in Language Models

Publication details

Authors	RYCHLÝ Pavel
Year of publication	2011
Type	Article in proceedings
Conference	Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011
MU Faculty or unit	Faculty of Informatics
Web	https://nlp.fi.muni.cz/raslan/2011/paper17.pdf
Field	Linguistics
Keywords	Burstiness; Language models; Words' probability
Description	Good estimation of the probability of a single word is a crucial part of language modelling. The estimate is based on the raw frequency of the word in a training corpus. Such a computation gives a good estimate for function words and most very frequent words, but a poor one for most content words, because words tend to occur in clusters. This paper analyses words' burstiness and proposes a new unigram language model that handles bursty words much better. Evaluation of the model on two data sets shows consistently lower perplexity and cross-entropy for the new model.
Related projects:	Právní e-slovník - PES; Temporální aspekty znalostí a informací
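
To make the terms in the description concrete, the following is a minimal sketch of the raw-frequency unigram baseline that the abstract contrasts with, together with the cross-entropy and perplexity measures it mentions. It is not the model proposed in the paper. The function names (`train_unigram`, `cross_entropy`, `perplexity`) and the add-one smoothing with a single unknown-word slot are assumptions made for illustration only.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return a function giving a raw-frequency unigram probability.

    Assumption: add-one smoothing with one extra slot for unseen words,
    so that every word gets a non-zero probability.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 for the unknown-word slot

    def prob(word):
        return (counts.get(word, 0) + 1) / (total + vocab_size)

    return prob


def cross_entropy(prob, tokens):
    """Average negative log2 probability per token (bits per word)."""
    return -sum(math.log2(prob(w)) for w in tokens) / len(tokens)


def perplexity(prob, tokens):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(prob, tokens)
```

A model of this kind assigns the same probability to every occurrence of a word, wherever it appears. The description argues that this is a poor fit for content words, which tend to cluster inside a few documents, so a lower perplexity on held-out text would indicate a better fit.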
