Publication information
Utok: The Fast Rule-based Tokenizer
Authors | |
---|---|
Year of publication | 2022 |
Type | Article in proceedings |
Conference | Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022 |
MU Faculty / Workplace | |
Citation | |
www | |
Keywords | tokenizer; tokenization; text processing |
Description | Tokenization is one of the first processing steps in most natural language processing applications. The paper introduces Utok, a new tokenizer that follows the Unitok tokenizer in offering simple configuration for different languages while processing text much faster. |