Project information
Very Large Language Corpora and Their Automatic Analysis

Project Identification
GA405/03/0913
Project Period
1/2003 - 12/2005
Investor / Programme / Project type
Czech Science Foundation
MU Faculty or unit
Faculty of Informatics
Keywords
Very Large Corpora; Natural Language Processing; Statistical Methods in NLP
Cooperating Organization
Charles University

Language corpora are an indispensable part of current linguistic research. Their uses range from simple lookup of particular words to training statistical language models and fully automatic analysis at various linguistic levels. The usability of monolingual, multilingual, and spoken-language corpora is substantially enhanced if the language material they contain is linguistically analyzed. Such annotation can reflect both the form and the function of linguistic units in their context. The primary goal of the project is to enhance our understanding of the natural language system in general and of Czech in particular, and to develop or enhance statistical machine learning methods, symbolic methods, and their combinations so that large quantities of naturally occurring text, written or spoken, can be analyzed automatically. The project will build on the results of previous projects in the field, especially existing data (texts) and methodology. Very large corpora play a twofold role: they are a source of automatically acquirable language information, and they are a target for the automatic analysis methods developed in the course of the project (mainly for lexicographic purposes, and also for further linguistic studies). The results of the project, including software tools and data, will be published.

Publications

Total number of publications: 3

