Project information
Very Large Language Corpora and Their Automatic Analysis
- Project Identification
- GA405/03/0913
- Project Period
- 1/2003 - 12/2005
- Investor / Programme / Project type
- Czech Science Foundation
- Standard Projects
- MU Faculty or unit
- Faculty of Informatics
- Keywords
- Very Large Corpora; Natural Language Processing; Statistical Methods in NLP
- Cooperating Organization
- Charles University
- Responsible person prof. RNDr. Jan Hajič, Dr.
Language corpora are an indispensable part of current linguistic research. They are used for purposes ranging from simple lookup of particular words to training statistical language models and fully automatic analysis at various linguistic levels. The usability of monolingual, multilingual, and spoken language corpora is substantially enhanced if the language material they contain is linguistically analyzed (annotated). Annotation can reflect both the form and the function of linguistic units in their context. The primary goal of the project is to enhance our understanding of the natural language system in general and of Czech in particular, and to develop and/or enhance statistical machine learning and symbolic methods (and their combinations) in order to automatically analyze large quantities of naturally occurring texts, whether written or spoken. Results of previous projects in the field will be used, especially existing data (texts) and methodology. The role of very large corpora is twofold: they serve as a source of automatically acquirable language information and as a target for applying the methods of automatic analysis developed in the course of the project (mainly for lexicographic purposes and for further linguistic studies). The results of the project, including software tools and data, will be published.
Publications
Total number of publications: 3
2004
-
Corpus Analysis for Lexical Database Construction: A Case of Russian and Czech Wordnets
Proceedings of the 33rd International Conference on Linguistics, year: 2004
-
Grammatical Heads Optimized for Parsing and Their Comparison with Linguistic Intuition
Proceedings of the Seventh International Conference on Text, Speech and Dialogue, TSD 2004, year: 2004
-
Syntactic analysis of natural languages based on context free grammar backbone
Proceedings of the 21st Workshop on Information Technologies, MIS 2004, year: 2004