Project information
Very Large Language Corpora and Their Automatic Analysis

Project Identification

GA405/03/0913

Project Period

1/2003 - 12/2005

Investor / Programme / Project type

Czech Science Foundation

Standard Projects

MU Faculty or unit

Faculty of Informatics

prof. PhDr. Karel Pala, CSc.

Keywords

Very Large Corpora; Natural Language Processing; Statistical Methods in NLP

Cooperating Organization

Charles University

Responsible person

prof. RNDr. Jan Hajič, Dr.

Language corpora are an indispensable part of current linguistic research. They are used for various purposes, from simple lookup of particular words to sophisticated uses such as training statistical language models or fully automatic analysis at various linguistic levels. The usability of monolingual, multilingual, and spoken language corpora is substantially enhanced if the language material they contain is linguistically analyzed. Annotation can reflect both the form and the function of linguistic units in their context. The primary goal of the project is to enhance our understanding of the natural language system in general and of Czech in particular, and to develop and/or enhance statistical machine learning methods, symbolic methods, and their combinations in order to automatically analyze large quantities of naturally occurring text, whether written or spoken. Results of previous projects in the field will be used, especially existing data (texts) and methodology. Very large corpora play a twofold role: as a source of automatically acquirable language information, and as a target for the methods of automatic analysis developed in the course of the project (mainly for lexicographic purposes and for further linguistic studies). The results of the project will be published, including software tools and data.

Publications

Total number of publications: 3

2004

Corpus Analysis for Lexical Database Construction: A Case of Russian and Czech Wordnets

SMRŽ Pavel, SINOPALNIKOVA Anna

Article in Proceedings

Proceedings of the 33rd International Conference on Linguistics, year: 2004

Grammatical Heads Optimized for Parsing and Their Comparison with Linguistic Intuition

KADLEC Vladimír, SMRŽ Pavel

Article in Proceedings

Proceedings of the Seventh International Conference on Text, Speech and Dialogue, TSD 2004, year: 2004

Syntactic analysis of natural languages based on context free grammar backbone

KADLEC Vladimír, SMRŽ Pavel

Article in Proceedings

Proceedings of the 21st Workshop on Information Technologies, MIS 2004, year: 2004
