Project information
semANT - Semantic Document Exploration
- Project Identification
- DH23P03OVV060
- Project Period
- 3/2023 - 12/2027
- Investor / Programme / Project type
- Ministry of Culture of the CR
- NAKI III: Applied research of national and cultural identity (2023-2030)
- MU Faculty or unit
- Faculty of Social Studies
- Keywords
- digital library; topic identification; semantic document search; content exploration; content visualization
- Cooperating Organization
- The Moravian Library Brno
- Responsible person Boris Lehečka
- Responsible person Ing. Petr Žabička
- Responsible person Bc. Martina Dvořáková
- Responsible person Ing. Michal Hradiš, Ph.D.
- Responsible person doc. RNDr. Pavel Smrž, Ph.D.
Czech libraries and archives contain a huge number of digitized documents. The possibilities of their online presentation and search have been improving significantly in recent years. A large part of modern printed documents is already processed by OCR and therefore fully searchable. Also, there are tools for automatic transcription of old prints and handwritten documents. Their complete transcription is now only a matter of time.
However, the full-text search used in library systems is of the simplest kind: it can match different forms of a word, but not its meaning. Finding documents on a particular topic is therefore very laborious. In contrast, current web search engines work with the meanings of words, making it possible to find texts that are relevant to the searched topic even when they do not contain the exact search term.
The main goal of this project is therefore to make the full text of digitized documents searchable at the level of meaning and to support natural navigation between related documents. We will provide users with a semantically enhanced full-text search and the ability to search by text segments (e.g., paragraphs) while specifying a topic of interest. The system will work with automatically identified topics but will also allow users to define their own topics based on examples.
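As an illustration only, the combination of segment-level semantic search with a topic defined from examples could be sketched as follows. This is not the project's implementation: the three-dimensional vectors are hand-made stand-ins for embeddings that a real system would obtain from a language model, and the names `rank_segments`, `topic_weight`, and the segment identifiers are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean; here it stands for a topic defined by examples."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def rank_segments(segments, query_vec, topic_examples, topic_weight=0.5):
    """Rank segment ids by a weighted mix of query and topic similarity.

    Assumption: a plain weighted sum is used purely for illustration.
    """
    topic_vec = centroid(topic_examples)
    scored = []
    for seg_id, vec in segments.items():
        score = ((1 - topic_weight) * cosine(vec, query_vec)
                 + topic_weight * cosine(vec, topic_vec))
        scored.append((score, seg_id))
    scored.sort(reverse=True)
    return [seg_id for _, seg_id in scored]

# Toy, hand-made "embeddings" of three paragraphs (hypothetical data).
SEGMENTS = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.9, 0.1],
    "p3": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
topic_examples = [[0.8, 0.2, 0.0], [0.6, 0.4, 0.2]]

ranking = rank_segments(SEGMENTS, query, topic_examples)
print(ranking)
```

In this sketch the user-defined topic is simply the mean of the example vectors, so adding or removing examples shifts which paragraphs rank highest without changing the query itself.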
The identification of topics will also be used to visualize the frequency of their occurrences and mutual interactions. Thus, it will be possible to track the evolution of topics over time, their continuity and transformation, or their connection to known named entities such as places and persons.
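The kind of aggregation that such visualizations rest on can be sketched under simple assumptions: each document carries a year and a list of already identified topics. The data, topic labels, and function names below are invented for illustration and do not come from the project.

```python
from collections import Counter, defaultdict

def topic_counts_by_year(documents):
    """Count, per year, how many documents mention each topic."""
    counts = defaultdict(Counter)
    for year, topics in documents:
        counts[year].update(set(topics))  # count a topic once per document
    return counts

def topic_cooccurrence(documents):
    """Count how often two topics appear in the same document."""
    pairs = Counter()
    for _, topics in documents:
        unique = sorted(set(topics))
        for i, first in enumerate(unique):
            for second in unique[i + 1:]:
                pairs[(first, second)] += 1
    return pairs

# Hypothetical toy input: (year, topics identified in one document).
DOCS = [
    (1900, ["railway", "harvest"]),
    (1900, ["railway"]),
    (1910, ["railway", "theatre"]),
]

by_year = topic_counts_by_year(DOCS)
pairs = topic_cooccurrence(DOCS)
print(dict(by_year[1900]), dict(pairs))
```

Per-year counts of this kind would feed a timeline of a topic's frequency, while the co-occurrence counts give a rough measure of how topics interact.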
The results of the project will be used both by the general public for routine work with library systems and by the scientific community for enhanced text analysis. We also hope that parts of the project will find application in software for the analysis of contemporary media and social networks.