Publication details
SpiderLing
Authors | |
---|---|
Year of publication | 2012 |
MU Faculty or unit | |
web | Software home page, source code |
Description | SpiderLing (a web spider for linguistics) is software for obtaining large amounts of text data from the web. The data are intended for building text corpora. Many documents on the web contain only material unsuitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not made up of full sentences. In fact, such pages make up the vast majority of the web. An unrestricted web crawl therefore typically downloads a lot of data that is filtered out during post-processing, which makes web corpus collection inefficient. SpiderLing focuses the crawl on the text-rich parts of the web and maximizes the number of words in the final corpus per downloaded megabyte. The crawler was used to build large corpora in American Spanish, Arabic, Czech, English, Estonian, French, Hungarian, Japanese, Korean, Polish, Russian, Tajik, and six Turkic languages, totalling 74 billion words. SpiderLing is open-source software, licensed under the GNU General Public License and available for download (including the source code) at http://nlp.fi.muni.cz/trac/spiderling/. The research related to this software was published in the following papers at international venues: 1) Efficient Web Crawling for Large Text Corpora by Jan Pomikálek and Vít Suchomel, at the ACL SIGWAC Web as Corpus workshop (at the WWW conference), April 2012; 2) Large Corpora for Turkic Languages and Unsupervised Morphological Analysis by Vít Baisa and Vít Suchomel, at the Language Resources and Technologies for Turkic Languages workshop (at the LREC conference), May 2012. |
Related projects: |