SpiderLing

Suchomel, Vít

Publication details

Authors	SUCHOMEL Vít
Year of publication	2012
Type	Software
MU Faculty or unit	Faculty of Informatics
Web	Software home page, source code
Description	SpiderLing (a web spider for linguistics) is software for obtaining large amounts of textual data from the web. The obtained data is intended for building text corpora. Many documents on the web contain only material unsuitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text that do not consist of full sentences. In fact, such pages represent the vast majority of the web. Unrestricted web crawls therefore typically download a lot of data that is filtered out during post-processing, which makes web corpus collection inefficient. SpiderLing focuses the crawl on the text-rich parts of the web and maximizes the number of words in the final corpus per downloaded megabyte. The crawler was used to build large corpora in American Spanish, Arabic, Czech, English, Estonian, French, Hungarian, Japanese, Korean, Polish, Russian, Tajik, and six Turkic languages, consisting of 74 billion words altogether. SpiderLing is open-source software, licensed under the GNU General Public License and available for download (including the source code) at http://nlp.fi.muni.cz/trac/spiderling/. The research related to this software was published in the following papers at international venues: 1) "Efficient Web Crawling for Large Text Corpora" by Jan Pomikálek and Vít Suchomel, at the ACL SIGWAC Web as Corpus workshop (at the WWW conference), April 2012; 2) "Large Corpora for Turkic Languages and Unsupervised Morphological Analysis" by Vít Baisa and Vít Suchomel, at the workshop Language Resources and Technologies for Turkic Languages (at the LREC conference), May 2012.
Related projects:	LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure
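
The description's efficiency measure, words in the final corpus per downloaded megabyte, can be illustrated with a minimal sketch. This is not SpiderLing's actual code: the `DomainStats` class, the `rank_domains` helper, the example domains and figures, and the choice of 1 MB = 1,000,000 bytes are all assumptions made for illustration. It only shows how per-domain yield could be computed and used to prefer text-rich sites.

```python
from dataclasses import dataclass


@dataclass
class DomainStats:
    """Running totals for one web domain (hypothetical structure)."""
    downloaded_bytes: int = 0  # raw bytes fetched from the domain
    clean_words: int = 0       # words kept after post-processing

    def yield_rate(self) -> float:
        """Words retained per downloaded megabyte (0.0 if nothing was fetched)."""
        if self.downloaded_bytes == 0:
            return 0.0
        # Assumption: one megabyte is taken as 1,000,000 bytes.
        return self.clean_words / (self.downloaded_bytes / 1_000_000)


def rank_domains(stats: dict[str, DomainStats]) -> list[str]:
    """Return domain names ordered from most to least text-rich."""
    return sorted(stats, key=lambda d: stats[d].yield_rate(), reverse=True)


# Made-up figures: a text-heavy blog versus a product-listing site.
stats = {
    "shop.example": DomainStats(downloaded_bytes=5_000_000, clean_words=10_000),
    "blog.example": DomainStats(downloaded_bytes=2_000_000, clean_words=150_000),
}
ranking = rank_domains(stats)
print(ranking)
```

A crawler following this idea would spend more of its download budget on domains near the top of the ranking; how SpiderLing itself makes that decision is described in the papers listed above.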
