Greek web as corpus is a 100-million-word collection of POS-tagged texts downloaded from the Internet. It was prepared by Milos Husak of Masaryk University, Brno, for Lexical Computing Ltd., in collaboration with the Greek publishers Patakis and the Greek software company Neurolingo.
Tokenization and part-of-speech tagging use the NeuroLingo Collection Analyzer, which provides the following information for each token:
word, lemma, tag, morph
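The four attributes listed above can be read programmatically. The following is a minimal sketch, assuming the corpus is stored in a vertical format with one token per line and the four attributes separated by tabs; that layout, the `Token` type, the `parse_token_line` helper, and the placeholder tag values are illustrative assumptions, not something specified by the corpus description.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One token with the four attributes named in the corpus description."""
    word: str
    lemma: str
    tag: str
    morph: str


def parse_token_line(line: str) -> Token:
    """Split one token line into its four attributes.

    Assumption: attributes are tab-separated, in the order
    word, lemma, tag, morph.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    return Token(*fields)


# "TAG" and "MORPH" are placeholders, not real values from the tagset.
example = parse_token_line("σπίτια\tσπίτι\tTAG\tMORPH\n")
print(example.word, example.lemma)
```

A named tuple keeps the attribute order explicit, so later code can refer to `token.lemma` instead of a positional index.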
The sketch grammar, used for the generation of Greek word sketches and distributional thesaurus, was developed by Mavina Pantazara and Christos Tsalidis of Neurolingo.
The corpus is divided into documents (<doc></doc>), each identified by its id and also carrying information about its url, genre, year and epoch of publication. Each document is further structured using the following tags:
paragraphs: <p></p>
sentences: <s></s>
headers: <h></h>
lists: <ul></ul>
list lines: <li></li>
non-greek words: <non-greek></non-greek>
glue: <g/>
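To illustrate how this structure might be traversed, here is a hedged sketch that counts structural tags and tokens and collects the attributes of each document. It assumes a vertical layout in which every structural tag and every token sits on its own line, and in which document attributes are written as `name="value"` pairs; the `summarize` function and the sample lines are hypothetical and are not taken from the corpus itself.

```python
import re
from collections import Counter

# name="value" pairs inside an opening tag (assumed attribute syntax)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
# An opening or self-closing tag alone on a line, e.g. <s>, <doc ...>, <g/>
OPEN_TAG_RE = re.compile(r'^<([A-Za-z][\w-]*)(\s[^>]*)?/?>$')


def summarize(lines):
    """Count opening tags and tokens; collect attributes of each <doc>."""
    counts = Counter()
    docs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = OPEN_TAG_RE.match(line)
        if match:
            name = match.group(1)
            counts[name] += 1
            if name == "doc":
                docs.append(dict(ATTR_RE.findall(line)))
        elif line.startswith("</"):
            continue  # closing tags carry no extra information here
        else:
            counts["token"] += 1
    return counts, docs


# Hypothetical sample; the attribute values are invented for illustration.
sample = [
    '<doc id="d1" url="http://example.org/a" genre="news" year="2007">',
    "<p>",
    "<s>",
    "w1\tl1\tTAG\tMORPH",
    "<g/>",
    ".\t.\tTAG\tMORPH",
    "</s>",
    "</p>",
    "</doc>",
]
counts, docs = summarize(sample)
print(dict(counts))
print(docs)
```

In this sketch the glue tag `<g/>` is counted like any other structure; a consumer that rebuilds running text would typically use it to suppress the space between the neighbouring tokens.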
The texts were downloaded using WebBootCaT from a URL list generated from a list of Greek words provided by Patakis.
documents: 96861
max doc per server: 250
date: October 2007
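The per-server limit listed above can be pictured as a simple cap applied to the URL list. The sketch below is only an illustration of that idea, assuming that "server" means the host part of the URL; `cap_per_server` is a hypothetical helper and does not reproduce how WebBootCaT itself enforces the limit.

```python
from collections import Counter
from urllib.parse import urlsplit


def cap_per_server(urls, max_per_server=250):
    """Keep at most max_per_server URLs per host, preserving input order."""
    seen = Counter()
    kept = []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        if seen[host] < max_per_server:
            seen[host] += 1
            kept.append(url)
    return kept


# Hypothetical URLs, capped at 2 per host to keep the example short.
urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
]
print(cap_per_server(urls, max_per_server=2))
```

A cap of this kind keeps a single large site from dominating the collection.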
Paper related to the BootCaT and WebBootCaT tools:
Marco Baroni, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý (2006). WebBootCaT: instant domain-specific corpora to support human translators. In Proceedings of EAMT, 11th Annual Conference of the European Association for Machine Translation. Oslo, Norway, pp. 247–252.