This Romanian web corpus was gathered from the web by Monica Macoveiciuc (Alexandru Ioan Cuza University, Iasi) using two methods, based on WebBootCat and Heritrix respectively. The text collected with these tools was then processed to remove unwanted content. The first version dates from August 2009. A programme of additions and improvements over a number of years is anticipated.
The corpus was part-of-speech tagged and lemmatized using TTL (Tokenizing, Tagging and Lemmatizing free running texts), developed by RACAI, the Research Institute for Artificial Intelligence of the Romanian Academy.
Word sketches were prepared by Monica Macoveiciuc.
See the Romanian tagset.