The corpus was prepared by Marco Baroni using a web crawl, as described in the paper below.

It was part-of-speech tagged and lemmatised using TreeTagger, a leading part-of-speech tagger which has been trained for a number of languages.

Word sketches are currently in preparation.


Bibliography

Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A Corpus Factory for Many Languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.