The Italian web corpus belongs to the TenTen generation of corpora. It was crawled from the Web in 2010 and contains almost 2.6 billion words. The corpus is tokenised and lemmatised. It was tagged with TreeTagger using Marco Baroni’s parameter file.

See the Italian part-of-speech tagset.
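
As a rough illustration of what a tokenised, lemmatised and tagged corpus looks like in practice, the sketch below reads TreeTagger-style output, with one token per line and tab-separated columns. The column order (word, tag, lemma), the structural `<s>` lines, the sample sentence and its tags are assumptions made for this example and are not taken from the corpus or checked against the tagset; `parse_tagged_lines` is a hypothetical helper name.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> List[Token]:
    """Collect (word, tag, lemma) triples from tab-separated lines.

    Lines that do not have exactly three tab-separated fields
    (for example structural markup such as <s>) are skipped.
    """
    tokens = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            tokens.append(Token(*parts))
    return tokens


# Illustrative sample only: the tags are assumed, not verified against the tagset.
sample = [
    "<s>",
    "Il\tART\til",
    "gatto\tNOUN\tgatto",
    "dorme\tVER:fin\tdormire",
    ".\tSENT\t.",
    "</s>",
]

tokens = parse_tagged_lines(sample)
lemmas = [t.lemma for t in tokens]
print(lemmas)
```

Keeping each token as a named triple makes it easy to filter by tag or to count lemmas without relying on column positions.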

Changelog

v. 1.0 (9 September 2010)

  • initial version – 2.6 billion words