itTenTen: Corpus of the Italian Web
The Italian Web corpus (itTenTen) is a corpus of Italian texts collected from the internet. It is part of the TenTen corpus family, a set of web corpora processed in the same way, each with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
The corpus texts were cleaned, deduplicated, and then part-of-speech tagged and lemmatized with the TreeTagger tool using Marco Baroni’s parameter file. The POS tagset description is available here.
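To illustrate what tagged and lemmatized text looks like, here is a minimal Python sketch that reads TreeTagger-style output, assuming the common three-column layout of one token per line (word, POS tag, lemma, separated by tabs). The sample sentence, the `Token` type, the `parse_treetagger` helper, and the tags shown are illustrative assumptions, not taken from itTenTen; consult the tagset description for the actual tag inventory.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One line of tagger output: surface form, POS tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_treetagger(text: str) -> List[Token]:
    """Parse tab-separated 'word<TAB>pos<TAB>lemma' lines into tokens.

    Lines that do not have exactly three columns (for example blank
    lines or structural markup) are skipped.
    """
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Hypothetical sample; the tags are placeholders for illustration only.
SAMPLE = (
    "Il\tART\til\n"
    "gatto\tNOUN\tgatto\n"
    "dorme\tVER:fin\tdormire\n"
    ".\tSENT\t.\n"
)

tokens = parse_treetagger(SAMPLE)
lemmas = [t.lemma for t in tokens]
```

With tokens in this form, a search by lemma (for example, all forms of "dormire") reduces to filtering on the `lemma` field instead of matching every inflected form.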