itWaC: Italian web corpus
The Italian web corpus (itWaC) is a language corpus made up of texts collected from the Internet. The corpus consists of 1.5 billion words and was prepared by Marco Baroni. The texts are part-of-speech tagged and lemmatized with the TreeTagger tool. Users can also explore the grammatical and collocational behavior of Italian words thanks to a word sketch grammar prepared by Marco Baroni and later updated by Valentina Efrati and Francesca Masini (TRIPLE lab, Roma Tre University).
See the Italian part-of-speech tagset for a description of the POS tags used in the corpus.
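To illustrate what tagged and lemmatized text looks like in practice, here is a minimal Python sketch that reads TreeTagger-style output, where each token sits on its own line as three tab-separated fields (word form, POS tag, lemma). It assumes that layout and that structural markup such as sentence boundaries appears on lines without tabs. The `Token` type, the `parse_vertical` function, and the sample lines are hypothetical and only illustrative; the tags shown are placeholders, so consult the tagset for the actual inventory.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One annotated token: surface form, POS tag, lemma."""
    word: str
    tag: str
    lemma: str


def parse_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from tab-separated word/tag/lemma lines.

    Lines that do not have exactly three tab-separated fields
    (blank lines, structural markup) are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        yield Token(*parts)


# Hypothetical sample; the tags are illustrative, not taken from the corpus.
sample = [
    "<s>\n",
    "Il\tART\til\n",
    "gatto\tNOUN\tgatto\n",
    "dorme\tVER:fin\tdormire\n",
    "</s>\n",
]

tokens = list(parse_vertical(sample))
lemmas = [t.lemma for t in tokens]
```

Working on lemmas rather than word forms lets inflected variants of the same word be counted together, which is what makes lemma-based frequency and collocation analysis possible.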
The corpus is cleaned and deduplicated. More information about how the corpus was built can be found in Marco Baroni & Adam Kilgarriff (2006).