itWaC: Italian corpus from the .it domain
The Italian web corpus (itWaC) is an Italian corpus made up of texts collected from the Internet. It consists of 1.5 billion words and was prepared by Marco Baroni. The corpus is cleaned and deduplicated, and its texts are part-of-speech tagged and lemmatized with the TreeTagger tool. Users can also explore the grammatical and collocational behavior of Italian words through a word sketch grammar prepared by Marco Baroni and later updated by Valentina Efrati and Francesca Masini (TRIPLE lab, Roma Tre University).
See the Italian part-of-speech tagset for a description of the POS tags used in the corpus.
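
To make the tagging and lemmatization concrete, here is a minimal Python sketch of how text annotated in this way might be read. It assumes the common TreeTagger-style layout of one token per line with three tab-separated columns (word form, POS tag, lemma). The sample sentence, the tag values, and the names `Token` and `parse_tagged` are illustrative assumptions, not taken from itWaC itself, so check the actual tags against the tagset.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One annotated token: surface form, POS tag, and lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagged(text: str) -> List[Token]:
    """Parse one-token-per-line, tab-separated annotation (word, POS, lemma).

    Lines without a tab (blank lines, structural markup such as <s>)
    and lines that do not have exactly three columns are skipped.
    """
    tokens = []
    for line in text.splitlines():
        if "\t" not in line:
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Hypothetical sample; the tags are placeholders for illustration only.
SAMPLE = (
    "<s>\n"
    "Il\tART\til\n"
    "gatto\tNOUN\tgatto\n"
    "dorme\tVER:fin\tdormire\n"
    "</s>\n"
)

tokens = parse_tagged(SAMPLE)
lemmas = [t.lemma for t in tokens]
print(lemmas)
```

Keeping the word form, tag, and lemma together in one record makes it straightforward to group tokens by lemma or filter them by POS tag, which is the kind of information a word sketch grammar builds on.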