The Italian Web corpus (itTenTen) is made up of texts collected from the Internet. It belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
The corpus texts are cleaned and deduplicated, then part-of-speech tagged and lemmatized with the TreeTagger tool using Marco Baroni’s parameter file. The POS tagset description is available here.
Overview of Italian TenTen corpora
Italian Web 2016 (itTenTen16) – 4.9 billion words (end of May – mid-August)
Italian Web 2010 (itTenTen10) – 2.5 billion words
A complete set of Sketch Engine tools is available to work with this Italian Web corpus to generate:
word sketch – Italian collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Italian nouns, verbs, adjectives etc. organized by frequency
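To illustrate what a frequency-ordered word list amounts to, the sketch below counts lemmas by part of speech in a small tagged sample. It is not Sketch Engine's implementation. It assumes the tab-separated token, tag, lemma layout that TreeTagger usually prints; the sample sentences, the tag names (NOUN, VER:fin, DET:def, CON) and the function name lemma_frequencies are illustrative assumptions and should be checked against the actual POS tagset description.

```python
from collections import Counter

# Hypothetical sample in TreeTagger's usual output layout:
# token<TAB>tag<TAB>lemma, one token per line.
# The tag names are assumptions made for illustration only.
SAMPLE = """\
Il\tDET:def\til
gatto\tNOUN\tgatto
mangia\tVER:fin\tmangiare
il\tDET:def\til
pesce\tNOUN\tpesce
e\tCON\te
il\tDET:def\til
gatto\tNOUN\tgatto
dorme\tVER:fin\tdormire
"""


def lemma_frequencies(tagged_text, tag_prefix):
    """Return (lemma, count) pairs for tags starting with tag_prefix,
    ordered from most to least frequent."""
    counts = Counter()
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            # Skip blank or malformed lines instead of failing.
            continue
        _token, tag, lemma = parts
        if tag.startswith(tag_prefix):
            counts[lemma.lower()] += 1
    return counts.most_common()


noun_list = lemma_frequencies(SAMPLE, "NOUN")
verb_list = lemma_frequencies(SAMPLE, "VER")

for lemma, count in noun_list:
    print(lemma, count)
```

Matching on a tag prefix rather than the full tag groups all verb forms (finite, infinitive and so on) under one list, which is closer to how a list of "verbs" is normally understood.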