itTenTen: Corpus of the Italian Web

The Italian Web corpus (itTenTen) is a text corpus of Italian texts collected from the internet. It belongs to the TenTen corpus family, a set of web corpora processed in the same way, each with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

The corpus texts are cleaned and deduplicated, then part-of-speech tagged and lemmatized with the TreeTagger tool using Marco Baroni’s parameter file. The POS tagset description is available here.

The complete set of Sketch Engine tools can be used with this Italian Web corpus to generate:

  • word sketch – Italian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Italian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
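To make the word list and n-gram items above concrete, here is a minimal Python sketch of how frequency lists of single words and multi-word units can be counted. It is an illustration only, not Sketch Engine's implementation: the function name `ngram_frequencies`, the toy sentence, and the naive whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields each n-gram as a tuple
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Toy sample, tokenized naively on whitespace (an assumption; the real
# corpus is tokenized, tagged and lemmatized by dedicated tools).
tokens = "la casa è bella e la casa è grande".split()

# Word list: single tokens ordered by frequency
word_list = Counter(tokens).most_common()

# N-grams: two-word units ordered by frequency
bigram_list = ngram_frequencies(tokens, 2).most_common()

for unit, freq in bigram_list:
    print(freq, unit)
```

On a real corpus the same counting would normally run over lemmas or tagged tokens rather than raw strings, which is what allows lists restricted to nouns, verbs or adjectives.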

Changelog

v. 1.0 (9 September 2010)

  • initial version – 2.6 billion words

Search the Italian Web corpus

Sketch Engine offers a range of tools to work with the Italian Web corpus.

Other text corpora in Sketch Engine

Sketch Engine provides access to 350+ language corpora.

Learn to use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn how in minutes.