Italian itWaC corpus

This Italian corpus was compiled by Marco Baroni from a web crawl, as described in a paper presented at EACL 2006 (paper available here).

It was part-of-speech tagged and lemmatised using TreeTagger, an open-source part-of-speech tagger which has been trained for a number of languages.

Italian word sketches were prepared by Marco Baroni and later updated by Valentina Efrati and Francesca Masini (TRIPLE lab, Roma Tre University).

Search the itWaC corpus in Sketch Engine

Sketch Engine offers a complete set of tools for working with the Italian itWaC corpus. They can generate:

  • word sketch – Italian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Italian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency lists of multi-word units
  • concordance – examples in context
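
To make three of the outputs listed above more concrete, here is a minimal Python sketch of what a word list, an n-gram frequency list and a concordance contain. It is an illustration only: it does not use Sketch Engine or its API, the function names are hypothetical, and the toy sentence is invented rather than taken from itWaC.

```python
from collections import Counter

# Toy tokenised text standing in for corpus data (invented, not from itWaC).
tokens = "il gatto dorme sul divano e il cane dorme sul tappeto".split()

def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()

def ngrams(tokens, n):
    """Count every run of n adjacent tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

if __name__ == "__main__":
    print(word_list(tokens)[:3])
    print(ngrams(tokens, 2).most_common(2))
    for line in concordance(tokens, "dorme"):
        print(line)
```

A real corpus tool works on lemmatised, part-of-speech-tagged text and on far larger data, so this sketch only shows the shape of the results, not how Sketch Engine computes them.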

Learn to use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn how in minutes.