itWaC: Italian corpus from the .it domain

The Italian web corpus (itWaC) is a corpus of texts collected from the Internet, specifically from the .it domain. It contains 1.5 billion words and was prepared by Marco Baroni. The texts are part-of-speech tagged and lemmatized with TreeTagger. Users can also explore the grammatical and collocational behavior of Italian words using a word sketch grammar prepared by Marco Baroni and later updated by Valentina Efrati and Francesca Masini (TRIPLE lab, Roma Tre University). The corpus has been cleaned and deduplicated.

Part-of-speech tagset

See the Italian part-of-speech tagset describing POS tags used in the corpus.

Search the itWaC corpus

The complete set of Sketch Engine tools is available for the itWaC corpus. Use them to generate:

  • word sketch – Italian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Italian nouns, verbs, adjectives, etc., organized by frequency
  • n-grams – frequency lists of multi-word units
  • concordance – examples in context
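
To make the last two outputs concrete, here is a minimal Python sketch of what an n-gram frequency list and a concordance look like on a toy Italian sentence. It does not use Sketch Engine or its API; the function names `ngram_counts` and `concordance` and the sample text are hypothetical and chosen only for illustration, and tokenization is a plain whitespace split rather than the TreeTagger processing applied to itWaC.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Toy sample (an assumption for illustration, not taken from itWaC).
text = "il gatto dorme sul divano e il gatto mangia il pesce"
tokens = text.split()

bigrams = ngram_counts(tokens, 2)
hits = concordance(tokens, "gatto", window=2)

# Most frequent bigrams first, then keyword-in-context lines.
for gram, freq in bigrams.most_common(3):
    print(" ".join(gram), freq)
for left, kw, right in hits:
    print(f"{left:>10} [{kw}] {right}")
```

On a real corpus the same ideas are applied to lemmatized, POS-tagged tokens, which is what allows results to be filtered by part of speech.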

Bibliography

BARONI, Marco, et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 2009, 43.3: 209–226.

BARONI, Marco; KILGARRIFF, Adam. Large linguistically-processed web corpora for multiple languages. In: Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations. Association for Computational Linguistics, 2006, pp. 87–90.

Other text corpora in Sketch Engine

Sketch Engine offers 400+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.