ruWaC: Russian web corpus

The Russian web corpus (ruWaC) is a language corpus made up of texts collected from the Internet. The corpus was prepared by Serge Sharoff at the University of Leeds, following the method described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).

The ruWaC corpus comprises 140 million words and contains word sketches created by Maria Khokhlova.

Part-of-speech tagset

The ruWaC corpus was part-of-speech (POS) tagged with TreeTagger, using a Russian model that was also trained by Serge Sharoff. The part-of-speech tagset legend is available here.

Tools to work with the Russian Web corpus

A complete set of Sketch Engine tools is available for working with the ruWaC corpus. The tools generate:

  • word sketch – Russian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Russian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency lists of multi-word units
  • concordance – examples in context
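
To make three of the outputs listed above more concrete, here is a minimal Python sketch of a frequency-ordered word list, an n-gram frequency list and a keyword-in-context concordance. It is not how Sketch Engine implements these tools, and it does not use any Sketch Engine API. The function names `ngram_counts` and `concordance` and the toy sentence are hypothetical and chosen only for illustration. A real corpus would also be tokenized and lemmatized first.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines with up to `width` tokens per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Toy input: a whitespace-tokenized sentence, not real corpus data.
tokens = "я читаю книгу и я читаю газету".split()

word_list = Counter(tokens).most_common()  # words ordered by frequency
bigrams = ngram_counts(tokens, 2)          # two-word units with counts
kwic = concordance(tokens, "читаю")        # each hit with its context
```

With this toy input, the bigram "я читаю" is counted twice and the concordance returns two lines, one for each occurrence of "читаю".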

Changelog

version 2 (28th August 2017)

  • created lemposes

initial version (2009)

  • size 147 million words

Bibliography

Corpus factory method

Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.

Russian word sketches

Khokhlova, M. (2010). Building Russian Word Sketches as Models of Phrases. Proc. EURALEX 2010, Leeuwarden.

Search the Russian Web corpus

Sketch Engine offers a range of tools to work with the Russian Web corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.