Russian TenTen corpus. Russian web corpus crawled by SpiderLing in 2011. Encoded in UTF-8, cleaned and deduplicated. Tagged with RFTagger and TreeTagger.

See the Russian tagset.

Changelog

v. 1.0 (April 2012)

  • initial version – 15.8 billion words

v. 1.1 (May 9 2014)

  • removed documents containing Ukrainian characters [ІіЇїЄє] or Belarusian characters [Ўў]
  • removed documents from sites with a high relative frequency of the word порно (porn)
  • version currently being re-processed – 14.5 billion words
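
The two v. 1.1 filters can be illustrated with a minimal Python sketch. This is not the pipeline actually used to build the corpus: the function names, the (site, text) input shape, the tokenization, the order of the two steps and the max_rel_freq threshold are all assumptions made for illustration. Only the character classes are taken from the changelog entry.

```python
import re
from collections import defaultdict

# Character classes copied from the v. 1.1 changelog entry:
# Ukrainian [ІіЇїЄє] and Belarusian [Ўў].
NON_RUSSIAN_CHARS = re.compile("[ІіЇїЄєЎў]")


def has_non_russian_chars(text: str) -> bool:
    """Return True if the text contains any of the listed characters."""
    return NON_RUSSIAN_CHARS.search(text) is not None


def filter_documents(docs, max_rel_freq=0.001):
    """Apply both filters to an iterable of (site, text) pairs.

    max_rel_freq is a hypothetical threshold; the changelog only says
    "high relative frequency" and gives no number.
    """
    # Step 1: drop documents with Ukrainian or Belarusian characters.
    kept = [(site, text) for site, text in docs
            if not has_non_russian_chars(text)]

    # Step 2: compute the per-site relative frequency of the word "порно".
    hits = defaultdict(int)
    totals = defaultdict(int)
    for site, text in kept:
        tokens = re.findall(r"\w+", text.lower())  # naive tokenization (assumption)
        totals[site] += len(tokens)
        hits[site] += tokens.count("порно")

    flagged = {site for site in totals
               if totals[site] and hits[site] / totals[site] > max_rel_freq}

    # Drop every document from a flagged site.
    return [(site, text) for site, text in kept if site not in flagged]
```

Calling filter_documents on a list of (site, text) pairs returns the pairs that pass both filters. The frequency is computed per site rather than per document, which matches the changelog wording "documents from sites".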

(November 2014)

  • dynamic case, number, gender
  • gender lemma attribute