ruWaC: Russian web corpus
The Russian web corpus (ruWaC) is a language corpus made up of texts collected from the Internet. The corpus was prepared by Serge Sharoff at the University of Leeds according to the standards described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The ruWaC corpus comprises 140 million words and includes word sketches created by Maria Khokhlova.
The ruWaC corpus was part-of-speech (POS) tagged with TreeTagger, which was trained for Russian, also by Serge Sharoff. The part-of-speech tagset legend is available here.
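For readers who want to work with tagged text outside a corpus query tool, the following is a minimal sketch of reading TreeTagger-style output in Python. It assumes the common three-column layout (word form, POS tag, lemma, separated by tabs) that TreeTagger prints when run with its -token and -lemma options; whether a given ruWaC export uses exactly this layout is an assumption. The names Token, parse_tagged_lines and pos_frequencies are hypothetical, and the tags N and V in the sample are placeholders, not values from the actual Russian tagset.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One tagged token: word form, POS tag, lemma (hypothetical helper type)."""
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield a Token for every line with exactly three tab-separated fields.

    Lines with any other shape (blank lines, structural markup such as <s>)
    are skipped. This is an assumption about the input, not a documented
    property of ruWaC.
    """
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 3:
            continue
        yield Token(*parts)


def pos_frequencies(tokens: Iterable[Token]) -> Counter:
    """Count how often each POS tag occurs."""
    return Counter(token.pos for token in tokens)


# Illustrative sample only; the tags are placeholders.
sample = [
    "<s>\n",
    "кошка\tN\tкошка\n",
    "спит\tV\tспать\n",
    "</s>\n",
]

if __name__ == "__main__":
    for token in parse_tagged_lines(sample):
        print(token.word, token.pos, token.lemma)
```

The same functions can be applied to a file by passing an open file object, for example open(path, encoding="utf-8"), in place of the sample list. The tagset legend mentioned above is still needed to interpret the real tag values.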