RuSkELL was prepared with the help of the following participants from the National Research University “Higher School of Economics”:
  • Valentina Apresjan (general guidance)
  • Andrey Shestakoff, Ekaterina Chernyak (corpus cleaning mentors)
  • Timur Iskhakov (corpus cleaning)
The corpus was crawled by SpiderLing in 2011, encoded in UTF-8, cleaned and deduplicated. It was tagged with RFTagger and TreeTagger. Obscene language was removed from the corpus using a list of words prohibited in domain names in the “.рф” domain space. The size of the unzipped corpus is approximately 52 GB. It consists of 983,255,513 tokens and 10,394,826 unique lemmas.
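A word-list filter of the kind mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used for the corpus: the source does not say whether matching was done on word forms or lemmas, or whether whole sentences or whole documents were dropped. The names BLOCKLIST, contains_blocked and filter_sentences are hypothetical, and the list entries are placeholders rather than items from the real “.рф” list.

```python
import re

# Hypothetical placeholder entries; the real prohibited-word list is not reproduced here.
BLOCKLIST = {"badword", "плохоеслово"}

# \w matches Unicode letters in Python 3, so Cyrillic tokens are handled.
TOKEN_RE = re.compile(r"\w+")


def contains_blocked(sentence, blocklist=BLOCKLIST):
    """Return True if any lowercased token of the sentence is in the blocklist."""
    return any(tok.lower() in blocklist for tok in TOKEN_RE.findall(sentence))


def filter_sentences(sentences, blocklist=BLOCKLIST):
    """Keep only sentences that contain no blocklisted token (assumed policy)."""
    return [s for s in sentences if not contains_blocked(s, blocklist)]
```

Exact token matching is the simplest choice here; for a morphologically rich language such as Russian, matching on the lemmas supplied by the tagger would catch inflected forms as well.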
Reference
APRESJAN, Valentina, Vít BAISA, Olga BUIVOLOVA and Olga KULTEPINA. RuSkELL: Online Language Learning Tool for Russian Language. In Tinatin Margalitadze, George Meladze (eds.). Proceedings of the XVII EURALEX International Congress. Tbilisi: Ivane Javakhishvili Tbilisi State University, 2016, pp. 292-299 (8 pp.). ISBN 978-9941-13-542-2.
The PDF is available online; see page 292.