Tajik Web Corpus

The Tajik corpus is a 93-million-word web corpus created mainly from the Tajik top-level domain .tj (ca 43 million words) and the Tajik news portal ozodi.org (ca 19 million words).

The initial version of the corpus, created in 2011, contained ca 50 million words. The corpus was then extended to almost 93 million words and part-of-speech tagged.

Each part-of-speech tag is composed of the lemma of the given word and a number identifying one of 16 POS categories.
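The lemma-plus-number structure of a tag can be illustrated with a minimal Python sketch. The concrete serialisation is not given here, so the "lemma/number" layout, the "/" separator, the 1 to 16 numbering, and the names Tag and parse_tag are all assumptions made for illustration only; the tagset summary remains the authoritative reference.

```python
from typing import NamedTuple

NUM_POS_CATEGORIES = 16  # stated in the corpus description


class Tag(NamedTuple):
    lemma: str
    pos: int  # number identifying one of the 16 POS categories


def parse_tag(raw: str, sep: str = "/") -> Tag:
    """Split a hypothetical 'lemma<sep>number' string into its two parts.

    The layout and the separator are assumptions, not the corpus format.
    """
    # Split on the last separator so a lemma may itself contain `sep`.
    lemma, _, number = raw.rpartition(sep)
    if not lemma or not (number.isascii() and number.isdigit()):
        raise ValueError(f"not a lemma{sep}number tag: {raw!r}")
    pos = int(number)
    # Assumed numbering: categories run from 1 to 16.
    if not 1 <= pos <= NUM_POS_CATEGORIES:
        raise ValueError(f"POS number out of range: {pos}")
    return Tag(lemma, pos)
```

For example, parse_tag("китоб/3") would yield a Tag with lemma "китоб" and POS number 3; the number 3 is arbitrary here and is not taken from the actual tagset.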

The Tajik tagset summary shows all possible POS tags including examples from the corpus.

The authors of this corpus are Vít Suchomel and Pavel Šmerk.


  • 2012
    • corpus extended – 93 million words
    • corpus tagged – each tag consists of lemma and POS
  • 2011
    • corpus created – ca 50 million words


The corpus is accessible to all users including trial users.


DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. POS Annotated 50M Corpus of Tajik Language. In Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL 8/AfLaT 2012). Istanbul: European Language Resources Association (ELRA), 2012, pp. 93–98. ISBN 978-2-9517408-7-7.

DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. Towards 100M Morphologically Annotated Corpus of Tajik. In Aleš Horák and Pavel Rychlý (eds.). Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. Brno: Tribun EU, 2012, pp. 91–94. ISBN 978-80-263-0313-8.

Text types are available for years.