TajikWaC: Corpus of Tajik Web

The Tajik Web Corpus (TajikWaC) is a language corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010). The data was crawled by the SpiderLing web spider in 2011–2013 and comprises more than 93 million words with part-of-speech tagging.

The authors of this corpus are Vít Suchomel and Pavel Šmerk.

Part-of-speech tagset

Each POS tag is built from the lemma of the given word and a number identifying one of 16 POS categories; see the part-of-speech tagset legend.

Tools to work with the Tajik Web corpus

A complete set of Sketch Engine tools is available for working with the Tajik Web corpus. They can be used to generate:

  • word lists – lists of Tajik nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency lists of multi-word units
  • concordance – examples in context
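
To make the three outputs above concrete, here is a minimal Python sketch of what a word list, an n-gram list and a concordance are. It is not how Sketch Engine computes them and does not use any Sketch Engine API. The function names and the toy token sequence are illustrative assumptions, and the input is assumed to be already tokenized.

```python
from collections import Counter


def word_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=2):
    """Return each hit of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical toy sample, not taken from TajikWaC.
tokens = "китоб дар хона аст ва китоб дар мактаб аст".split()

top_words = word_list(tokens)
bigrams = ngrams(tokens, 2)
hits = concordance(tokens, "мактаб")
```

On a real corpus, the same counts would normally be computed over lemmas or filtered by POS category rather than over raw word forms.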

Changelog

2012

  • corpus extended – 93 million words
  • corpus tagged – each tag consists of a lemma and a POS category

2011

  • corpus created – ca. 50 million words

Bibliography

DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. POS Annotated 50M Corpus of Tajik Language. In Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL 8/AfLaT 2012). Istanbul: European Language Resources Association (ELRA), 2012, pp. 93–98. ISBN 978-2-9517408-7-7.

DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. Towards 100M Morphologically Annotated Corpus of Tajik. In Aleš Horák, Pavel Rychlý (eds.). Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. Brno: Tribun EU, 2012, pp. 91–94. ISBN 978-80-263-0313-8.
