Tajik Web Corpus | Sketch Engine

tgWaC: Corpus of Tajik Web

The Tajik Web Corpus (tgWaC) is a Tajik corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010). The data was crawled by the SpiderLing web spider in 2011–2013 and comprises more than 93 million words with part-of-speech tagging.

The authors of this corpus are Vít Suchomel and Pavel Šmerk.

Part-of-speech tagset

Each POS tag combines the lemma of the given word with a number that identifies one of 16 POS categories; see the part-of-speech tagset legend.

Tools to work with the Tajik corpus

A complete set of Sketch Engine tools is available for working with this Tajik Web corpus. They can generate:

word lists – lists of Tajik nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency lists of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
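
To illustrate what a frequency-ordered word list and an n-gram frequency list contain, here is a minimal Python sketch. It is not how Sketch Engine computes these lists; the function name `ngram_frequencies` and the placeholder tokens are assumptions made for the example, and a real run would use tokens exported from the corpus.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # Zipping n shifted copies of the token list yields all n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)


# Placeholder tokens standing in for a tokenized corpus sample.
tokens = "a b a b c a b".split()

# Word list: single tokens ordered by frequency.
word_list = Counter(tokens).most_common()

# N-gram list: here bigrams, ordered by frequency.
bigram_list = ngram_frequencies(tokens, 2).most_common()

print(word_list)
print(bigram_list[0])
```

On a real corpus the same counts would normally be taken over lemmas or POS-filtered tokens, which is what makes lists such as "nouns by frequency" possible.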

Changelog

2012

corpus extended – 93 million words
corpus tagged – each tag consists of lemma and POS

2011

corpus created – ca 50 million words

Bibliography

DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. POS Annotated 50M Corpus of Tajik Language. In Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL 8/AfLaT 2012). Istanbul: European Language Resources Association (ELRA), 2012, pp. 93–98. ISBN 978-2-9517408-7-7.

DOVUDOV, Gulshan, Vít SUCHOMEL and Pavel ŠMERK. Towards 100M Morphologically Annotated Corpus of Tajik. In Aleš Horák, Pavel Rychlý (eds.). Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. Brno: Tribun EU, 2012, pp. 91–94. ISBN 978-80-263-0313-8.

Search the Tajik Web corpus

Sketch Engine offers a range of tools to work with the Tajik corpus from the Web.

Other text corpora in Sketch Engine

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.
