The Turkic web corpora are a set of language corpora made up of texts collected from the Internet. They cover six Turkic languages: Azerbaijani, Kazakh, Kyrgyz, Turkish, Turkmen, and Uzbek. For more information about Turkish, see the Turkish Web corpus page.

Overview of the Turkic corpora

Language      Tokens   Documents   Crawled
AZERBAIJANI   93M      365k        Jan 2012
KAZAKH        137M     378k        Jan 2012
KYRGYZ        19M      67k         Jan 2012
TURKISH       3.37M    12M         Dec 2011, Jan 2012
TURKMEN       2M       5k          Jan 2012
UZBEK         18M      57k         Jan 2012

Source data

The source texts were crawled by the SpiderLing web spider in December 2011 and January 2012. Crawling was restricted to the top-level Internet domains of the countries where the selected languages are official languages (.az, .kz, .kg, .tr, .tm, .uz), with several exceptions allowed.
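The domain restriction described above can be illustrated with a short filter. This is only a minimal sketch, not SpiderLing's actual code: the function name `in_allowed_tld` is hypothetical, and the exceptions mentioned in the text are not modelled because they are not specified.

```python
from urllib.parse import urlsplit

# Country-code top-level domains listed in the text above.
ALLOWED_TLDS = {"az", "kz", "kg", "tr", "tm", "uz"}


def in_allowed_tld(url, allowed=ALLOWED_TLDS):
    """Return True if the URL's host ends in one of the allowed TLDs.

    Hypothetical helper for illustration; a real crawler would also
    need a list of explicitly permitted exceptions.
    """
    host = urlsplit(url).hostname  # lowercased by urllib, None if absent
    if not host:
        return False
    # Take the last dot-separated label of the host name.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in allowed


if __name__ == "__main__":
    for candidate in ("http://example.kz/page", "https://example.com/"):
        print(candidate, in_allowed_tld(candidate))
```

A crawler would apply such a check to every discovered link before adding it to the download queue.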

Tools to work with the Turkic web corpora

A complete set of Sketch Engine tools is available to work with these Turkic web corpora to generate:

  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency lists of multi-word units
  • concordance – examples in context
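To make the three outputs listed above concrete, here is a minimal sketch in plain Python. It is not how Sketch Engine computes them; the function names and the toy token sequence are assumptions made for illustration, and the word list counts raw tokens rather than part-of-speech classes.

```python
from collections import Counter


def word_list(tokens):
    """Frequency-ordered list of (token, count) pairs."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Frequency-ordered list of (n-gram tuple, count) pairs."""
    # Zip n shifted copies of the token list to form consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


def concordance(tokens, keyword, width=2):
    """Keyword-in-context lines as (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy token sequence for illustration only, not taken from the corpora.
tokens = "bir kitap ve bir kalem ve bir kitap".split()

if __name__ == "__main__":
    print(word_list(tokens))
    print(ngrams(tokens, 2))
    print(concordance(tokens, "kalem"))
```

On a real corpus the token list would come from a tokenizer, and the context width would normally be larger than two tokens.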


September 23, 2012

  • The part of the Turkic corpora crawled from the Turkish domain .tr was renamed to trTenTen [2012]

Initial version (March 6, 2012)

  • initial version, 6 languages
  • no tagging, no sketches


Turkic Web corpora

Vít Baisa and Vít Suchomel (2012). Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 28–32.
