Sketch Engine contains the following corpora for languages of the Turkic family:

language      words    documents   data updates
Azerbaijani   93M      365k        Jan 2012
Kazakh        137M     378k        Jan 2012
Kyrgyz        19M      67k         Jan 2012
Turkish       3,370M   12M         Dec 2011, Jan 2012
Turkmen       2M       5k          Jan 2012
Uzbek         18M      57k         Jan 2012

The building of these corpora, together with unsupervised morphological analysis, is described in the paper below.

These corpora were built in cooperation with the Natural Language Processing Centre at Masaryk University in Brno, Czech Republic.

Source data

The source texts were crawled by SpiderLing. Crawling was restricted to the top-level internet domains of the countries where the selected languages are official (az, kz, kg, tr, tm, uz), with several exceptions allowed.
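
The domain restriction can be illustrated with a minimal sketch. This is not SpiderLing's code; the function name `in_allowed_tld` and the `exceptions` parameter (a set of explicitly allowed host names standing in for the "several exceptions") are hypothetical and only show the idea.

```python
from urllib.parse import urlsplit

# Top-level domains listed above for the six languages.
ALLOWED_TLDS = {"az", "kz", "kg", "tr", "tm", "uz"}


def in_allowed_tld(url, allowed=ALLOWED_TLDS, exceptions=frozenset()):
    """Return True if the URL's host is in an allowed top-level domain.

    `exceptions` is a hypothetical set of host names accepted regardless
    of their top-level domain.
    """
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    if not host:
        return False
    host = host.rstrip(".")
    if host in exceptions:
        return True
    # The top-level domain is the last dot-separated label of the host.
    tld = host.rsplit(".", 1)[-1]
    return tld in allowed
```

A crawler could call such a check on every extracted link before adding it to its download queue.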

Postprocessing pipeline

  • character encoding detected by Chared (based on byte trigrams)
  • text in the wrong language filtered out (character trigram model)
  • boilerplate removed by Justext
  • exact duplicates removed by SpiderLing
  • near-duplicates removed by onion (paragraph level, 7-tuples of words, 50% similarity threshold)
  • tokenized using unitok (general settings: tokens are delimited by spaces, punctuation is a separate token)
  • compiled in the Sketch Engine
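
The near-duplicate removal step can be sketched roughly as follows. This is only an illustration of the stated parameters (paragraph level, 7-tuples of words, 50% threshold), not onion's actual implementation. It assumes that words are split on whitespace, that a paragraph is dropped when at least half of its 7-tuples were already seen in earlier kept paragraphs, and that paragraphs shorter than seven words are always kept. The names `shingles` and `deduplicate` are hypothetical.

```python
def shingles(words, n=7):
    """Return the set of all n-tuples of consecutive words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(paragraphs, n=7, threshold=0.5):
    """Keep paragraphs whose share of already-seen n-tuples is below threshold.

    Assumption: paragraphs with fewer than n words yield no n-tuples
    and are kept unconditionally.
    """
    seen = set()   # n-tuples from paragraphs kept so far
    kept = []
    for paragraph in paragraphs:
        grams = shingles(paragraph.split(), n)
        if grams:
            duplicate_ratio = len(grams & seen) / len(grams)
            if duplicate_ratio >= threshold:
                continue  # too similar to earlier text, drop it
        kept.append(paragraph)
        seen |= grams
    return kept
```

Working at the paragraph level means a document that repeats one block of text from another page loses only that block, while its original paragraphs remain in the corpus.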


September 23, 2012

  • The Turkictr part was renamed to trTenTen [2012]

v1.0 (March 6, 2012)

  • initial version, 6 languages
  • no tagging, no sketches


Vít Baisa and Vít Suchomel (2012). Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 28–32.