Sketch Engine contains the following corpora for the Turkic language family:
|Turkish||3,370M||12M||Dec 2011, Jan 2012|
The construction of these corpora, together with unsupervised morphological analysis, is described in the paper cited below.
These corpora were built in cooperation with the Natural Language Processing Centre at Masaryk University in Brno, Czech Republic.
The source texts were crawled by SpiderLing. Crawling was restricted to the top-level Internet domains of the countries where the selected languages are officially spoken (az, kz, kg, tr, tm, uz), with several exceptions allowed.
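The domain restriction described above can be illustrated with a minimal sketch. This is not SpiderLing's code: the function name `in_allowed_tld` is hypothetical, and the exceptions mentioned in the text are not listed in the source, so they are left out.

```python
from urllib.parse import urlsplit

# Top-level domains listed in the text above.
ALLOWED_TLDS = {"az", "kz", "kg", "tr", "tm", "uz"}

def in_allowed_tld(url, allowed=ALLOWED_TLDS):
    """Return True if the URL's host ends in one of the allowed TLDs.

    Hypothetical helper for illustration; the real crawler may apply
    the restriction differently.
    """
    host = urlsplit(url).hostname or ""          # None for malformed URLs
    tld = host.rstrip(".").rpartition(".")[2]    # text after the last dot
    return tld.lower() in allowed
```

For example, `in_allowed_tld("http://example.com.tr/page")` is accepted, while a URL under `.org` is rejected.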
The crawled texts were then processed as follows:
- character encoding detected by Chared (based on byte trigrams)
- text in the wrong language filtered out (character trigram model)
- boilerplate removed by jusText
- exact duplicates removed by SpiderLing
- near-duplicates removed by onion (paragraph level, 7-tuples of words, 50% similarity threshold)
- tokenized using unitok (general settings: tokens are delimited by spaces, punctuation is a separate token)
- compiled in Sketch Engine
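The tokenization and near-duplicate steps above can be sketched roughly as follows. This is a simplified illustration, not the actual unitok or onion implementation. The function names, the regular expression, and the reading of the 50% threshold as "more than half of a paragraph's 7-tuples were already seen" are all assumptions.

```python
import re

N = 7            # tuple length in words, as stated in the list above
THRESHOLD = 0.5  # assumed meaning: share of already-seen tuples

def tokenize(text):
    """Approximate the general settings: whitespace separates tokens,
    and each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def word_tuples(tokens, n=N):
    """Return the set of n-tuples of consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def deduplicate(paragraphs, n=N, threshold=THRESHOLD):
    """Drop a paragraph when too many of its n-tuples occurred earlier."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = word_tuples(tokenize(paragraph.lower()), n)
        # Paragraphs shorter than n tokens have no tuples and are kept.
        if grams and len(grams & seen) / len(grams) > threshold:
            continue
        kept.append(paragraph)
        seen |= grams
    return kept
```

Working at paragraph level means a document that repeats only some paragraphs loses just those paragraphs rather than being discarded as a whole.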
September 23, 2012
- The Turkictr part was renamed to trTenTen 
v1.0 (March 6, 2012)
- initial version, 6 languages
- no tagging, no sketches
Vít Baisa and Vít Suchomel (2012). Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 28–32.