The Turkic web corpora are a set of language corpora made up of texts collected from the Internet. They cover six Turkic languages: Azerbaijani, Kazakh, Kyrgyz, Turkish, Turkmen, and Uzbek. For more information about Turkish, see the Turkish Web corpus page.
Overview of the Turkic corpora
||WORDS (in millions)
||DOCUMENTS (in thousands)
||Dec 2011, Jan 2012
The source texts were crawled by the SpiderLing web spider in December 2011 and January 2012. Crawling was restricted to the top-level Internet domains of the countries where the selected languages are officially spoken (.az, .kz, .kg, .tr, .tm, .uz), with several exceptions allowed.
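The domain restriction described above can be illustrated with a minimal sketch. This is not SpiderLing's actual code or configuration: the function name `language_for_url` and the `TLD_TO_LANGUAGE` table are hypothetical, and the exceptions mentioned in the text are not modelled because they are not listed here.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table: the six top-level domains named in the text,
# each paired with the language of the corresponding country.
TLD_TO_LANGUAGE = {
    "az": "Azerbaijani",
    "kz": "Kazakh",
    "kg": "Kyrgyz",
    "tr": "Turkish",
    "tm": "Turkmen",
    "uz": "Uzbek",
}


def language_for_url(url):
    """Return the language tied to the URL's top-level domain, or None if out of scope."""
    host = urlsplit(url).hostname  # lower-cased host name, or None if absent
    if not host:
        return None
    # The top-level domain is the last dot-separated label of the host name.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return TLD_TO_LANGUAGE.get(tld)


if __name__ == "__main__":
    for candidate in ("http://example.kz/news", "https://www.example.com/"):
        print(candidate, "->", language_for_url(candidate))
```

A crawler using such a filter would skip any URL for which the function returns `None`. A real setup would also need an explicit allow-list for the permitted exceptions.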