Turkish TenTen corpus, crawled by SpiderLing in December 2011 and January 2012. The corpus was deduplicated by Onion, tokenized using unitok, and encoded in UTF-8.

The current version of the corpus has more than 3.3 million words.

v. 1.0

  • initial version, obtained from the web in 2012
  • no tagging, no sketches