The Bulgarian TenTen corpus was crawled by SpiderLing in November 2012. It was encoded in UTF-8, cleaned, and deduplicated, including the removal of all data from BulgarianNC2. This corpus is not tagged yet.

The current number of tokens is almost 850 million.

Changelog

v. 1.0

  • initial version, obtained from the web in 2012