Arabic TenTen corpus. Crawled by SpiderLing in January 2012. Encoded in UTF-8, cleaned, deduplicated. Tokenised and tagged by Stanford NLP Tools since v 1.1.

Changelog

v 1.1 (August 2015)

  • tokenised and tagged by Stanford NLP Tools

v 1.0 (April 2012)

  • initial version – 5.8 billion words