The Arabic Web corpus of the TenTen corpus family was crawled by SpiderLing in January 2012. It was encoded in UTF-8, cleaned, and deduplicated. The corpus is tokenized and tagged with the Stanford Arabic Parser.

See the Arabic tagset summary used in this corpus.


Changelog

v 1.1 (August 2015)

  • tokenized and tagged by Stanford NLP Tools

v 1.0 (April 2012)

  • initial version – 5.8 billion words