The Arabic Web corpus of the TenTen corpus family was crawled by SpiderLing in January 2012. It was encoded in UTF-8, cleaned, and deduplicated. The corpus is tokenized and tagged with the Stanford Arabic Parser.
See the Arabic tagset summary used in this corpus.
v 1.1 (August 2015)
- tokenized and tagged with Stanford NLP Tools
v 1.0 (April 2012)
- initial version – 5.8 billion words