The Arabic Web corpus, part of the TenTen corpus family, was crawled with SpiderLing in January 2012. It was encoded in UTF-8, cleaned, and deduplicated. The corpus is tokenised and tagged with the Stanford Arabic Parser.
For further information about the parser, see http://nlp.stanford.edu/software/parser-arabic-faq.shtml#d
v 1.1 (August 2015)
- tokenised and tagged with Stanford NLP tools
v 1.0 (April 2012)
- initial version – 5.8 billion words