The Arabic Web corpus of the TenTen corpus family was crawled by SpiderLing in January 2012. It was encoded in UTF-8, cleaned, and deduplicated. The corpus is tokenised and tagged with the Stanford Arabic Parser.

Tagset summary

Basic notation

NN  noun
VB  verb
JJ  adjective
RB  adverb
CC  conjunction
IN  preposition
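
The basic notation above can be used as a simple lookup table when processing tagged tokens. The following is a minimal sketch, not part of any corpus tooling: the names BASIC_TAGS, describe_tag and count_categories are hypothetical, and the input is assumed to be (word, tag) pairs. Any tag outside the six basic ones, such as those listed only in the complete notation, is reported as "other".

```python
# Hypothetical helper for illustration; not part of the corpus or the Stanford tools.
BASIC_TAGS = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
    "CC": "conjunction",
    "IN": "preposition",
}


def describe_tag(tag: str) -> str:
    """Return the basic category for a tag, or "other" if it is not a basic tag."""
    return BASIC_TAGS.get(tag.strip().upper(), "other")


def count_categories(tagged_tokens):
    """Count basic categories in an iterable of (word, tag) pairs (assumed input shape)."""
    counts = {}
    for _word, tag in tagged_tokens:
        category = describe_tag(tag)
        counts[category] = counts.get(category, 0) + 1
    return counts
```

Exact matching is used deliberately: the complete tagset contains longer tags, and this sketch does not try to guess how they relate to the basic ones.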

Complete notation

For the complete tagset, see http://nlp.stanford.edu/software/parser-arabic-faq.shtml#d


Changelog

v 1.1 (August 2015)

  • tokenised and tagged by Stanford NLP Tools

v 1.0 (April 2012)

  • initial version – 5.8 billion words