The Arabic Web Corpus (arTenTen) is a language corpus made up of texts collected from the Internet. It belongs to the TenTen corpus family, a set of web corpora built using the same method, each with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
Part of the corpus has been part-of-speech (POS) tagged and lemmatized with the MADA tool.
We have also created ‘word sketches’: one-page, automatic, corpus-derived summaries of a
word’s grammatical and collocational behavior. We use examples to demonstrate what the corpus can
show us regarding Arabic words and phrases and how this can support lexicography and inform