The English TenTen web corpus (enTenTen) is the largest English corpus in Sketch Engine. The 2013 version of the corpus contains approximately 19 billion words.
The corpus is tagged with TreeTagger using the UTF-8 English parameter file.
- region = Am for American English, Br for British English, None for unknown
- difficulty = all documents were split into 5 bands of equal size according to a GDEX score trained on learners’ corpora. Band 1 = easiest to understand, band 5 = hardest to understand.
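The equal-size banding behind the difficulty attribute can be illustrated with a minimal sketch. This is not the code used to build the corpus: the function name `assign_bands` is hypothetical, and the sketch assumes that a higher GDEX score means a document is easier to understand, which the description above does not state explicitly.

```python
def assign_bands(scores, n_bands=5):
    """Return a band number (1..n_bands) for each document score.

    Assumption: a higher score means an easier document, so the
    highest-scoring documents land in band 1.
    """
    # Rank document indices from highest score to lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    bands = [0] * len(scores)
    for rank, doc_index in enumerate(order):
        # Integer arithmetic cuts the ranking into n_bands slices
        # whose sizes differ by at most one document.
        bands[doc_index] = rank * n_bands // len(scores) + 1
    return bands


# Ten hypothetical document scores, two documents per band.
example_scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 1.0]
example_bands = assign_bands(example_scores)
```

Ranking first and then cutting the ranking into slices keeps the bands the same size regardless of how the scores are distributed, which fixed score thresholds would not guarantee.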
v1.0 (15 November 2010)
- initial version — 3.3 billion tokens
- crawled by Heritrix in 2008
- encoded in Latin1
v2.0 (14 June 2012)
- sample of enTenTen12 — 4.65 billion tokens
- crawled by SpiderLing in May 2012
- encoded in UTF-8
- full enTenTen12 — almost 13 billion tokens
- enTenTen13 — almost 23 billion tokens
- enTenTen15 processed using TreeTagger pipeline v2