The English TenTen web corpus is the largest English corpus in Sketch Engine. The 2013 version of the corpus contains approximately 19 billion words.

The corpus is tagged with TreeTagger using the UTF-8 English parameter file.

Structural attributes

Common TenTen corpora attributes

Document

  • region = Am for American English, Br for British English, None for unknown
  • difficulty = band from 1 to 5. All documents were split into 5 bands of equal size by a GDEX score trained on learners’ corpora. Band 1 = easiest to understand, band 5 = hardest to understand.
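
The equal-size banding described for the difficulty attribute can be illustrated with a minimal sketch. This is not the actual Sketch Engine pipeline: the function name assign_bands and the sample scores are hypothetical, and the sketch assumes that a higher score means an easier document, which the description above does not state.

```python
def assign_bands(scores, n_bands=5):
    """Return a band number (1..n_bands) for each score.

    Documents are ranked by score and cut into n_bands groups of
    (near-)equal size, so each band holds the same share of documents.
    """
    # Assumption: a higher score means an easier document, so the
    # highest-scoring documents land in band 1.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    bands = [0] * len(scores)
    for rank, idx in enumerate(order):
        # Integer arithmetic maps ranks 0..n-1 onto bands 1..n_bands.
        bands[idx] = rank * n_bands // len(scores) + 1
    return bands


# Hypothetical scores for ten documents.
example_scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 1.0]
example_bands = assign_bands(example_scores)
```

Because the bands are defined by rank rather than by fixed score thresholds, every band contains the same number of documents (up to rounding), regardless of how the scores are distributed.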

Changelog

v1.0 (15 November 2010)

  • initial version — 3.3 billion tokens
  • crawled by Heritrix in 2008
  • encoded in Latin1

v2.0 (14 June 2012)

  • sample of enTenTen2 — 4.65 billion tokens
  • crawled by SpiderLing in May 2012
  • encoded in UTF-8

v3 (2012)

  • full enTenTen12 — almost 13 billion tokens

v4 (2013)

  • enTenTen13 — almost 23 billion tokens

2015

  • enTenTen15 processed using TreeTagger pipeline v2