enTenTen: Corpus of the English Web

The English Web Corpus (enTenTen) is an English corpus made up of texts collected from the Internet. It belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.

Part-of-speech tagset

EnTenTen corpora are tagged by TreeTagger using the Penn Treebank tagset with Sketch Engine modifications.
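As a rough illustration of how Penn Treebank tags group into word classes, the sketch below maps a handful of standard tags to coarse categories by their two-letter prefix. The mapping and the function name `coarse_class` are assumptions made for this example only; the Sketch Engine modifications to the tagset are not described on this page and are not covered here.

```python
# Coarse word classes for some standard Penn Treebank tags, keyed by prefix.
# Illustrative only: the Sketch Engine modifications are not represented.
PENN_PREFIX_TO_CLASS = {
    "NN": "noun",       # NN, NNS, NNP, NNPS
    "VB": "verb",       # VB, VBD, VBG, VBN, VBP, VBZ
    "JJ": "adjective",  # JJ, JJR, JJS
    "RB": "adverb",     # RB, RBR, RBS
}


def coarse_class(tag: str) -> str:
    """Return a coarse word class for a Penn Treebank tag, or 'other'."""
    return PENN_PREFIX_TO_CLASS.get(tag[:2], "other")


if __name__ == "__main__":
    for tag in ["NNS", "VBD", "JJR", "DT"]:
        print(tag, coarse_class(tag))
```

Any tag outside these four prefixes, such as the determiner tag DT, falls into "other" in this sketch.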

Overview of English TenTen corpora

The following English web corpora have been crawled and processed over the last ten years:

  • English Web corpus 2015 (enTenTen15) – 15 billion words, with advanced genre classification and sophisticated spam removal; the corpus has not been published yet
  • English Web corpus 2013 (enTenTen13) – 19 billion words
  • English Web corpus 2012 (enTenTen12) – 11 billion words
  • English Web corpus 2008 (enTenTen08) – 2.7 billion words

A complete set of Sketch Engine tools is available to work with this English Web corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
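To make three of these outputs concrete, here is a minimal Python sketch that builds a frequency-ordered word list, an n-gram frequency list and a keyword-in-context concordance over a toy sentence. It is not how Sketch Engine computes them: the whitespace tokenizer and the function names `tokenize`, `word_list`, `ngrams` and `concordance` are assumptions made for this illustration.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def word_list(tokens):
    """Word list: (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """N-grams: frequency of every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Concordance: each hit of keyword with up to `width` tokens of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    toks = tokenize("the cat sat on the mat and the cat slept")
    print(word_list(toks)[:2])
    print(ngrams(toks, 2).most_common(1))
    print(concordance(toks, "sat", width=2))
```

On a corpus of billions of words these counts would need indexing rather than in-memory lists; the sketch only shows what each result contains.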

Changelog

English Web 2015 (enTenTen15)

  • initial size 28 billion words

version 2 (spring 2017)

  • 15 billion words
  • genre classification
  • in-depth analysis and removal of spam, including documents that are too short
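The removal of too-short documents can be pictured as a simple length filter. The sketch below is only an assumption about what such a step might look like: the function name `drop_short_documents`, the whitespace token count and the `min_tokens` threshold are hypothetical, and the actual criteria used for enTenTen15 are not stated on this page. Spam detection itself is not covered.

```python
def drop_short_documents(documents, min_tokens=50):
    """Keep only documents with at least `min_tokens` whitespace tokens.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return [doc for doc in documents if len(doc.split()) >= min_tokens]


if __name__ == "__main__":
    docs = ["too short", "this one is long enough to keep"]
    print(drop_short_documents(docs, min_tokens=5))
```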

English Web 2013 (enTenTen13)

  • 19 billion words

English Web 2012 (enTenTen12)

version 1 (14 June 2012)

  • sample of corpus – 3.7 billion words
  • crawled by SpiderLing in May 2012
  • encoded in UTF-8

version 2 (2012)

  • full corpus – 11 billion words

English Web 2008 (enTenTen08)

version 1 (15 November 2010)

  • initial version – 3.3 billion tokens
  • crawled by Heritrix in 2008
  • encoded in Latin1
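Because enTenTen08 was encoded in Latin1 while enTenTen12 was encoded in UTF-8, text taken from both would need a common encoding before being compared. A minimal sketch of such a conversion in Python follows; the function name `latin1_to_utf8` is an assumption for this example, and nothing here describes how Sketch Engine handles encodings internally.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin1 (ISO-8859-1) bytes as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    # b"caf\xe9" is "café" in Latin1; in UTF-8 the last letter takes two bytes.
    print(latin1_to_utf8(b"caf\xe9"))
```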

Bibliography

Kilgarriff, A. (2012, September). Getting to know your corpus. In International Conference on Text, Speech and Dialogue (pp. 3-15). Springer Berlin Heidelberg.

Search the English Web corpus

Sketch Engine offers a range of tools to work with the English Web corpus.
