English Corpus for SkELL is a text corpus built specifically for the English SkELL interface, available at http://skell.sketchengine.co.uk/run.cgi/skell. The corpus does not contain whole documents, only individual sentences sorted by a text-quality score computed by the GDEX system.

The corpus is made up of Wikipedia articles, selected parts of the English Web 2013 corpus and the Timestamped web corpus, the British National Corpus, and English websites crawled by the WebBootCat tool. Together, these sources illustrate how English is used in everyday, standard, formal and professional contexts, in over 1 billion words and more than 57 million sentences.


Source                     No. of words       Percentage
Wikipedia                   403,715,131           38.73%
English Web 2013            321,366,791           30.83%
Timestamped web corpus      149,264,286           14.32%
British National Corpus      90,390,293            8.67%
WebBootCat                   77,532,968            7.43%
Total                   ∼ 1,042,269,610           ∼ 100%
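
The percentage column can be recomputed from the word counts. The following is a minimal Python sketch of that arithmetic, not part of Sketch Engine; the names `word_counts` and `shares` are illustrative assumptions. The listed counts sum to 1,042,269,469, slightly below the approximate total given in the table, and the last digit of a share may differ from the table depending on rounding.

```python
# Word counts copied from the source table.
word_counts = {
    "Wikipedia": 403_715_131,
    "English Web 2013": 321_366_791,
    "Timestamped web corpus": 149_264_286,
    "British National Corpus": 90_390_293,
    "WebBootCat": 77_532_968,
}


def shares(counts):
    """Return each source's share of the summed word count, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(f"Sum of listed counts: {sum(word_counts.values()):,}")
    for name, pct in shares(word_counts).items():
        print(f"{name}: {pct:.2f}%")
```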


The corpus is accessible to all users with a subscription plan and to site licence members; it is not available to trial users.

The complete set of Sketch Engine tools can be used with the English Corpus for SkELL to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context


3.1 first published version
3.2 minor changes to the GDEX formula
3.3 removed the first several sentences with wrong encoding
3.4 removed all Project Gutenberg books because of their very old language
3.5 removed sentences with spelling errors
3.6 removed sentences containing hapax legomena (words with only one occurrence in the corpus)
3.7 new tokenization and tagging
3.8 U+FFFD Unicode replacement characters removed or replaced appropriately
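
The cleaning steps in versions 3.6 and 3.8 can be illustrated with a small Python sketch. This is only an assumption about how such filtering might look, not the procedure actually used to build the corpus: the whitespace tokenizer and the function names `tokenize`, `strip_replacement_chars` and `drop_hapax_sentences` are hypothetical, and the real corpus uses its own tokenization and tagging.

```python
from collections import Counter

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character


def tokenize(sentence):
    # Hypothetical stand-in tokenizer: lowercase and split on whitespace.
    return sentence.lower().split()


def strip_replacement_chars(sentence):
    # Simplification: delete U+FFFD outright instead of restoring
    # the character that was lost.
    return sentence.replace(REPLACEMENT_CHAR, "")


def drop_hapax_sentences(sentences):
    """Drop every sentence containing a word that occurs only once overall."""
    freq = Counter(tok for s in sentences for tok in tokenize(s))
    return [s for s in sentences if all(freq[tok] > 1 for tok in tokenize(s))]


if __name__ == "__main__":
    sample = ["the cat sat", "the cat ran", "the dog sat", "a cat ran"]
    cleaned = [strip_replacement_chars(s) for s in sample]
    print(drop_hapax_sentences(cleaned))
```

The frequencies are counted once, before any sentence is removed. Dropping sentences can turn further words into hapax legomena, so a stricter variant would repeat the pass until nothing changes.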


