English Corpus for SkELL

English Corpus for SkELL is the text database behind the English SkELL interface, available at http://skell.sketchengine.co.uk/run.cgi/skell. The corpus does not contain whole documents, only sentences sorted according to their text quality. The text quality score was computed by the GDEX system and assigned to each sentence.

The corpus is made up of Wikipedia articles, selected parts of the English Web 2013 corpus and the Timestamped web corpus, material from the British National Corpus, and English websites gathered with the WebBootCat tool. These sources provide a good example of how English is used in everyday, standard, formal and professional contexts. The corpus contains over 1 billion words in more than 57 million sentences.

Statistics

Source                     no. of documents    no. of words       percentage
Wikipedia                  22,195,679          ~ 403,715,131      38.73%
English Web 2013           17,791,970          ~ 321,366,791      30.83%
Timestamped web corpus     7,561,747           ~ 149,264,286      14.32%
British National Corpus    5,818,343           ~ 90,390,293       8.67%
WebBootCat                 3,814,225           ~ 77,532,968       7.43%
Total                      57,181,964          ~ 1,042,269,610    100.00%
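
The percentage column follows the word counts rather than the document counts. A minimal sketch of that arithmetic, using the figures from the table above, is given here; the names `WORD_COUNTS`, `TOTAL_WORDS` and `word_shares` are illustrative and not part of any Sketch Engine tool. The sketch rounds to two decimal places, whereas the published figures appear to be truncated, so the last digit can differ (WebBootCat comes out at 7.44 here).

```python
# Word counts per source, copied from the statistics table.
WORD_COUNTS = {
    "Wikipedia": 403_715_131,
    "English Web 2013": 321_366_791,
    "Timestamped web corpus": 149_264_286,
    "British National Corpus": 90_390_293,
    "WebBootCat": 77_532_968,
}

# Total number of words as stated in the table (the counts are approximate).
TOTAL_WORDS = 1_042_269_610


def word_shares(counts, total):
    """Return each source's share of the total in percent, rounded to 2 places."""
    return {source: round(100 * n / total, 2) for source, n in counts.items()}


shares = word_shares(WORD_COUNTS, TOTAL_WORDS)
for source, pct in shares.items():
    print(f"{source}: {pct:.2f}%")
```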

Availability

The corpus is accessible to all users with a subscription plan and to site licence members. It is not available to trial users.

Changelog

VERSION    DESCRIPTION
3.1        first published version
3.2        minor changes to GDEX formula
3.3        removed first several sentences with wrong encoding
3.4        removed all Project Gutenberg books because of very old language
3.5        removed sentences with spelling errors
3.6        removed sentences containing hapax legomena (words with only one occurrence in the corpus)
3.7        new tokenization and tagging
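
The hapax legomena filter described for version 3.6 can be illustrated with a short sketch. This is not the procedure actually used to build the corpus, only one plausible reading of the description. It assumes sentences are already tokenized, compares words case-insensitively, and makes a single pass (removing sentences can create new hapax legomena, which this sketch does not chase). The function name `remove_hapax_sentences` is hypothetical.

```python
from collections import Counter


def remove_hapax_sentences(sentences):
    """Drop every sentence containing a word that occurs only once overall.

    `sentences` is a list of token lists. Tokens are lowercased before
    counting (an assumption), and the filter runs as a single pass.
    """
    counts = Counter(
        token.lower() for sentence in sentences for token in sentence
    )
    return [
        sentence
        for sentence in sentences
        if all(counts[token.lower()] > 1 for token in sentence)
    ]


# Toy data: "slept" and "a" each occur once, so their sentences are dropped.
sample = [
    ["the", "cat", "sat"],
    ["the", "cat", "slept"],
    ["a", "cat", "sat"],
    ["the", "cat", "sat"],
]
kept = remove_hapax_sentences(sample)
print(kept)
```

On a corpus of this size the counting step would need to be streamed or distributed; the in-memory `Counter` is used here only to keep the idea visible.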

Figure: English SkELL Corpus, distribution of text sources: Wikipedia (38.73 %), English Web 2013 (30.83 %), Timestamped web corpus (14.32 %), British National Corpus (8.67 %), English websites by WebBootCat (7.43 %).