English Corpus for SkELL

English Corpus for SkELL is a text corpus built specifically for the English SkELL interface, available at skell.sketchengine.eu. The corpus does not contain whole documents, only individual sentences sorted by a text-quality score computed by the GDEX system.

The corpus is made up of Wikipedia articles, selected parts of the English Web 2013 corpus and the Timestamped web corpus, English websites crawled by the WebBootCat tool and, as the statistics below show, material from the British National Corpus. These sources illustrate how English is used in everyday, standard, formal and professional contexts. Together they amount to over 1 billion words in more than 57 million sentences.

Statistics

Source	No. of words	Percentage
Wikipedia	∼ 403,653,953	∼ 38.88%
English Web 2013	∼ 320,892,066	∼ 30.91%
Timestamped web corpus	∼ 146,082,464	∼ 14.07%
British National Corpus	∼ 90,379,212	∼ 8.71%
WebBootCat	∼ 77,192,480	∼ 7.44%
Total	1,038,200,313	100%
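
The percentages in the table can be reproduced from the word counts. The Python sketch below is only an illustration: the names SOURCES, TOTAL and share are ours, and the counts are copied from the table. Because the per-source counts are approximate, they do not add up exactly to the stated total.

```python
# Word counts copied from the statistics table (approximate values).
SOURCES = {
    "Wikipedia": 403_653_953,
    "English Web 2013": 320_892_066,
    "Timestamped web corpus": 146_082_464,
    "British National Corpus": 90_379_212,
    "WebBootCat": 77_192_480,
}
TOTAL = 1_038_200_313  # total stated in the table


def share(words: int, total: int = TOTAL) -> float:
    """Return the percentage of the corpus contributed by one source."""
    return round(100 * words / total, 2)


# Percentage share of each source, rounded to two decimal places.
shares = {name: share(count) for name, count in SOURCES.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```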

Availability

The corpus is accessible to all users with a subscription plan and to site licence members. It is not available to trial users.

Tools to work with the English SKELL corpus

A complete set of tools is available for working with the English Corpus for SkELL. They can be used to generate:

word sketch – English collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus

Changelog

VERSION	DESCRIPTION
3.1	first published version
3.2	minor changes to GDEX formula
3.3	removed the first several sentences with wrong encoding
3.4	removed all Project Gutenberg books because of very old language
3.5	removed sentences with spelling errors
3.6	removed sentences containing hapax legomena = words with only one occurrence in the corpus
3.7	new tokenization and tagging
3.8	U+FFFD (Unicode replacement character) symbols removed or replaced appropriately
3.9	U+01CE and U+01CD Unicode symbols removed
3.10	removed sentences with contracted n’t (due to wrong tags and lemmas)
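
Version 3.6 removed sentences containing hapax legomena. The Python sketch below shows one way such a filter might work. It assumes lower-cased, whitespace-separated tokens, and the function name drop_hapax_sentences is hypothetical. The tokenizer and pipeline actually used to build the corpus are not described on this page.

```python
from collections import Counter


def drop_hapax_sentences(sentences: list[str]) -> list[str]:
    """Keep only sentences in which every token occurs at least twice corpus-wide.

    Assumption: tokens are lower-cased, whitespace-separated words.
    """
    tokenized = [s.lower().split() for s in sentences]
    # Count every token across the whole (toy) corpus.
    freq = Counter(tok for toks in tokenized for tok in toks)
    return [
        s for s, toks in zip(sentences, tokenized)
        if all(freq[tok] > 1 for tok in toks)
    ]


corpus = [
    "the cat sat",
    "the cat ran",
    "the cat sat quietly",  # "quietly" occurs only once
    "the dog ran",          # "dog" occurs only once
    "cat ran",
]
kept = drop_hapax_sentences(corpus)
print(kept)
```

A single pass like this can leave new hapax legomena behind, because removing a sentence lowers the counts of its other words. Whether the corpus build repeated the step is not stated.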

Bibliography

Baisa, V., & Suchomel, V. (2014, December). SKELL – Web Interface for English Language Learning. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (pp. 63-70).
