Corpus of English Wikipedia

The English Wikipedia corpus is a text corpus created from the English-language internet encyclopedia Wikipedia in 2014. The corpus was built from a Wikipedia dump from the second half of September 2014. The XML structure was converted using WikiExtractor.py. The corpus contains 1.3 billion words, and the texts are lemmatized and morphologically analyzed.

Part-of-speech tagset

The corpus was POS-tagged with TreeTagger using the Penn Treebank tagset.
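As a rough illustration of what tagged and lemmatized text can look like, the sketch below parses a few lines in the tab-separated token, tag, lemma layout that TreeTagger commonly produces and counts noun lemmas. The sample lines and the function name `noun_lemma_counts` are illustrative assumptions, not data or code taken from the corpus or from Sketch Engine.

```python
from collections import Counter

# Hypothetical sample in a token<TAB>tag<TAB>lemma layout (not real corpus data).
SAMPLE = (
    "The\tDT\tthe\n"
    "corpora\tNNS\tcorpus\n"
    "contain\tVBP\tcontain\n"
    "many\tJJ\tmany\n"
    "words\tNNS\tword\n"
    "and\tCC\tand\n"
    "a\tDT\ta\n"
    "corpus\tNN\tcorpus\n"
)


def noun_lemma_counts(tagged_text):
    """Count lemmas whose tag starts with NN (Penn Treebank noun tags)."""
    counts = Counter()
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _token, tag, lemma = parts
        if tag.startswith("NN"):
            counts[lemma.lower()] += 1
    return counts


counts = noun_lemma_counts(SAMPLE)
```

Because counting is done on lemmas rather than surface forms, "corpora" and "corpus" fall under the same entry.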

A complete set of Sketch Engine tools is available to work with this English Wikipedia corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
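To make two of the items above more concrete, here is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance over a list of tokens. It is a simplified illustration under stated assumptions, not how Sketch Engine implements these tools; the function names `ngram_counts` and `concordance` and the sample sentence are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Return a Counter of all contiguous n-token sequences."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, split on whitespace for simplicity.
tokens = "the free encyclopedia that anyone can edit is the free encyclopedia".split()
bigrams = ngram_counts(tokens, 2)
kwic = concordance(tokens, "encyclopedia", window=2)
```

Sorting the n-gram counts, for example with `bigrams.most_common(10)`, gives a frequency list of the most common multi-word units.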

Other Wikipedia corpora in Sketch Engine

The Sketch Engine team can build a made-to-order Wikipedia corpus for any language. Please email us at inquiries@sketchengine.co.uk if you are interested.

Search the English Wikipedia corpus

Sketch Engine offers a range of tools to work with the English Wikipedia corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.