ukWaC – British English corpus from the .uk domain

ukWaC is a corpus of British English texts collected from the .uk domain, using medium-frequency words from the British National Corpus as seed words. Together, these two facts support the claim that it is mainly a corpus of British English, although other varieties are likely to be included wherever they appeared on a .uk domain.

The corpus was prepared by Adriano Ferraresi, and the word sketches, which make it possible to explore the grammatical relations of words, were prepared by David Tugwell. The preparation of the corpus is described in full in "Introducing and evaluating ukWaC, a very large web-derived corpus of English" (LREC conference, 2008).

Sketch Engine provides access to the version of ukWaC tagged with SuperSenseTagger (sst-light) described in Ciaramita and Altun (2006).

Part-of-speech tagset

The corpus was part-of-speech tagged and lemmatized using TreeTagger, a leading part-of-speech tagger that has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
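
To make the idea of a tagged and lemmatized corpus concrete, the sketch below reads a few lines in the tab-separated token, tag, lemma layout that TreeTagger typically prints. The sample lines, the `Token` type and the helper `parse_tagged` are assumptions made for illustration; they are not taken from ukWaC or from Sketch Engine.

```python
from collections import Counter
from typing import List, NamedTuple


class Token(NamedTuple):
    """One corpus position: surface form, part-of-speech tag, lemma."""
    word: str
    tag: str
    lemma: str


# Hypothetical sample in a token<TAB>tag<TAB>lemma layout,
# using Penn Treebank-style tags (DT, JJ, NNS, IN).
SAMPLE = (
    "The\tDT\tthe\n"
    "large\tJJ\tlarge\n"
    "corpora\tNNS\tcorpus\n"
    "of\tIN\tof\n"
    "texts\tNNS\ttext\n"
)


def parse_tagged(text: str) -> List[Token]:
    """Parse tab-separated lines into Token records, skipping malformed lines."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # ignore blank lines or lines without exactly three fields
        tokens.append(Token(*parts))
    return tokens


tokens = parse_tagged(SAMPLE)
# Count how often each part-of-speech tag occurs in the sample.
tag_counts = Counter(t.tag for t in tokens)
```

Keeping the lemma next to each token is what lets a search for "corpus" also match the plural form "corpora".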

A complete set of Sketch Engine tools is available for working with the ukWaC corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
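
As a rough illustration of two of the outputs listed above, the sketch below builds an n-gram frequency list and a simple keyword-in-context concordance from a toy sentence. It is a minimal sketch and not how Sketch Engine computes these results; the function names `ngram_counts` and `kwic` and the sample text are assumptions introduced here.

```python
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every contiguous sequence of n tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def kwic(tokens: List[str], keyword: str, window: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical toy text, split on whitespace only.
TEXT = "the corpus was built from the web and the corpus was tagged"
TOKENS = TEXT.split()

bigrams = ngram_counts(TOKENS, 2)
hits = kwic(TOKENS, "corpus", window=2)

# Print the two most frequent bigrams, then the concordance lines.
for gram, freq in bigrams.most_common(2):
    print(" ".join(gram), freq)
for left, key, right in hits:
    print(f"{left:>10} [{key}] {right}")
```

A real corpus tool would run over tokenized, tagged text and an index rather than a whitespace split, but the shape of the output is the same: ranked multi-word units and examples shown in context.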

Bibliography

BARONI, Marco, et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 2009, 43.3: 209–226.

FERRARESI, Adriano, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google? 2008, pp. 47–54.

CIARAMITA, Massimiliano; ALTUN, Yasemin. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 594–602.
