Search the ukWaC British English corpus

The corpus was prepared by Adriano Ferraresi. The whole process is described in the paper "Introducing and evaluating ukWaC, a very large web-derived corpus of English", presented at LREC 2008.

All material is taken from the .uk domain, so it is fair to describe ukWaC as a corpus of mainly British English. Other varieties of English are likely to be included wherever they appeared on a .uk site.

The corpus was part-of-speech tagged and lemmatised using TreeTagger, a widely used part-of-speech tagger that has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
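To illustrate what part-of-speech tagged and lemmatised text looks like, the sketch below parses tagger output in a three-column layout (word, tag, lemma, separated by tabs, one token per line), which is the layout TreeTagger is commonly run with. This is only an assumption about the format for illustration. The sample lines are hand-written and not taken from ukWaC, and the names Token and parse_tagged_lines are hypothetical.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str    # part-of-speech tag, Penn Treebank style
    lemma: str  # dictionary form of the word


def parse_tagged_lines(text: str) -> list[Token]:
    """Parse tab-separated word/tag/lemma lines into Token records."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a token row
        tokens.append(Token(*parts))
    return tokens


# Hypothetical sample, written by hand for illustration (not from ukWaC).
SAMPLE = "The\tDT\tthe\nsmall\tJJ\tsmall\ncats\tNNS\tcat\n"

tokens = parse_tagged_lines(SAMPLE)
lemmas = [t.lemma for t in tokens]
```

Here the plural noun "cats" carries the tag NNS and the lemma "cat", which is what allows a search for a lemma to match all of its inflected forms.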

Grammatical relation definitions, as prepared by David Tugwell for other English corpora, were used.

Sketch Engine also has a version of ukWaC tagged with SuperSenseTagger (sst-light) described in Ciaramita and Altun (2006).

Search the ukWaC corpus

Sketch Engine offers a range of tools to work with the ukWaC corpus.



FERRARESI, Adriano, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google? 2008, pp. 47–54.

CIARAMITA, Massimiliano; ALTUN, Yasemin. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 594–602.
