The corpus was prepared by Adriano Ferraresi. The whole process is described in the paper "Introducing and evaluating ukWaC, a very large web-derived corpus of English", published at LREC 2008.

All material is taken from the .uk domain, so it is reasonable to treat ukWaC as a corpus of mainly British English. Other varieties of English are likely to be included wherever they appeared on a .uk site.

The corpus was part-of-speech tagged and lemmatised using TreeTagger, a widely used part-of-speech tagger that has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
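To illustrate what tagging and lemmatisation add to each token, here is a minimal sketch that reads TreeTagger-style output, where each line holds a word form, its tag, and its lemma separated by tabs. The sample text, the `Token` type, and the `parse_tagged_lines` helper are hypothetical and written by hand for illustration; they are not taken from ukWaC or from Sketch Engine.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One tagged token: surface form, part-of-speech tag, lemma."""
    word: str
    tag: str
    lemma: str


def parse_tagged_lines(text: str) -> List[Token]:
    """Parse tab-separated 'word<TAB>tag<TAB>lemma' lines into tokens.

    Lines that do not have exactly three fields (for example blank
    lines or structural markup) are skipped.
    """
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Hand-written sample in the assumed three-column format (not real corpus data).
SAMPLE = "The\tDT\tthe\ncats\tNNS\tcat\nare\tVBP\tbe\nlarge\tJJ\tlarge\n"

tokens = parse_tagged_lines(SAMPLE)

# Penn Treebank noun tags all start with "NN", so a prefix test selects nouns.
noun_lemmas = [t.lemma for t in tokens if t.tag.startswith("NN")]
```

Because every token carries both a tag and a lemma, a search can target, say, all noun uses of a lemma regardless of inflection, which is what makes this kind of annotation useful for corpus queries.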

Grammatical relation definitions, as prepared by David Tugwell for other English corpora, were used.

Sketch Engine also has a version of ukWaC tagged with SuperSenseTagger (sst-light), which is described in Ciaramita and Altun (2006).

Sketch Engine offers a range of tools to work with the ukWaC corpus.



