The corpus was prepared by Adriano Ferraresi. The full construction process is described in the paper "Introducing and evaluating ukWaC, a very large web-derived corpus of English", presented at LREC 2008 (Ferraresi et al., 2008).

All material is taken from the .uk domain. It was part-of-speech tagged and lemmatised using TreeTagger, a leading part-of-speech tagger that has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
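
To illustrate what tagged and lemmatised text looks like, here is a minimal sketch that reads token lines in the tab-separated layout TreeTagger commonly produces (word form, tag, lemma). The sample tokens, the `Token` type and the `parse_tagged_lines` helper are invented for illustration; they are not taken from ukWaC or from any Sketch Engine tool, and the actual corpus files may carry additional columns or markup.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One annotated token: surface form, part-of-speech tag, lemma."""
    word: str
    tag: str
    lemma: str


def parse_tagged_lines(text: str) -> List[Token]:
    """Parse lines of the form 'word<TAB>tag<TAB>lemma'.

    Lines that do not have exactly three tab-separated fields
    (blank lines, structural markup) are skipped.
    """
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Hypothetical sample, not real corpus data.
SAMPLE = "The\tDT\tthe\ncorpora\tNNS\tcorpus\nwere\tVBD\tbe\nlarge\tJJ\tlarge\n"

if __name__ == "__main__":
    for tok in parse_tagged_lines(SAMPLE):
        print(f"{tok.word:10} {tok.tag:5} {tok.lemma}")
```

Each token keeps its surface form alongside the tag and lemma, which is what makes it possible to search by lemma ("corpus") and still retrieve inflected forms such as "corpora".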

The grammatical relation definitions are those prepared by David Tugwell for other English corpora.

Sketch Engine also contains a version of ukWaC tagged with SuperSenseTagger (sst-light), described in Ciaramita and Altun (2006).


Bibliography

FERRARESI, Adriano, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google? 2008, pp. 47–54.

CIARAMITA, Massimiliano; ALTUN, Yasemin. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 594–602.