The corpus was prepared by Adriano Ferraresi. The whole process is described in the paper "Introducing and evaluating ukWaC, a very large web-derived corpus of English", presented at the WAC-4 workshop at LREC 2008.
All material is taken from the .uk domain. The corpus was part-of-speech tagged and lemmatised using TreeTagger, a leading part-of-speech tagger that has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
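To illustrate what tagged and lemmatised text looks like in practice, here is a minimal Python sketch that reads tab-separated word/tag/lemma lines, the column layout TreeTagger typically prints, and counts the tags. The `Token` type, the `parse_tagged_lines` helper and the sample lines are hypothetical and for illustration only; they are not taken from ukWaC or from its distribution format.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str    # Penn Treebank-style tag, e.g. DT, NNS, VBP
    lemma: str


def parse_tagged_lines(lines):
    """Parse tab-separated word/tag/lemma lines.

    Lines that do not have exactly three columns (blank lines,
    structural markup such as <s>) are skipped.
    """
    tokens = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Illustrative sample only; not actual ukWaC data.
SAMPLE = [
    "<s>",
    "The\tDT\tthe",
    "cats\tNNS\tcat",
    "sleep\tVBP\tsleep",
    "</s>",
]

tokens = parse_tagged_lines(SAMPLE)
tag_counts = Counter(token.tag for token in tokens)

for token in tokens:
    print(f"{token.word:<8}{token.tag:<6}{token.lemma}")
```

Keeping the lemma next to each word form is what allows different inflected forms, such as "cats" and "cat", to be grouped under one entry.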
The grammatical relation definitions are those prepared by David Tugwell for other English corpora.
FERRARESI, Adriano, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google? 2008, pp. 47–54.
CIARAMITA, Massimiliano; ALTUN, Yasemin. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 594–602.