The corpus was prepared using the Corpus Factory method. Full details are given in Kilgarriff et al. (LREC 2010).

Changelog

v 2.0 (5 May 2010)

Fixed tokenisation problems (the standard tokenisation program, unitok.py, is used).