This corpus was collected using Cantonese-only seed words and the Corpus Factory method. We hope the corpus represents Cantonese-only text, without mixing in other Chinese variants. Full details are described in A Corpus Factory for Many Languages (Kilgarriff et al. 2010).
Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A Corpus Factory for Many Languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.