VOLE (Varieties of Learner English) is a corpus gathered in 2010 to explore different varieties of English. The crawl used the BootCaT process. We generated a set of triples of mid-frequency English words and sent each triple to a search engine seven times, once for each of seven geographic constraints: UK, US, Canada, Australia, China, Japan and Korea.
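
The query-generation step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to build VOLE: the seed words, the number of triples and the function names `make_triples` and `build_queries` are all hypothetical, and the actual search-engine call (and how the geographic constraint was expressed) is left out because the description does not specify it.

```python
import itertools
import random

# Region labels taken from the description above.
REGIONS = ["UK", "US", "Canada", "Australia", "China", "Japan", "Korea"]


def make_triples(seed_words, n_triples, rng):
    """Draw up to n_triples distinct random 3-word combinations."""
    # Deduplicate and sort so the result depends only on the seed set and rng.
    vocab = sorted(set(seed_words))
    all_triples = list(itertools.combinations(vocab, 3))
    return rng.sample(all_triples, min(n_triples, len(all_triples)))


def build_queries(triples, regions=REGIONS):
    """Pair every triple with every region: one query per (triple, region)."""
    return [(" ".join(triple), region) for triple in triples for region in regions]


# Hypothetical seed list standing in for the mid-frequency English words.
seeds = ["garden", "ticket", "weather", "kitchen", "holiday", "bridge"]
triples = make_triples(seeds, 4, random.Random(0))
queries = build_queries(triples)

for text, region in queries[:7]:
    print(region, "|", text)
```

Each triple yields exactly seven queries, one per region, so the regional subcorpora are seeded with the same search terms and differ only in the geographic constraint.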

The original goal was to find distinctive patterns of use among Chinese, Japanese and Korean learners of English, following work by Tetreault and Chodorow (2009). In that respect the corpus was a failure: our CJK (Chinese, Japanese, Korean) corpora did not appear to contain a dominant component of learner English. However, the dataset did show interesting differences between the regional native-speaker varieties. The top keywords for Australian English, for example, included bloke, whinge, footy and kangaroo.


Reference

Joel Tetreault and Martin Chodorow. 2009. Examining the Use of Region Web Counts for ESL Error Detection. In Proc. WAC-5, San Sebastian, Spain, pp. 71–78.