Simple maths is a method for identifying the keywords of one corpus relative to another. It includes a parameter that lets the user shift the focus towards either higher-frequency or lower-frequency words.
The statistic we use for keywords is a variant on “word W is so-and-so times more frequent in corpus X than in corpus Y”. The keyness score of a word is calculated according to the following formula:
$$\mathrm{score} = \frac{\mathit{fpm}_{\mathrm{focus}} + N}{\mathit{fpm}_{\mathrm{ref}} + N}$$
where $\mathit{fpm}_{\mathrm{focus}}$ is the normalized (per million) frequency of the word in the focus corpus, $\mathit{fpm}_{\mathrm{ref}}$ is the normalized (per million) frequency of the word in the reference corpus, and $N$ is the so-called smoothing parameter ($N = 1$ is the default value).
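The effect of the smoothing parameter can be illustrated with a minimal Python sketch. It assumes the score is the ratio of smoothed per-million frequencies, (fpm_focus + N) / (fpm_ref + N), as described in the Kilgarriff paper cited below. The function name `keyness_score` and the two sets of word frequencies are hypothetical, chosen only for illustration.

```python
def keyness_score(fpm_focus: float, fpm_ref: float, n: float = 1.0) -> float:
    """Simple-maths keyness: ratio of smoothed per-million frequencies.

    `n` is the smoothing parameter; 1.0 is assumed as the default.
    """
    return (fpm_focus + n) / (fpm_ref + n)


# Hypothetical per-million frequencies (focus, reference), not real corpus data.
rare_word = (10.0, 1.0)          # 10 times more frequent in the focus corpus
common_word = (2000.0, 1000.0)   # 2 times more frequent in the focus corpus

for n in (1.0, 1000.0):
    rare = keyness_score(*rare_word, n=n)
    common = keyness_score(*common_word, n=n)
    print(f"N={n:g}: rare={rare:.3f}, common={common:.3f}")
```

With a small N the rare word scores higher than the common one; with a large N the order is reversed. In this sketch, raising N moves the keyword list towards higher-frequency words.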
Example for the lemma “shard”:
Focus corpus (BNC): 112,289,776 tokens
Frequency of the lemma “shard” in the focus corpus: 35
Reference corpus (ukWaC): 1,559,716,979 tokens
Frequency of the lemma “shard” in the reference corpus: 263
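The figures above can be worked through by hand. The following is a sketch of the arithmetic, assuming the default smoothing parameter of 1 and a score defined as the ratio of smoothed per-million frequencies. The symbol names are a notational assumption, and intermediate values are rounded to four decimal places.

```latex
\begin{align*}
\mathit{fpm}_{\mathrm{focus}} &= \frac{35 \times 1{,}000{,}000}{112{,}289{,}776} \approx 0.3117 \\
\mathit{fpm}_{\mathrm{ref}} &= \frac{263 \times 1{,}000{,}000}{1{,}559{,}716{,}979} \approx 0.1686 \\
\mathrm{score} &= \frac{0.3117 + 1}{0.1686 + 1} \approx 1.12
\end{align*}
```

A score this close to 1 means that, at this smoothing setting, the lemma is only marginally more characteristic of the focus corpus than of the reference corpus.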
For more details see:
Adam Kilgarriff. Simple maths for keywords. In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.
Statistic used in Sketch Engine (Chapter 5). Lexical Computing Ltd., 8 July 2015.