Simple math is a straightforward method for identifying the keywords of one corpus compared with another. It includes a parameter that lets the user shift the focus towards higher-frequency or lower-frequency words.

The statistic we use for keywords is a variant on ‘word W is N times as frequent in subcorpus X vs subcorpus Y’. The keyness score of a word is calculated according to the following formula:

\frac{fpm_{\rm focus} + n}{fpm_{\rm ref} + n}

where fpm_{\rm focus} is the normalized (per million) frequency of the word in the focus corpus, fpm_{\rm ref} is its normalized (per million) frequency in the reference corpus, and n is the simple math (smoothing) parameter (the default value is n = 1). Higher values of n favour higher-frequency (more common) words; lower values favour rarer words.

Example

Your focus corpus (BNC): 112,289,776 tokens
Frequency of the lemma (shard) in the corpus: 35

(relative frequency: \frac{number~of~hits~\cdot~1,000,000}{corpus~size}, that is, \frac{35 \cdot 1,000,000}{112,289,776} \approx 0.3117) [fpm_{\rm focus}]

 

Chosen reference corpus (ukWaC): 1,559,716,979 tokens
Frequency of the lemma (shard) in the corpus: 263

(relative frequency: \frac{number~of~hits~\cdot~1,000,000}{corpus~size}, that is, \frac{263 \cdot 1,000,000}{1,559,716,979} \approx 0.1686) [fpm_{\rm ref}]

 

Selected value of the simple math parameter: 1 [n]

 

\frac{fpm_{\rm focus} + n}{fpm_{\rm ref} + n} that is \frac{0.3117 + 1}{0.1686 + 1} = 1.1224
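
The calculation above can be reproduced with a few lines of Python. This is only an illustrative sketch, not the code used by Sketch Engine: the function names `per_million` and `keyness` are hypothetical, and the check that n is positive is an assumption added here to avoid dividing by zero when a word is absent from the reference corpus.

```python
def per_million(hits: int, corpus_size: int) -> float:
    """Normalized frequency: number of hits per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size


def keyness(fpm_focus: float, fpm_ref: float, n: float = 1.0) -> float:
    """Simple math keyness score: (fpm_focus + n) / (fpm_ref + n)."""
    if n <= 0:
        # Assumption of this sketch: n > 0 keeps the denominator non-zero.
        raise ValueError("n must be positive")
    return (fpm_focus + n) / (fpm_ref + n)


# Figures taken from the worked example above.
fpm_focus = per_million(35, 112_289_776)      # focus corpus (BNC)
fpm_ref = per_million(263, 1_559_716_979)     # reference corpus (ukWaC)

score = keyness(fpm_focus, fpm_ref, n=1)
print(f"{fpm_focus:.4f} {fpm_ref:.4f} {score:.4f}")  # 0.3117 0.1686 1.1224

# The same word scored with different values of n.
for n in (0.001, 1, 100):
    print(n, round(keyness(fpm_focus, fpm_ref, n), 4))
```

For a low-frequency word such as this one, a large n dominates both the numerator and the denominator and pulls the score towards 1, while a very small n lets the raw frequency ratio show through.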

 


For more details see:

Adam Kilgarriff. Simple maths for keywords. In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.

Statistics used in the Sketch Engine (Chapter 5). Lexical Computing Ltd., 8 July 2015.