Simple maths is a method for identifying the keywords of one corpus compared with another. It includes a variable that lets the user shift the focus towards either higher-frequency or lower-frequency words.

With this method, users can find keywords in texts they have uploaded to Sketch Engine, a tool for exploring how language works. Its algorithms analyze authentic texts of billions of words (text corpora) to identify what is typical in language and what is rare, unusual or emerging usage.

The statistic we use for keywords is a variant of “word W is so-and-so times more frequent in corpus X than in corpus Y”. The keyness score of a word is calculated according to the following formula:

\frac{fpm_{\rm focus} + N}{fpm_{\rm ref} + N}

where fpm_{\rm focus} is the normalized (per million) frequency of the word in the focus corpus, fpm_{\rm ref} is the normalized (per million) frequency of the word in the reference corpus, and N is the so-called smoothing parameter (N = 1 is the default value).
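
The formula can be sketched in a few lines of Python. This is a minimal illustration, not Sketch Engine's implementation: the function name `keyness_score`, the requirement that N be positive, and the two example words with their per-million frequencies are all assumptions made here, chosen only to show how N shifts the ranking between a rare word and a common word.

```python
def keyness_score(fpm_focus: float, fpm_ref: float, n: float = 1.0) -> float:
    """Simple maths keyness: (fpm_focus + N) / (fpm_ref + N)."""
    # Assumption of this sketch: N > 0, so the ratio stays defined
    # even when the word never occurs in the reference corpus.
    if n <= 0:
        raise ValueError("the smoothing parameter N must be positive")
    return (fpm_focus + n) / (fpm_ref + n)


# Hypothetical per-million frequencies (focus, reference), for illustration only.
rare_word = (2.0, 0.1)
common_word = (2000.0, 1000.0)

for n in (0.1, 1.0, 1000.0):
    rare = keyness_score(*rare_word, n=n)
    common = keyness_score(*common_word, n=n)
    top = "rare" if rare > common else "common"
    print(f"N={n:g}: rare={rare:.4f} common={common:.4f} -> {top} word ranks higher")
```

With these made-up figures, the rare word scores higher for small N and the common word scores higher for large N, which is how the parameter turns the focus towards lower- or higher-frequency words.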

Example

Your focus corpus (BNC): 112,289,776 tokens
Frequency of the lemma (shard) in the corpus: 35

Relative frequency fpm_{\rm focus} = \frac{number~of~hits~\cdot~1,000,000}{corpus~size} = \frac{35 \cdot 1,000,000}{112,289,776} = 0.3117

Selected reference corpus (ukWaC): 1,559,716,979 tokens
Frequency of the lemma (shard) in the corpus: 263

Relative frequency fpm_{\rm ref} = \frac{number~of~hits~\cdot~1,000,000}{corpus~size} = \frac{263 \cdot 1,000,000}{1,559,716,979} = 0.1686

Score = \frac{fpm_{\rm focus} + N}{fpm_{\rm ref} + N} = \frac{0.3117 + 1}{0.1686 + 1} = 1.1224
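
The arithmetic of this example can be checked with a short Python sketch. The helper names `per_million` and `keyness_score` are illustrative and not part of any Sketch Engine API; the corpus sizes and hit counts are the ones given above.

```python
def per_million(hits: int, corpus_size: int) -> float:
    """Normalized frequency: number of hits per million tokens."""
    return hits * 1_000_000 / corpus_size


def keyness_score(fpm_focus: float, fpm_ref: float, n: float = 1.0) -> float:
    """Simple maths keyness: (fpm_focus + N) / (fpm_ref + N)."""
    return (fpm_focus + n) / (fpm_ref + n)


fpm_focus = per_million(35, 112_289_776)     # "shard" in the focus corpus (BNC)
fpm_ref = per_million(263, 1_559_716_979)    # "shard" in the reference corpus (ukWaC)
score = keyness_score(fpm_focus, fpm_ref)    # default smoothing, N = 1

print(f"fpm_focus = {fpm_focus:.4f}")
print(f"fpm_ref   = {fpm_ref:.4f}")
print(f"score     = {score:.4f}")
```

Rounded to four decimal places, the printed values agree with the figures in the example (0.3117, 0.1686 and 1.1224).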


For more details see:

Adam Kilgarriff. Simple maths for keywords. In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.

Statistic used in Sketch Engine (Chapter 5). Lexical Computing Ltd., 8 July 2015.
