Simple maths with keywords and terms

Simple maths is the keyness score used in Sketch Engine to identify keywords, terms, key n-grams and key word sketch collocations. Simple maths compares the frequencies in the focus corpus with the frequencies in the reference corpus. Alternatively, two subcorpora in the same corpus or in different corpora can be used.

The smoothing parameter N makes the score prefer more frequent or less frequent items.

A higher value of N shifts the focus towards higher-frequency (more common) words, whereas a lower value favours lower-frequency (rarer) words. The value should be changed by orders of magnitude, e.g. 0.1, 1, 10, 100, 1000, 10000. Smaller changes rarely produce any noticeable effect.
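The effect of N can be illustrated with a small Python sketch. It uses the score (fpm_focus + N) / (fpm_ref + N) given by the formula in this article. The two words and their per-million frequencies are invented for illustration and do not come from any real corpus; the function and variable names are likewise assumptions of this sketch.

```python
def score(fpm_focus, fpm_ref, n):
    """Simple maths keyness score for one item."""
    return (fpm_focus + n) / (fpm_ref + n)


# Hypothetical per-million frequencies: (focus corpus, reference corpus).
words = {
    "rare_word": (10.0, 1.0),        # rare, 10 times more frequent in focus
    "common_word": (1000.0, 200.0),  # common, 5 times more frequent in focus
}


def rank(n):
    """Return the words ordered from highest to lowest score for a given N."""
    return sorted(words, key=lambda w: score(*words[w], n), reverse=True)


for n in (1, 100):
    print(n, rank(n))
```

With N = 1 the rare word ranks first (5.5 against roughly 4.98), while with N = 100 the common word ranks first (roughly 3.67 against 1.09), which matches the behaviour described above.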

The statistic is a variation on the statement “word W is so-and-so times more frequent in corpus X than in corpus Y”. The formula is:

$\frac{fpm_{\mathrm{focus}} + N}{fpm_{\mathrm{ref}} + N}$

where

$fpm_{\mathrm{focus}}$ is the normalized (per million) frequency of the word in the focus corpus,

$fpm_{\mathrm{ref}}$ is the normalized (per million) frequency of the word in the reference corpus,

$N$ is the smoothing parameter ($N = 1$ is the default value).
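The formula can be sketched in a few lines of Python. This is a minimal illustration, not Sketch Engine's own code: the function names `fpm` and `simple_maths` are chosen here, and the requirement that N be positive is an assumption made so that the denominator can never be zero.

```python
def fpm(hits: int, corpus_size: int) -> float:
    """Normalized frequency: occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size


def simple_maths(fpm_focus: float, fpm_ref: float, n: float = 1.0) -> float:
    """Simple maths score: (fpm_focus + N) / (fpm_ref + N).

    Assumption of this sketch: N must be positive, so an item that is
    absent from the reference corpus (fpm_ref == 0) still gets a finite score.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (fpm_focus + n) / (fpm_ref + n)
```

An item with the same normalized frequency in both corpora scores exactly 1 for any N; values above 1 indicate an item that is relatively more frequent in the focus corpus.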

Example

Your focus corpus (BNC): 112,289,776 tokens
Frequency of the lemma (shard) in the corpus: 35

Relative frequency

$fpm_{\mathrm{focus}} = \frac{\text{number of hits} \cdot 1{,}000{,}000}{\text{corpus size}} = \frac{35 \cdot 1{,}000{,}000}{112{,}289{,}776} = 0.3117$

Selected reference corpus (ukWaC): 1,559,716,979 tokens
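Frequency of the lemma (shard) in the corpus: 263

Relative frequency

$fpm_{\mathrm{ref}} = \frac{263 \cdot 1{,}000{,}000}{1{,}559{,}716{,}979} = 0.1686$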

Keyness score

$Score = \frac{fpm_{\mathrm{focus}} + N}{fpm_{\mathrm{ref}} + N} = \frac{0.3117 + 1}{0.1686 + 1} = 1.1224$
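The worked example can be checked with a short Python script. The corpus sizes and hit counts are the ones given above; the variable names are chosen for this sketch only.

```python
# Figures from the example: lemma "shard" in BNC (focus) and ukWaC (reference).
focus_hits, focus_size = 35, 112_289_776
ref_hits, ref_size = 263, 1_559_716_979
n = 1.0  # default smoothing parameter

# Normalize the raw counts to frequencies per million tokens.
fpm_focus = focus_hits * 1_000_000 / focus_size
fpm_ref = ref_hits * 1_000_000 / ref_size

# Simple maths keyness score.
keyness = (fpm_focus + n) / (fpm_ref + n)

print(f"fpm_focus = {fpm_focus:.4f}")  # 0.3117
print(f"fpm_ref   = {fpm_ref:.4f}")    # 0.1686
print(f"score     = {keyness:.4f}")    # 1.1224
```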

For more details see:

Adam Kilgarriff. Simple maths for keywords. In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.

Statistic used in Sketch Engine (Chapter 5). Lexical Computing Ltd., 8 July 2015.
