relative frequency, frequency per million

Relative frequency (also called freq/mill in the interface) is the number of occurrences of an item per million tokens, also known as i.p.m. (instances per million). It is used to compare frequencies between corpora (or datasets) of different sizes.

Formula

number of hits ÷ corpus size in millions of tokens = frequency per million

An alternative calculation producing the same result (the raw frequency is the number of hits):
(raw frequency ÷ corpus size in tokens) × 1,000,000 = frequency per million
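
As a minimal sketch, the two calculations can be written in Python as follows. The function names and the sample figures (50 hits in a 2,000,000-token corpus) are made up for illustration and are not part of the interface.

```python
def frequency_per_million(hits: int, corpus_size_tokens: int) -> float:
    """Hits divided by corpus size in tokens, scaled to one million tokens."""
    if corpus_size_tokens <= 0:
        raise ValueError("corpus size must be a positive number of tokens")
    return hits / corpus_size_tokens * 1_000_000


def frequency_per_million_alt(hits: int, corpus_size_tokens: int) -> float:
    """Hits divided by the corpus size expressed in millions of tokens."""
    if corpus_size_tokens <= 0:
        raise ValueError("corpus size must be a positive number of tokens")
    return hits / (corpus_size_tokens / 1_000_000)


# Hypothetical figures: 50 hits in a corpus of 2,000,000 tokens.
example_a = frequency_per_million(50, 2_000_000)
example_b = frequency_per_million_alt(50, 2_000_000)
print(round(example_a, 2), round(example_b, 2))  # both calculations agree: 25.0 25.0
```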

Relative frequency and text types

The frequency per million is always calculated relative to the whole corpus or subcorpus, never relative to a text type. Restricting the query to one or more text types, either with the text type selector or by specifying text types in CQL, changes the number of hits, but the frequency per million is still calculated from the number of tokens in the whole (sub)corpus.

To relate the frequency per million to one or more text types, create a subcorpus from the text type(s) and restrict the query to this subcorpus.

Example

Looking up the frequency of the word helps in the British National Corpus (112,181,015 tokens), first without any restriction, then restricted to the spoken text type, and finally in the spoken subcorpus, produces these results:

SUBCORPUS SELECTED	none	none	spoken (11,787,138 tokens)
TEXT TYPE SELECTED	none	spoken	none
HITS	3,116	302	302
FREQUENCY PER MILLION	27.75 in relation to the number of tokens in the whole corpus	2.69 in relation to the number of tokens in the whole corpus	25.62 in relation to the number of tokens in the subcorpus
POSSIBLE INTERPRETATION	helps appears 27.75 times per million tokens in BNC	‘spoken’ helps appears 2.69 times per million tokens in BNC	helps appears 25.62 times per million tokens in the spoken part of BNC
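
The difference between the last two columns comes only from the denominator. A short Python sketch, using the token counts and the 302 hits from the table above (the constant and function names are illustrative assumptions), reproduces both figures:

```python
BNC_TOKENS = 112_181_015              # whole corpus, from the example above
SPOKEN_SUBCORPUS_TOKENS = 11_787_138  # spoken subcorpus, from the table
SPOKEN_HITS = 302                     # hits for "helps" in spoken texts


def per_million(hits: int, tokens: int) -> float:
    """Scale a raw hit count to occurrences per million tokens."""
    return hits / tokens * 1_000_000


# Text type selected: hits are restricted, but the whole corpus is the denominator.
text_type_fpm = per_million(SPOKEN_HITS, BNC_TOKENS)

# Subcorpus selected: the same hits are divided by the subcorpus size.
subcorpus_fpm = per_million(SPOKEN_HITS, SPOKEN_SUBCORPUS_TOKENS)

print(f"{text_type_fpm:.2f}")  # 2.69
print(f"{subcorpus_fpm:.2f}")  # 25.62
```

The hit count is identical in both cases; only the size of the reference (sub)corpus changes, which is why the subcorpus figure is the one that describes the spoken part of the corpus.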
