See how the frequency per million is calculated.

Why does the frequency per million differ between a query restricted to, e.g., a specific text type and the same query run on a selected subcorpus?

The frequency per million is always calculated relative to the size of the corpus (or subcorpus) being searched. A text type filter or any other constraint in the query therefore only changes the number of hits, not the size used in the calculation.


The British National Corpus (BNC) has 112,181,015 tokens. (The queries below are calculated against the whole corpus.)

1. CQL: [word="helps"] -> hits: 3,055 (27.23 per million)

2. CQL: [word="helps" & tag="V.*"] -> hits: 2,774 (24.70 per million)

3. CQL: [word="helps"] with only the spoken text type selected -> hits: 291 (2.59 per million)

(The following queries are calculated against a subcorpus.)

4. CQL: [word="helps"] with a written subcorpus selected -> hits: 2,764 (27.54 per million); this subcorpus has 100,351,427 tokens

5. CQL: [word="helps"] with a spoken subcorpus selected -> hits: 291 (24.60 per million); this subcorpus has 11,829,588 tokens (approx. 10 % of the whole corpus)


The difference between the spoken text type [ex. 3] and the spoken subcorpus [ex. 5] is large because the third example divides the hits by the size of the whole corpus (over 110 million tokens), whereas the fifth example divides them by the size of the spoken subcorpus only (approx. 12 million tokens). The results differ by a factor of almost 10.
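
The arithmetic behind examples 3 and 5 can be reproduced with a short sketch. It assumes that frequency per million is simply the number of hits divided by the token count of the corpus or subcorpus, multiplied by one million, which matches the figures above. The function and constant names are illustrative only and are not part of any corpus tool.

```python
def per_million(hits: int, corpus_tokens: int) -> float:
    """Relative frequency: hits per one million tokens of the reference size."""
    return hits / corpus_tokens * 1_000_000


# Token counts taken from the examples above.
BNC_TOKENS = 112_181_015      # whole corpus
WRITTEN_TOKENS = 100_351_427  # written subcorpus
SPOKEN_TOKENS = 11_829_588    # spoken subcorpus

# Example 3: spoken text type filter, still divided by the whole corpus.
text_type_spoken = per_million(291, BNC_TOKENS)

# Example 5: spoken subcorpus, divided by the subcorpus size only.
subcorpus_spoken = per_million(291, SPOKEN_TOKENS)

print(f"{text_type_spoken:.2f}")                     # 2.59
print(f"{subcorpus_spoken:.2f}")                     # 24.60
print(f"{subcorpus_spoken / text_type_spoken:.1f}")  # 9.5
```

The number of hits is the same in both cases (291); only the denominator changes, so the ratio of the two results equals the ratio of the corpus size to the subcorpus size.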