Most frequent or most typical collocations – which is more useful?

Word sketches in Sketch Engine are one-page summaries of the word combinations (called collocations) that a given word prefers. These summaries are computed automatically from a text corpus, a sample of language that can run to billions of words.

A word sketch divides the combinations into categories such as modifiers, verbs, and objects or subjects of a verb.

Naturally, each word can form more word combinations than those displayed in a word sketch by default. So how does Sketch Engine determine which collocations should be displayed? Where is the cut-off line? Users generally assume that this happens on the basis of frequency and that the collocations at the top of the list are the most frequent ones. In most cases, this would not be very useful, as we will see below. Sketch Engine takes a different approach and focuses on typicality (the strength of the collocation) rather than frequency of use.

What is the difference between frequency and typicality?

Frequency (weak collocations)

Surprisingly, the fact that a word combination is frequent is often of limited use, or even insignificant, for language teaching and learning or for language research. For example, here are the most frequent collocations of the word bedroom (only adjectives modifying the noun are included):

small
own
spare
twin
front
main
comfortable
big
large

Looking at the list, one notices that most of the words are very predictable. In other words, if a student of English wants to speak about a bedroom of a small size, they will naturally use the word small. They will not usually need to consult a dictionary to make sure that small is a suitable word combination. Similarly, when teaching bedroom as a new word, it is not useful to point the student to collocations such as small, own, big or comfortable because they are quite predictable. The collocations in this list would be classified as weak collocations.

Typicality (strong collocations)

On the other hand, typicality refers to collocations useful for learning or teaching or for inclusion in a dictionary. Typicality focuses on collocations which are not (completely) predictable. An example of such a collocation from the list above is twin bedroom. A collocation list for bedroom ordered by the typicality score will look quite different with these items at the top:


master
double
spacious
spare
en-suite
upstairs
twin
guest
air-conditioned

This list is more useful for language learning and more interesting for linguists and lexicographers. It all depends on the language level, of course, and the first list might actually be of some use to beginners, but it is the second list that we would expect to see when we want to learn how the word bedroom is used in English.

How does the software do it?

There is a very complex and sophisticated algorithm behind word sketches that identifies collocations and calculates the collocation score used to decide whether the collocation will be included in the word sketch. To get a rough understanding of how these collocations are identified, we can imagine the process as follows:

First, the algorithm identifies all instances of adjective + bedroom combinations in the corpus. Then Sketch Engine takes each adjective found, for example small, and looks for all small + noun combinations in the corpus. Each time small is found together with bedroom, it gets a plus point, and each time it is found in combination with another noun, it gets a minus point. (The actual algorithm is more complex, but even this simplification is sufficiently illustrative.) As a result, the algorithm will classify collocations like this:

  • adjectives that tend to combine with a large selection of other words, i.e. are very flexible in their use, will come out as weak collocations and will generally not be included in the word sketch
  • adjectives that only combine with one or a handful of nouns (they ‘specialize’ in combining with certain nouns only) will come out as strong collocations and will be included in the word sketch
  • even collocations such as small print will be included because the noun print does not combine with many other adjectives, so there is not much competition for small

By default, the collocates in a word sketch are sorted by score and the top 25 items are displayed. The user can change this limit and can also switch to sorting by frequency, which puts the less typical (and, from the teaching point of view, less advanced) items at the top.
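
The effect of the sorting choice and the display limit can be illustrated with a small Python sketch. The rows, their frequencies and their scores are invented for the example, and top_collocates is a hypothetical helper, not part of any Sketch Engine interface. Only the default limit of 25 comes from the description above.

```python
# Invented (collocate, frequency, score) rows; the numbers are NOT real data.
rows = [
    ("small", 40, 5.1),
    ("master", 25, 9.8),
    ("spare", 20, 8.9),
    ("own", 35, 4.7),
]


def top_collocates(rows, sort_by="score", limit=25):
    """Return collocates sorted by the chosen column, highest first,
    cut off after `limit` items."""
    column = {"frequency": 1, "score": 2}[sort_by]
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return [row[0] for row in ordered[:limit]]


if __name__ == "__main__":
    print(top_collocates(rows))                       # sorted by score
    print(top_collocates(rows, sort_by="frequency"))  # sorted by frequency
    print(top_collocates(rows, limit=2))              # tighter cut-off
```

Sorting by score puts master and spare first, while sorting by frequency brings the predictable small and own to the top.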