Sketch Engine can be used for comparing corpora. If you want to compare your own corpora, you need to make sure they are compiled.

There are two ways to compare corpora:

  • The Compare corpora button in the left side menu leads to a cross table comparing multiple corpora in a single language, based on the most frequent words. It is available for all.
  • The Word List function with the Keywords output type compares a focus corpus with a reference corpus and offers several options for tuning the comparison.

In fact, “Compare corpora” is “Word List with Keywords” applied to all pairs of corpora in the language, in both directions (focus = A, reference = B and focus = B, reference = A).

To compare two corpora using Word List with Keywords:

1) Open a focus corpus in the corpus manager (by clicking on the name of a preloaded corpus or by clicking on the magnifying glass icon of a user corpus on the Corpora page).
2) Click Word List in the left side menu.
3) Set Output type to Keywords and select a reference corpus to compare the focus corpus with.
4) Set any other options in the form as needed (e.g. the Simple maths parameter).
5) Click ‘Make Word List’ to display a list of the words most specific to the focus corpus in comparison with the reference corpus.
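
The role of the Simple maths parameter in step 4 can be illustrated with a short sketch. It assumes the keyword score is the ratio of normalised (per-million) frequencies with the parameter N added to both sides, as in Kilgarriff's "simple maths" method. The function names, the toy counts and the default N = 1 are hypothetical and are not taken from Sketch Engine itself.

```python
from collections import Counter


def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyword_score(focus_fpm: float, ref_fpm: float, n: float = 1.0) -> float:
    """Simple-maths style score (assumed form): (focus + n) / (reference + n)."""
    return (focus_fpm + n) / (ref_fpm + n)


def keywords(focus: Counter, reference: Counter, n: float = 1.0, limit: int = 10):
    """Rank focus-corpus words by their score against the reference corpus."""
    focus_size = sum(focus.values())
    ref_size = sum(reference.values())
    scored = [
        (
            word,
            keyword_score(
                per_million(count, focus_size),
                per_million(reference.get(word, 0), ref_size),
                n,
            ),
        )
        for word, count in focus.items()
    ]
    # Highest score first: these are the words most specific to the focus corpus.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy counts, invented for illustration only.
focus = Counter({"the": 50, "corpus": 30, "keyword": 20})
reference = Counter({"the": 60, "corpus": 5, "weather": 35})
for word, score in keywords(focus, reference):
    print(f"{word}\t{score:.2f}")
```

Under this assumed formula, a larger N damps the effect of words that are rare in the reference corpus, so the list shifts towards more frequent words; a smaller N favours rarer words.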

Process of comparing corpora

– for every pair of corpora:
– take the top 5000 words by frequency (from each corpus separately),
– compute the keyword score for every word in the union of the two lists,
– keep only the top 500 words by score,
– the arithmetic mean of their scores is the similarity of the pair of corpora.

– the similarity is symmetric,
– the results are cached,
– 1 = identical; the higher the score, the more different the corpora,
– two similarities are incomparable
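
The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Sketch Engine's implementation: the function names and toy data are hypothetical, the keyword score is assumed to be the simple-maths ratio of per-million frequencies, and the ratio is assumed to be folded (the larger of A/B and B/A) so that the result is symmetric and equals 1 for identical corpora.

```python
from collections import Counter


def top_words(freqs: Counter, k: int) -> set:
    """Return the k most frequent words of one corpus."""
    return {word for word, _ in freqs.most_common(k)}


def similarity(a: Counter, b: Counter, top_freq: int = 5000,
               top_score: int = 500, n: float = 1.0) -> float:
    """Mean of the highest keyword scores over the union of frequent words."""
    size_a, size_b = sum(a.values()), sum(b.values())
    # Union of the most frequent words taken from each corpus separately.
    vocabulary = top_words(a, top_freq) | top_words(b, top_freq)
    scores = []
    for word in vocabulary:
        smoothed_a = a.get(word, 0) * 1_000_000 / size_a + n
        smoothed_b = b.get(word, 0) * 1_000_000 / size_b + n
        # Assumption: fold the ratio so the score is >= 1 in both directions.
        scores.append(max(smoothed_a / smoothed_b, smoothed_b / smoothed_a))
    # Keep only the highest-scoring words and average their scores.
    best = sorted(scores, reverse=True)[:top_score]
    return sum(best) / len(best)


# Toy counts, invented for illustration only.
corpus_a = Counter({"the": 50, "corpus": 30, "keyword": 20})
corpus_b = Counter({"the": 60, "corpus": 5, "weather": 35})
print(similarity(corpus_a, corpus_a))
print(similarity(corpus_a, corpus_b) == similarity(corpus_b, corpus_a))
```

Because the result depends only on the pair of corpora, a real system could store each pairwise value once and reuse it, which fits the note on caching.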


The picture shows a comparison of various English corpora. The scores in the table stand for corpus similarity: 1 means identical corpora, and the bigger the score (and the darker the grey), the greater the difference between the two corpora. You can see that the two enTenTen corpora and the New Model Corpus are all very similar to each other, and that the ‘most different’ pair is London English (informal conversations in two working-class areas of London) and BAWE (academic written English).


Related paper

Adam Kilgarriff. Comparing Corpora. In International Journal of Corpus Linguistics, Volume 6, Number 1, 2001, pp. 97-133.