Statistics used in Sketch Engine
1 General reference
Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel (2014): The Sketch Engine: ten years on. In Lexicography 1(1): 7–36. DOI: 10.1007/s40607-014-0009-9. ISSN 2197-4292
This document describes the statistics used in the Sketch Engine system. The following conventions apply unless specified otherwise:
$N$ – corpus size,
$f_A$ – number of occurrences of the keyword in the whole corpus (the size of the concordance),
$f_B$ – number of occurrences of the collocate in the whole corpus,
$f_{AB}$ – number of occurrences of the collocate in the concordance (number of co-occurrences).
2.1 With grammatical relations
Terminology follows Dekang Lin, ACL-COLING 1998: “Automatic Retrieval and Clustering of Similar Words.”
We count frequencies for triples of a first word connected by a specific grammatical relation to a second word, written (word1, gramrel, word2), abbreviated below as ($w_1$, $R$, $w_2$):
$\|w_1, R, w_2\|$ – number of occurrences of the triple,
$\|w_1, R, *\|$ – number of occurrences of the first word in the grammatical relation with any second word,
$\|*, *, w_2\|$ – number of occurrences of the second word in any grammatical relation with any first word,
$\|*, *, *\|$ – number of occurrences of any first word in any grammatical relation with any second word, that is, the total number of triples found in the corpus.
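As an illustration of the four counts listed above, here is a minimal Python sketch that tabulates them from a list of (word1, gramrel, word2) triples. The function name, the dictionary keys, and the toy triples are invented for this example and are not part of Sketch Engine.

```python
from collections import Counter

def triple_counts(triples):
    """Tabulate the four frequency types from (word1, gramrel, word2) triples."""
    triples = list(triples)
    return {
        # occurrences of each full triple
        "triple": Counter(triples),
        # first word in a given relation with any second word
        "first_in_rel": Counter((w1, rel) for w1, rel, _ in triples),
        # second word in any relation with any first word
        "second_any": Counter(w2 for _, _, w2 in triples),
        # total number of triples
        "total": len(triples),
    }

# Toy data, invented for illustration only.
toy = [
    ("drink", "object", "beer"),
    ("drink", "object", "beer"),
    ("drink", "object", "wine"),
    ("brew", "object", "beer"),
    ("cold", "modifies", "beer"),
]
counts = triple_counts(toy)
print(counts["triple"][("drink", "object", "beer")])
print(counts["first_in_rel"][("drink", "object")])
print(counts["second_any"]["beer"])
print(counts["total"])
```

A real corpus index stores these counts precomputed; the sketch only shows what each count ranges over.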
3 Word Sketches
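Until September 2006 we used a version of MI-Score, modified to give greater weight to the frequency of the collocation, defined as:

$$\mathrm{AScore}(w_1, R, w_2) = \log \frac{\|w_1, R, w_2\| \cdot \|*, *, *\|}{\|w_1, R, *\| \cdot \|*, *, w_2\|} \cdot \log\left(\|w_1, R, w_2\| + 1\right)$$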
also see MI Score
In September 2006, noting the scale dependency of AScore and recent relevant research, including Curran 2004, “From Distributional to Semantic Similarity” (PhD thesis, University of Edinburgh), we changed the statistic to logDice, which is based on the Dice coefficient:

$$\mathrm{logDice} = 14 + \log_2 \frac{2 \cdot \|w_1, R, w_2\|}{\|w_1, R, *\| + \|*, *, w_2\|}$$
For more information on logDice, see: Rychlý, P. (2008). A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, pp. 6–9.
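Since June 2015 (word sketch format 4, Manatee version 2.125), the indices have been modified so that the score is (more correctly) computed as follows:
logDice general word sketch score (applies in all cases except those listed below):

$$14 + \log_2 \frac{2 \cdot \|w_1, R, w_2\|}{\|w_1, R, *\| + \|*, R, w_2\|}$$

where $\|*, R, w_2\|$ is the number of occurrences of the second word in the grammatical relation $R$ with any first word.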
score for word sketch triples of UNARY grammatical relations
score for a given grammatical relation R as such
score for word sketch display with unified grammatical relations
For example, the score of management in the word sketch of team (as a noun) in the BNC corpus is 9.31, and it is computed as:

$$14 + \log_2 \frac{2 \cdot 433}{13919 + 8314}$$

where 433 is the number of co-occurrences for the relation “management as modifier of team”; 13919 is the number of hits for the CQL query
lc [ws("team-n", "modifiers of \"%w\"", ".*")]; and 8314 is the number of hits for the CQL query
lc [ws(".*", "modifiers of \"%w\"", "management-n")].
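The worked example can be checked with a few lines of Python. This is a minimal sketch that assumes the general logDice form 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence count (433) and f_x, f_y are the two query counts (13919 and 8314). The function and parameter names are illustrative, not Sketch Engine or Manatee identifiers.

```python
import math

def log_dice(f_xy, f_x, f_y):
    """logDice from a co-occurrence count and the two marginal counts."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Counts taken from the "management as modifier of team" example.
score = log_dice(433, 13919, 8314)
print(f"{score:.3f}")
```

The result is about 9.318, consistent with the 9.31 quoted in the example if the displayed value is truncated rather than rounded.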
4 Thesaurus
To compute a similarity score between word $w_1$ and word $w_2$, we compare the word sketches of $w_1$ and $w_2$ in this way:
- find all the overlaps, i.e. where $w_1$ and $w_2$ share a collocation in the same grammatical relation, e.g. (beer / wine, OBJECT_OF, drink), where the association score > 0,
- let $WS_1$ and $WS_2$ be the sets of all word sketch triples (headword, relation, collocation) for $w_1$ and $w_2$, respectively, where the association score > 0,
- let $ctx(w_1) = \{(r, c) \mid (w_1, r, c) \in WS_1\}$, and likewise $ctx(w_2)$ for $WS_2$,
- let $AS_i$ be the association score of a word sketch triple (since September 2006, logDice is used),
- then the distance between $w_1$ and $w_2$ is computed as:

$$Dist(w_1, w_2) = \frac{\sum_{(tup_i, tup_j) \in ctx(w_1) \cap ctx(w_2)} \left( AS_i + AS_j - \frac{(AS_i - AS_j)^2}{50} \right)}{\sum_{tup_i \in WS_1} AS_i + \sum_{tup_j \in WS_2} AS_j}$$

The term $(AS_i - AS_j)^2 / 50$ is subtracted in order to give less weight to shared triples where the triple is far more salient with $w_1$ than with $w_2$, or vice versa. We find that this contributes to more readily interpretable results, where words of similar frequency are more often identified as near neighbours of each other.
The constant 50 can be changed using the -k option of the mkthes command.
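The comparison described above can be sketched in Python. Three things are assumptions made for illustration. First, each word sketch is represented as a dictionary from (relation, collocate) to a positive association score. Second, the score sums AS_i + AS_j − (AS_i − AS_j)²/k over the shared contexts. Third, that sum is normalized by the total of all scores of both words. The function name and toy scores are invented; only the default k = 50 comes from the text.

```python
def sketch_similarity(ws1, ws2, k=50.0):
    """Compare two word sketches given as {(relation, collocate): score} dicts.

    Assumed form: sum over shared contexts of AS_i + AS_j - (AS_i - AS_j)**2 / k,
    divided by the sum of all association scores of both words.
    """
    shared = ws1.keys() & ws2.keys()
    numerator = sum(
        ws1[ctx] + ws2[ctx] - (ws1[ctx] - ws2[ctx]) ** 2 / k
        for ctx in shared
    )
    denominator = sum(ws1.values()) + sum(ws2.values())
    return numerator / denominator if denominator else 0.0

# Toy association scores, invented for illustration only.
beer = {("object_of", "drink"): 9.0, ("object_of", "brew"): 8.0}
wine = {("object_of", "drink"): 8.0, ("modifier", "red"): 10.0}

sim = sketch_similarity(beer, wine)
print(round(sim, 4))
```

Under these assumptions a word compared with itself scores 1.0, and a larger gap between the two scores of a shared context lowers that context's contribution.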
5 Key words, key terms, comparing corpora
Key words are words typical of a focus corpus (a corpus we are interested in) in contrast to a reference corpus (usually a general corpus in the same language as the focus corpus).
The keyness score of a word is calculated according to the following formula:

$$\frac{fpm_{focus} + n}{fpm_{ref} + n}$$

where $fpm_{focus}$ is the normalized (per million) frequency of the word in the focus corpus, $fpm_{ref}$ is the normalized (per million) frequency of the word in the reference corpus, and $n$ is the simple maths (smoothing) parameter ($n = 1$ is the default value).
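A minimal Python sketch of this calculation follows, assuming the score is the ratio of the two per-million frequencies with the smoothing parameter n added to each. The function names and the toy counts are illustrative and not part of Sketch Engine.

```python
def per_million(count, corpus_size):
    """Normalize a raw frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def keyness(fpm_focus, fpm_ref, n=1.0):
    """Keyness with smoothing parameter n (n = 1 is the documented default)."""
    return (fpm_focus + n) / (fpm_ref + n)

# Toy counts, invented for illustration: 50 hits in a 1M-token focus corpus,
# 200 hits in a 100M-token reference corpus.
focus = per_million(50, 1_000_000)
ref = per_million(200, 100_000_000)
print(keyness(focus, ref))          # default n = 1
print(keyness(focus, ref, n=100))   # larger n favours more frequent words
```

Adding n to both terms avoids division by zero for words absent from the reference corpus, and raising n shifts the ranking towards higher-frequency words.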
The top key words reflect the domain of the focus corpus very well and can be used to explore differences between corpora in Sketch Engine as shown in Kilgarriff: “Getting to know your corpus”. In Proceedings of Text, Speech and Dialogue 2012, Lecture Notes in Computer Science. Springer, 2012.
Key terms are multi-word noun phrases typical of a corpus. They are defined using term definition rules (similar to word sketch relations). The keyness score for terms is calculated in the same way as for words, except that the corpus frequencies of whole term phrases are used.
6 Other statistics
These are the statistics offered under the “collocations” function accessible from the concordance window; these statistics do not involve grammatical relations.
also see T-score
also see MI Score
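As a hedged illustration of the two scores referenced here, the sketch below uses their conventional textbook definitions. It writes N for the corpus size, f_A and f_B for the corpus frequencies of the keyword and the collocate, and f_AB for their co-occurrence count. It assumes T-score = (f_AB − f_A·f_B/N) / √f_AB and MI = log2(f_AB·N / (f_A·f_B)); the function names and toy counts are invented, and the Sketch Engine implementation may differ in detail.

```python
import math

def t_score(f_ab, f_a, f_b, n):
    """T-score: observed minus expected co-occurrences, scaled by sqrt(observed)."""
    expected = f_a * f_b / n
    return (f_ab - expected) / math.sqrt(f_ab)

def mi_score(f_ab, f_a, f_b, n):
    """Mutual information: log2 of observed over expected co-occurrences."""
    return math.log2(f_ab * n / (f_a * f_b))

# Toy counts, invented for illustration only.
N, F_A, F_B, F_AB = 1_000_000, 1000, 500, 100
t = t_score(F_AB, F_A, F_B, N)
mi = mi_score(F_AB, F_A, F_B, N)
print(round(t, 2), round(mi, 2))
```

With these definitions T-score tends to favour frequent collocations, while MI favours rare, strongly associated pairs.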
Church and Hanks, Word Association Norms, Mutual Information, and Lexicography, in Computational Linguistics, 16(1):22-29, 1990
Oakes, Statistics for Corpus Linguistics, 1998
Dunning, Accurate Methods for the Statistics of Surprise and Coincidence, in Computational Linguistics, 19(1), 1993
Pedersen, Dependent Bigram Identification, in Proc. Fifteenth National Conference on Artificial Intelligence, 1998
MI.log-f (formerly called salience)
Kilgarriff, Rychlý, Smrž, Tugwell, “The Sketch Engine”, in Proc. Euralex, 2004