A token is the smallest unit into which a corpus is divided. Typically, each word form and each punctuation mark (comma, full stop, …) is a separate token; therefore, corpora contain more tokens than words. Spaces between words are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often specific to a language; for example, don’t in English consists of 2 tokens.
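As a minimal illustrative sketch (this is not the tokenizer any particular corpus tool uses), a regex-based tokenizer can separate punctuation from word forms and split an English clitic such as n’t off its host word, so that don’t yields 2 tokens:

```python
import re

def tokenize(text):
    # Toy English tokenizer for illustration only:
    #  1. a word form directly followed by the clitic "n't"
    #  2. the clitic "n't" itself
    #  3. any other word form
    #  4. any single punctuation character
    pattern = r"\w+(?=n't)|n't|\w+|[^\w\s]"
    return re.findall(pattern, text)

print(tokenize("Don't stop, now."))
# ['Do', "n't", 'stop', ',', 'now', '.'] -- 6 tokens, but only 3 words
```

Note that the sentence has more tokens than words: the comma and full stop count as tokens, and the contraction is split in two, while the spaces produce no tokens at all.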
Adam Kilgarriff Prize
Adam Kilgarriff (1960–2015) was a British corpus linguist and the founder of Lexical Computing, the company behind Sketch Engine. Adam devoted his whole life to research at the intersection of corpus linguistics, computational linguistics and lexicography.
To honour our brilliant and much-loved colleague, we established the Adam Kilgarriff Prize for outstanding work in the fields to which Adam contributed so much: corpus linguistics, computational linguistics, and lexicography.