token

A token is the smallest unit into which a corpus is divided. Typically, each word form and each punctuation mark (comma, full stop, …) is a separate token, so corpora contain more tokens than words. Spaces between words are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often language-specific; for example, don't in English consists of 2 tokens (do + n't).
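
As an illustration, here is a minimal sketch of a tokenizer in Python. It is a toy based on a regular expression, not the tokenizer behind any particular corpus tool: it splits off the English contraction n't as its own token, treats each punctuation mark as a token, and discards the spaces between words.

```python
import re

def tokenize(text: str) -> list[str]:
    # A minimal, illustrative English tokenizer:
    # - splits off the contraction n't ("don't" -> "do" + "n't")
    # - treats each punctuation mark as its own token
    # - discards whitespace, since spaces are not tokens
    pattern = r"\w+(?=n't)|n't|\w+|[^\w\s]"
    return re.findall(pattern, text)

print(tokenize("Don't stop, please."))
# ['Do', "n't", 'stop', ',', 'please', '.']
```

Note that the sentence above contains three words but six tokens, which is why token counts in a corpus are higher than word counts.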