What is an n-gram?
An n-gram (also called multi-word unit or MWU) is a contiguous sequence of a given number of items (words, letters etc.). When the item is a word, a unigram is one word, a bigram is a sequence of two words, a trigram is a sequence of three words etc. The items in an n-gram need not have any relation between them apart from the fact that they appear next to each other in a sequence. This means that not every n-gram is a collocation; however, each collocation is an n-gram.
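To make the definition concrete, here is a minimal Python sketch that extracts contiguous n-grams from a list of items. The function name `ngrams` and the whitespace tokenization are assumptions made for illustration only; this is not how Sketch Engine computes n-grams internally.

```python
def ngrams(items, n):
    """Return every contiguous sequence of n items as a tuple."""
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Naive whitespace tokenization, used here only for illustration.
tokens = "at the end of the day".split()

bigrams = ngrams(tokens, 2)    # sequences of two words
trigrams = ngrams(tokens, 3)   # sequences of three words

# The items can also be letters: a string is a sequence of characters.
letter_bigrams = ngrams("word", 2)
```

Note that bigrams such as "of the" are produced even though the two words have no particular relation beyond standing next to each other.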
The study of n-grams is important for machine translation (frequent n-grams can be translated as chunks with correct word forms reflecting the surrounding items in the n-gram, rather than as a sequence of isolated items) and for language learning (frequent n-grams can be learnt as chunks rather than constructed from the individual items each time the student needs to use them).
Generating a list of most frequent n-grams
First, choose a corpus and then click on Word List in the left menu. Here you can choose the attribute to search (Search attribute). The important thing is to tick “use n-grams” and set the value of n (the default is 2, the maximum is 6). Clicking the “Make Word List” button below then shows the n-grams according to the selected options.
Creating n-grams can take several tens of seconds (especially 5- or 6-grams in large corpora).
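Conceptually, a list of the most frequent n-grams is a frequency count over all n-grams in the text. The following Python sketch illustrates the idea on a toy sentence; the function name `ngram_frequencies` and the sample text are hypothetical, and a real corpus tool will use indexed data rather than a plain loop like this.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count how often each contiguous n-gram occurs in tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Toy example text (an assumption, not taken from any corpus).
tokens = "the cat sat on the mat and the cat slept".split()

freq = ngram_frequencies(tokens, 2)   # n = 2, i.e. bigrams
top = freq.most_common(1)             # the single most frequent bigram
```

Every token position has to be visited for each value of n, which gives some intuition for why the computation takes longer on large corpora.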
(1) the word list can be generated from the whole corpus or from a subcorpus only; select the subcorpus here. You can also get information about the subcorpus or create a new one from text types
(2) select what you want to count, whether word forms, lemmas or something else. The list of options depends on how the corpus is annotated but will generally include these options:
attributes: word form, tag, lempos, lempos-lc, lemma, word form (lowercase), lemma-lc
word sketch: terms, collocations
text types: text types depend on the corpus selected and will be different for each corpus
(3) tick this option to calculate frequencies of n-grams
(4) when ticked, shorter n-grams are grouped under the longer n-grams that contain them; for example, "at the end" will be grouped under "at the end of" because the 3-gram "at the end" is a sub n-gram of the 4-gram "at the end of"
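The sub n-gram relation mentioned in (4) can be sketched in Python as follows. The helper names `is_sub_ngram` and `nest` are hypothetical, and the sketch only checks containment; any further criteria the interface may apply when grouping (for example, frequency conditions) are not known from this page and are not modelled.

```python
def is_sub_ngram(short, long):
    """True if short occurs as a contiguous run inside the longer n-gram."""
    k = len(short)
    return k < len(long) and any(
        tuple(long[i:i + k]) == tuple(short)
        for i in range(len(long) - k + 1)
    )


def nest(ngram_list):
    """Map each n-gram to the longer n-grams in the list that contain it."""
    return {
        short: [long for long in ngram_list if is_sub_ngram(short, long)]
        for short in ngram_list
    }


ngram_list = [
    ("at", "the", "end"),
    ("at", "the", "end", "of"),
    ("in", "the", "end"),
]
nested = nest(ngram_list)
# nested[("at", "the", "end")] lists the 4-gram it would be grouped under.
```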
See the Word list page for detailed information on creating word lists