Word list is a generic name for all kinds of frequency lists that Sketch Engine can generate. The most frequent type is a word list of word forms or lemmas, but the user can also generate, for example, frequency lists of tags, collocates or character trigrams, as shown in the examples below.

Examples

In the figure, you can see several word lists extracted from various corpora.

  • A – the most frequent morphological tags used in the British National Corpus
  • B – frequency list of collocates of the English verb ‘take’
  • C – frequency list of words in the British National Corpus (BNC)
  • D – frequency list of character trigrams from the BNC
  • E – frequency list of words starting with ‘b’
  • F – frequency list of the word forms of ‘be’

[Figure: word lists extracted from various corpora]

If you need to restrict the word list to a specific part of speech (e.g. adjectives), you have two options:

  • if the corpus has a lempos attribute, select “lempos” as the search attribute and type .*-j (for adjectives) into the Regular expression filter; check the Corpus details screen for the lempos suffixes available in the corpus
  • if the corpus does not contain a lempos attribute or you need units other than lemmas – select “tag” as the search attribute, type the tag (tags vary across corpora) into the Regular expression filter, then tick Change output attribute(s) and select one or more output attributes (a sketch of this approach follows below)
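
To make the tag-based option more concrete, here is a minimal, hypothetical sketch of the idea: keep only tokens whose tag matches a pattern and count a different output attribute (here the lemma). The JJ/JJR adjective tags and the sample tokens are illustrative assumptions; real tagsets and attributes differ between corpora, and this is not Sketch Engine's internal code.

import re
from collections import Counter

# (word form, lemma, tag) triples standing in for an annotated corpus
tokens = [
    ("bigger", "big", "JJR"),
    ("big", "big", "JJ"),
    ("runs", "run", "VBZ"),
    ("nice", "nice", "JJ"),
]

# keep tokens whose tag matches the pattern, then count the lemma attribute
adjective_lemmas = Counter(
    lemma for word, lemma, tag in tokens if re.fullmatch(r"JJ.*", tag)
)
print(adjective_lemmas)   # Counter({'big': 2, 'nice': 1})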

Other uses include various statistical information about the corpus, e.g. a list of documents in the corpus sorted by the number of tokens.

How to generate a word list

  • click Home and select a corpus by clicking it
  • select Word list from the left menu
  • use one of the preset options:

[Figure: preset options – All words, All lemmas]

The left menu contains two preset options:

All words
generates a frequency list of word forms in the corpus

All lemmas
generates a frequency list of lemmas in the corpus

OR

  • use the following settings to define what should be included in your word list

[Figure: word list settings]

Settings in detail

(1) the word list can be generated from the whole corpus or from a subcorpus only – select the subcorpus here; you can also display information about the subcorpus or create a new one from text types

(2) select what you want to count, whether word forms, lemmas or other attributes. The list of options depends on how the corpus is annotated but will generally include these options:
attributes: word form, tag, lempos, lempos-lc, lemma, word form (lowercase), lemma-lc
word sketch: terms, collocations
text types: text types depend on the corpus selected and will be different for each corpus

(3) tick this option to calculate frequencies of n-grams

(4) when ticked, ‘at the end’ will be grouped under ‘at the end of’ because the 3-gram ‘at the end’ is a sub-n-gram of the 4-gram ‘at the end of’ (see the sketch below)
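
The following minimal sketch (not Sketch Engine's algorithm) only illustrates what ‘sub-n-gram’ means here: the shorter n-gram occurs as a contiguous slice of the longer one.

def is_sub_ngram(short, long):
    """True if `short` occurs as a contiguous slice of `long`."""
    s, l = short.split(), long.split()
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

print(is_sub_ngram("at the end", "at the end of"))   # True  -> grouped together
print(is_sub_ngram("the end of", "at the end of"))   # True
print(is_sub_ngram("at the of", "at the end of"))    # False -> not contiguous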

Filter options

Use the following filters to exclude the items you are not interested in:

(5) use regular expressions to limit the results to a certain pattern
simple example: ca.* produces a frequency list of words starting with ca
please refer to the examples further below

(6) use a limit to exclude low-frequency words; use zero to include all words

(7) use a limit to exclude high-frequency words

(8) if the frequency should be calculated only for a closed list of words, upload the list here. The file must be a plain-text UTF-8 file with one word per line, and the items must correspond to the selected attribute: e.g. when lemma is selected as the attribute, goes produces no result because it is not a lemma; when lempos is selected, all items must have the format of a lemma plus part-of-speech suffix, i.e. go-v, money-n etc. (a sketch of such a file appears at the end of this section)

(9) use the blacklist to exclude a closed set of items from the frequency list

(10) when ticked, non-words will be included in the list
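
As an illustration of the file expected by the whitelist (8) and blacklist (9) options, here is a small sketch that writes a plain-text UTF-8 file with one item per line and checks that each item looks like a lempos value. The file name, the suffixes and the check itself are assumptions for illustration only, not part of Sketch Engine.

import re

# valid lempos items: lemma + part-of-speech suffix (suffixes vary by corpus)
items = ["go-v", "money-n", "take-v"]

# plain-text UTF-8, one item per line
with open("whitelist.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(items) + "\n")

# quick sanity check: each line should look like a lempos value;
# a plain word form such as "goes" would fail this check
with open("whitelist.txt", encoding="utf-8") as f:
    for line in f:
        item = line.strip()
        if item and not re.fullmatch(r".+-[a-z]", item):
            print(f"not a lempos item: {item}")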

Output options

Here you can specify what should be displayed on the output screen.

(11) frequency figures
hit counts – the number of occurrences will be displayed next to each item
document counts – the number of documents in the corpus where the item appears at least once
ARF – average reduced frequency, a specialized statistic which reduces the frequency of items whose occurrences are concentrated in a small part of the corpus
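
To make the three figures concrete, the sketch below computes them for a toy corpus. The ARF implementation follows the commonly cited formulation (cyclic gaps between successive occurrences, capped at the average gap); it is an illustration, not Sketch Engine's internal code.

def hit_count(tokens, item):
    return sum(1 for t in tokens if t == item)

def document_count(documents, item):
    return sum(1 for doc in documents if item in doc)

def arf(tokens, item):
    positions = [i for i, t in enumerate(tokens) if t == item]
    f = len(positions)
    if f == 0:
        return 0.0
    n = len(tokens)
    v = n / f                                        # average gap between occurrences
    gaps = [positions[i + 1] - positions[i] for i in range(f - 1)]
    gaps.append(n - positions[-1] + positions[0])    # cyclic gap back to the start
    return sum(min(d, v) for d in gaps) / v

documents = [
    "the cat sat on the mat".split(),
    "the dog barked".split(),
]
tokens = [t for doc in documents for t in doc]

print(hit_count(tokens, "the"))          # 3
print(document_count(documents, "the"))  # 2
print(round(arf(tokens, "the"), 2))      # 2.67 – equals 3 only if occurrences are evenly spread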

(12) + (13) output type
simple – will produce a frequency list of all items matching the criteria
keywords – will only include keywords in the frequency list, i.e. specialized terminology related to the topic of the corpus. A reference corpus (14) has to be selected (leave the preselected one if not sure); the slider (15) can be used to influence to what extent more common (= less specialized) words should be included (see the sketch below).
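
The sketch below illustrates the general idea behind keyword scoring against a reference corpus: normalised frequencies are compared, with a smoothing constant that plays a role similar to the slider (15). The exact scoring used by Sketch Engine may differ; the numbers and the parameter n here are assumptions for illustration.

def keyness(freq_focus, size_focus, freq_ref, size_ref, n=1.0):
    """Compare normalised frequencies of an item in two corpora."""
    fpm_focus = freq_focus / size_focus * 1_000_000   # frequency per million
    fpm_ref = freq_ref / size_ref * 1_000_000
    return (fpm_focus + n) / (fpm_ref + n)

# a specialized term: frequent in the focus corpus, rare in the reference corpus
print(round(keyness(500, 1_000_000, 20, 100_000_000), 1))          # ~417.5
# an everyday word: similar relative frequency in both corpora
print(round(keyness(300, 1_000_000, 30_000, 100_000_000), 1))      # ~1.0
# a larger n flattens the scores, letting more common words climb the list
print(round(keyness(500, 1_000_000, 20, 100_000_000, n=1000), 1))  # ~1.5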

(16) the results can be calculated for one attribute while different attributes are displayed as output, e.g. frequencies can be calculated for lemmas but word forms displayed as output; up to 3 attributes can be displayed

Examples of filters with regular expressions

Here are some examples of frequently used word list settings with regular expressions.

A list of nouns

Search attribute: lempos or lempos (lowercase)
Regular expression: .*-n
(-n might not be the noun suffix in all corpora, please refer to the Corpus details screen)

Note: the same result can be achieved by searching tags but lempos produces the results faster. To hide the -n suffix in the results, use Change output attributes: lemma or word

A list of 2- to 4-letter acronyms

The word list will contain all words written with 2 to 4 upper-case letters.

Search attribute: word
Regular expression: [A-Z]{2,4}

A list of verbs and nouns beginning with re-

Search attribute: lempos
Regular expression: re.*-[vn]
(-n and -v might not be the right suffixes in all corpora, please refer to the Corpus details screen)
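
For readers who want to check a pattern before generating a word list, the sketch below applies the three example filters to a handful of made-up attribute values. The regular expression is matched against the whole attribute value (hence re.fullmatch), and the -n/-v suffixes are only examples that may differ between corpora.

import re

samples = ["dog-n", "DNA", "NATO", "rebuild-v", "reaction-n", "table", "go-v"]

filters = {
    "nouns (lempos)":               r".*-n",
    "2-4 letter acronyms (word)":   r"[A-Z]{2,4}",
    "re- verbs and nouns (lempos)": r"re.*-[vn]",
}

for label, pattern in filters.items():
    matches = [s for s in samples if re.fullmatch(pattern, s)]
    print(f"{label}: {matches}")

# nouns (lempos): ['dog-n', 'reaction-n']
# 2-4 letter acronyms (word): ['DNA', 'NATO']
# re- verbs and nouns (lempos): ['rebuild-v', 'reaction-n']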

Change output attributes seems to produce incorrect counts?

When you use the Change output attributes option, the frequencies may not be calculated from the whole corpus. With this option selected, it is compulsory to use a regular expression. First, a concordance of the words matching the regular expression is created and the frequency is calculated only from the first 10 million hits. If the corpus is large and there are more than 10 million hits matching the regular expression, the hits beyond the first 10 million will be ignored.

Using a regular expression such as .* to match any word works exactly the same: a concordance will be created for the first 10 million words. If the corpus is bigger than 10 million words, the rest of the corpus will not be included in the frequency. The output screen notifies you about this and offers the option of using random 10 million rather than the first 10 million lines.

Word list limitations

There is no limit to the number of word lists a user can generate. The number of items in each word list is subject to these restrictions:

Word lists generated from:

user corpora – there is no limit to the number of word list items in each word list

preloaded corpora – the maximum number of items in each word list is limited to 1,000.

Bypassing the limits

Although there is a limit of 1,000 items in the case of preloaded corpora (user corpora are not subject to any limitation), the interface offers powerful filtering using regular expressions so that only the required words are included, and the 1,000-item limit might therefore be sufficient for you. The user can request access to unlimited word lists; please see the conditions below.

Commercial or lexicographic purposes

For retrieving unlimited word lists in order to use them for any commercial purposes or lexicography please request a quote from us.

Research-only purposes

Retrieving full word lists from a preloaded corpus for research purposes is only possible upon signing a research agreement and paying an administrative fee of 280 EUR + VAT for a single user account.

To lift the limit from your account, please download Sketch Engine Research Licence for Word Lists, fill it in, sign, scan and send it back to inquiries@sketchengine.co.uk together with your invoicing address and your Sketch Engine username.

Note though that we still reserve the right to apply some limits for technical reasons, usually at 10 million items.