What is corpus compilation?

Each user corpus has to be compiled before it can be used. Compilation runs several tools over the corpus data so that the complete Sketch Engine functionality is available. This includes the computation of Word Sketches, the thesaurus, n-grams and trends.

Some other tools may be applied to specific languages or special-use corpora.

Recompile a corpus

A corpus has to be recompiled each time new data are added or new functionality is to be made available, for example, when a new word sketch grammar or term grammar should be applied to the data.

How to (re-)compile a corpus

A user corpus created from the web (WebBootCaT) is compiled automatically. Sometimes, however, it might be necessary to start compilation manually.

A corpus created by uploading files has to be compiled manually.

Follow these steps:

  • click Home
  • click My own
  • find the corpus and open it by clicking its name
  • click Manage corpus in the left menu
  • click Compile corpus

Corpus compilation options

1 Select the word sketch grammar to be applied. Use the recommended option if in doubt. Selecting None will disable Word Sketches for this corpus.

2 If you want to use your own word sketch grammar, upload it here.

3 Select the term definition (term grammar). Use the default option if in doubt. Selecting None will disable term extraction for this corpus.

4 Choose the name of the structure that should surround the content of each file in the corpus. In the case of a corpus created from the web, the content of each web page will be enclosed in this structure. Use the default option if in doubt. If you know what you are doing, use, for example, doc, document, text, page or site.

5 Tick to activate deduplication. When active, identical and very similar content will be identified and only one instance will be kept. Use 6 to indicate the level at which the content should be compared.

Available deduplication options

structure name for files – this is the structure set in 4. If the content of two files (for example, two web pages) is identified as identical or very similar, one of them will be removed.
p – paragraph. If two or more paragraphs anywhere in the corpus are identified as identical or very similar, only one will be kept and the rest will be deleted. This may result in a paragraph being removed from a text while the rest of the text is kept in the corpus with the paragraph missing.
s – sentence. As above but at the sentence level.

Note that structure names for paragraph or sentence might be different in each corpus. The dropdown might also contain additional structures if they exist in the corpus.
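The effect of paragraph-level deduplication can be illustrated with a small Python sketch. It is not the tool Sketch Engine runs during compilation: the function names are made up for this example, and it only catches paragraphs that are identical after lowercasing and whitespace normalisation, whereas real deduplication also catches very similar content.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: list[list[str]]) -> list[list[str]]:
    """Keep only the first occurrence of each paragraph across the whole corpus.

    Each document is a list of paragraphs. Hypothetical helper for illustration only.
    """
    seen: set[str] = set()
    result = []
    for paragraphs in documents:
        kept = []
        for paragraph in paragraphs:
            key = normalize(paragraph)
            if key in seen:
                continue  # a copy was already kept elsewhere in the corpus
            seen.add(key)
            kept.append(paragraph)
        result.append(kept)
    return result


corpus = [
    ["Welcome to our site.", "Cookies help us deliver our services."],
    ["A corpus is a collection of texts.", "Cookies help us  deliver our services."],
]
deduplicated = deduplicate(corpus)
```

In this example the second document loses its repeated cookie notice but keeps its other paragraph, which is the "text kept with the paragraph missing" situation described above.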

7 This is a complete list of the structures found in the corpus. Tick the ones which should be kept. The unticked ones will be converted to corpus text and will be treated as words/tokens. If in doubt, keep all of them ticked.
It is recommended that you keep at least the g (glue) structure and the structures for sentences and paragraphs (s and p in the screenshot).
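What unticking a structure means can be sketched in Python. The sketch assumes, purely for illustration, that the corpus is stored with one token or one structure tag per line; the regular expression and the function name are hypothetical and do not describe how Sketch Engine processes the data internally.

```python
import re

# Matches an opening, closing or self-closing tag and captures the structure name.
TAG = re.compile(r"^</?([A-Za-z_][\w-]*)[^>]*>$")


def classify(lines: list[str], kept_structures: set[str]) -> list[tuple[str, str]]:
    """Label each line as a 'structure' (ticked) or a 'token' (everything else)."""
    labelled = []
    for line in lines:
        match = TAG.match(line)
        if match and match.group(1) in kept_structures:
            labelled.append(("structure", line))
        else:
            # Tags of unticked structures end up here and behave like ordinary words.
            labelled.append(("token", line))
    return labelled


lines = ["<doc>", "<s>", "Hello", "<b>", "world", "</b>", "<g/>", "!", "</s>", "</doc>"]
classified = classify(lines, kept_structures={"doc", "s", "g"})
tokens = [text for kind, text in classified if kind == "token"]
```

Here the b structure is not ticked, so its tags land in the token list next to the real words. This is why it is usually safer to leave structures ticked unless you are sure they are not needed.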

8 Click Compile to start the corpus compilation.