What is a subcorpus?

Each corpus can (but does not have to) be divided into smaller parts called subcorpora. Subcorpora can divide the corpus by text type (fiction, newspaper), medium (spoken, written), time (e.g. by year) or any other criterion. Subcorpora can overlap: the same segment can appear in every subcorpus it belongs to.

Concordance searches and word lists can make use of subcorpora by searching only one part of the corpus or by providing statistics of the same phenomenon in different subcorpora, e.g. in written vs. spoken language or in fiction vs. newspaper.
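The comparison described above can be illustrated with a minimal sketch. It is not how the system computes its statistics. The toy documents, the attribute names `mode` and `genre`, and the helper `relative_frequency` are all assumptions made for illustration. It shows two things: subcorpora may overlap, and the same word can be measured separately in each of them.

```python
from collections import Counter

# Hypothetical toy corpus: each document has text-type attributes and tokens.
documents = [
    {"mode": "spoken", "genre": "fiction",
     "tokens": "well you know it was fine".split()},
    {"mode": "written", "genre": "fiction",
     "tokens": "it was a dark night".split()},
    {"mode": "written", "genre": "newspaper",
     "tokens": "the minister said it was over".split()},
]

# Subcorpora may overlap: the second document is both "written" and "fiction".
subcorpora = {
    "spoken": [d for d in documents if d["mode"] == "spoken"],
    "written": [d for d in documents if d["mode"] == "written"],
    "fiction": [d for d in documents if d["genre"] == "fiction"],
}


def relative_frequency(word, docs):
    """Share of tokens in docs that equal word (0.0 for an empty subcorpus)."""
    tokens = [t for d in docs for t in d["tokens"]]
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


# The same phenomenon (the word "was") measured in each subcorpus.
for name, docs in subcorpora.items():
    print(name, round(relative_frequency("was", docs), 3))
```

Normalising by subcorpus size matters here because subcorpora usually differ in size, so raw counts are not directly comparable.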

How to create a subcorpus?

A corpus can be divided into subcorpora using a configuration file, or subcorpora can be created later. This page explains the latter. Subcorpora created this way are only available to the user who created them. Expert users can set up subcorpora shared with all users.

You will be able to make use of your subcorpora in Concordance searches and word lists.

A subcorpus from text types

This procedure will create a subcorpus from text types. This option can only be used if the corpus was annotated for text types.

detailed instructions

To access the text type selection screen:

  • click Home in the left menu, select a corpus, click the Text types link under the search box and click create new

OR

  • click create new wherever the subcorpus dropdown appears in the system

The following screen will be displayed:

[Screenshot: create subcorpus from text types]

  • type the name of your subcorpus
  • select the types you wish to include in your subcorpus
  • you can select as many text types from as many groups as you wish
  • selecting all in a group is the same as selecting none
  • selected items from the same group are interpreted as OR
  • selections in different groups are interpreted as AND

In the example, a document will be included in the subcorpus if it is (Spoken context-governed OR Written-to-be-spoken) AND at the same time comes from a (Book OR Periodical).
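The selection rules above can be sketched in a few lines of Python. This is only an illustration of the OR-within-a-group, AND-across-groups logic, not the system's implementation. The group names `mode` and `medium`, the value `Other`, the document ids and the function `matches_selection` are assumptions. An empty selection in a group is treated as "no restriction", mirroring the rule that selecting none is the same as selecting all.

```python
from typing import Dict, Set


def matches_selection(doc_attrs: Dict[str, str],
                      selection: Dict[str, Set[str]]) -> bool:
    """True if the document satisfies every group that has a selection."""
    for group, chosen in selection.items():
        if not chosen:
            continue  # nothing selected in this group: no restriction
        if doc_attrs.get(group) not in chosen:
            return False  # groups combine with AND
    return True  # within a group, any chosen value is enough (OR)


# Hypothetical documents; "Other" stands for any value that was not selected.
documents = [
    {"id": "d1", "mode": "Spoken context-governed", "medium": "Book"},
    {"id": "d2", "mode": "Written-to-be-spoken", "medium": "Periodical"},
    {"id": "d3", "mode": "Written-to-be-spoken", "medium": "Other"},
    {"id": "d4", "mode": "Other", "medium": "Book"},
]

selection = {
    "mode": {"Spoken context-governed", "Written-to-be-spoken"},
    "medium": {"Book", "Periodical"},
}

subcorpus_ids = [d["id"] for d in documents if matches_selection(d, selection)]
print(subcorpus_ids)  # ['d1', 'd2']
```

Documents d3 and d4 are excluded because each satisfies only one of the two groups.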

In the free text entry box, typing part of a value makes a drop-down box appear with suggestions. To add another option, type the pipe character ‘|’ and repeat the procedure. Click Documentation to view the available options.

Click “Create Subcorpus” at the bottom of the page.

A subcorpus from a concordance

A subcorpus can be created from concordance lines. The user can decide how much text should be included in the subcorpus by selecting the complete documents, the paragraphs or only the sentences that the concordance lines come from. These are only examples; different corpora can be divided into different structures.

A subcorpus from a concordance is especially useful with preloaded corpora. It lets the user select only the documents related to a certain topic and then generate concordances or word lists covering only that topic area.

detailed instructions

  • Open a corpus and make a concordance
    Think carefully about the search criteria so that the required documents are found.
  • Click on Make subcorpus in the left menu
  • Type a name for your new subcorpus
  • Select the structure in which the search word or phrase was found and which should be included in the subcorpus. The structure codes differ from corpus to corpus but usually:
    doc – the whole document (produces big subcorpora)
    p – the whole paragraph
    s – the whole sentence (produces small subcorpora)
    For a detailed description of the structures used in the corpus, see corpus information.
  • Click on the Save button
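The effect of the structure choice can be sketched as follows. This is a toy model, not the system's internals. The token ranges, the hit positions and the helpers `subcorpus_spans` and `size` are assumptions made for illustration. Each structure is modelled as a list of half-open token ranges, and the subcorpus keeps every range that contains at least one concordance hit.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)

# Hypothetical corpus of 20 tokens: 2 documents, 3 paragraphs, 6 sentences.
structures: Dict[str, List[Span]] = {
    "doc": [(0, 12), (12, 20)],
    "p": [(0, 6), (6, 12), (12, 20)],
    "s": [(0, 3), (3, 6), (6, 9), (9, 12), (12, 16), (16, 20)],
}


def subcorpus_spans(hits: List[int], structure: str) -> List[Span]:
    """Spans of the chosen structure that contain at least one hit position."""
    return [
        (start, end)
        for start, end in structures[structure]
        if any(start <= pos < end for pos in hits)
    ]


def size(spans: List[Span]) -> int:
    """Total number of tokens covered by the spans."""
    return sum(end - start for start, end in spans)


hits = [4, 13]  # token positions of two concordance lines
for name in ("doc", "p", "s"):
    spans = subcorpus_spans(hits, name)
    print(name, spans, size(spans))
```

With these two hits, choosing doc keeps all 20 tokens, p keeps 14 and s keeps 7, which matches the note that doc produces big subcorpora and s produces small ones.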

To search only the subcorpus, select the corpus again, go to the Text type search and select your subcorpus.

Delete a subcorpus

Note that you can only delete the subcorpora you created. You cannot delete subcorpora supplied with preloaded corpora.

To delete a subcorpus:

  • click Home and select a corpus
  • switch to the Text type search
  • select the subcorpus
  • click the info button
  • click delete