What is a virtual corpus?

A virtual corpus is a corpus that is not compiled from source vertical files. Instead, a set of existing corpora or corpus parts (i.e. subcorpora) is specified as its underlying source data.

The virtual corpus functionality is available from Manatee version 2.88.

Why use virtual corpora?

If you need to combine several corpora or subcorpora, you can build a virtual corpus. Preparing and setting up a virtual corpus is easier and faster than compiling a new corpus from the source vertical files. The resulting virtual corpus takes only a fraction of the disk space that its non-virtual counterpart would.

How to set up a virtual corpus?

1. Create a virtual corpus definition file

A virtual corpus definition file is a plain text file specifying which corpora will be used to create the virtual corpus. It consists of a list of parts, each part having the following format:

=<CORPUS_NAME>
<from_position>,<end_position>
<from_position>,<end_position>
...

This says that the part of corpus <CORPUS_NAME> starting at <from_position> (inclusive) and ending at <end_position> (exclusive) should be included. A dollar sign (‘$’) can be used in place of <end_position> to denote the end of the corpus.

Example:

=bnc
1000,2500
3500,4500

=susanne
0,$

A virtual corpus using this definition file would consist of the whole “susanne” corpus and two parts of the “bnc” corpus (positions 1000–2499 and 3500–4499, since end positions are exclusive).
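To make the format concrete, here is a minimal Python sketch that reads a definition in the layout described above and adds up how many positions it selects. It is not part of Manatee: the function names parse_definition and token_count are hypothetical, the size given for “susanne” is a made-up placeholder, and the sketch assumes that blank lines may be skipped (as in the example) and that the file contains nothing other than =<CORPUS_NAME> lines and ranges.

```python
def parse_definition(text):
    """Parse a definition into [(corpus_name, [(start, end), ...]), ...].

    An end of None stands for '$', i.e. the end of the corpus.
    """
    parts = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines only separate parts
        if line.startswith("="):
            parts.append((line[1:].strip(), []))
            continue
        if not parts:
            raise ValueError("range given before any =<CORPUS_NAME> line")
        start, end = (field.strip() for field in line.split(","))
        parts[-1][1].append((int(start), None if end == "$" else int(end)))
    return parts


def token_count(parts, corpus_sizes):
    """Sum the lengths of all ranges; end positions are exclusive."""
    total = 0
    for name, ranges in parts:
        for start, end in ranges:
            if end is None:
                end = corpus_sizes[name]  # '$' needs the size of the source corpus
            total += end - start
    return total


definition = """\
=bnc
1000,2500
3500,4500

=susanne
0,$
"""

parts = parse_definition(definition)
# 150000 is a placeholder, not the real size of the susanne corpus.
size = token_count(parts, {"susanne": 150000})
print(parts)
print(size)
```

Because the end position is exclusive, each range contributes end minus start positions, so the two “bnc” ranges above contribute 1500 and 1000 positions respectively.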

2. Create a virtual corpus configuration file

This is an ordinary corpus configuration file, as used for non-virtual corpora, with two points to note:

  • You can start from the configuration file of one of the corpora that the virtual corpus consists of. However, you have to make sure that all attributes and structures specified in the virtual corpus configuration file are present in ALL of its parts.
  • Instead of specifying the input vertical source file with the VERTICAL attribute, use the VIRTUAL attribute, which should contain the full path to the virtual corpus definition file created in step 1.

Example:

NAME "Susanne + Bnc"
PATH /corpora/manatee/virtual_english
VIRTUAL /corpora/virtdef/virtual_english # this is the virtual corpus definition file

ATTRIBUTE word
ATTRIBUTE tag
ATTRIBUTE lemma
...
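The requirement that every attribute be present in all parts can be checked before compiling. The following Python sketch is one rough way to do so, under stated assumptions: it only looks at top-level lines of the form ATTRIBUTE <name>, ignores structures and any nested blocks, and works on configuration text that you load yourself. The helper names and the two part configurations are hypothetical and invented for illustration.

```python
import re

# Matches lines such as: ATTRIBUTE word   or   ATTRIBUTE "word"
ATTRIBUTE_LINE = re.compile(r'^[ \t]*ATTRIBUTE[ \t]+"?([\w.]+)"?', re.MULTILINE)


def attribute_names(config_text):
    """Return the set of attribute names declared in a configuration text."""
    return set(ATTRIBUTE_LINE.findall(config_text))


def missing_attributes(virtual_config, part_configs):
    """Map each part name to the attributes it lacks (parts lacking none are omitted)."""
    wanted = attribute_names(virtual_config)
    problems = {}
    for name, text in part_configs.items():
        missing = sorted(wanted - attribute_names(text))
        if missing:
            problems[name] = missing
    return problems


virtual_config = """\
NAME "Susanne + Bnc"
ATTRIBUTE word
ATTRIBUTE tag
ATTRIBUTE lemma
"""

# Invented part configurations, not the real bnc or susanne settings.
part_configs = {
    "bnc": "ATTRIBUTE word\nATTRIBUTE tag\nATTRIBUTE lemma\n",
    "susanne": "ATTRIBUTE word\nATTRIBUTE tag\n",
}

problems = missing_attributes(virtual_config, part_configs)
print(problems)
```

An empty result means that, as far as this simplified check can tell, every attribute named in the virtual corpus configuration is declared in each part.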

3. Compile the virtual corpus

Compiling a virtual corpus is done by using the mkvirt command instead of encodevert:

>mkvirt
Usage: mkvirt [-d] [-a ATTRLIST] CORPUS
Options:
-d           skip creating dynamic attributes
-a ATTRLIST  compile only attributes in comma delimited
             ATTRLIST, may contain <struct>.<attr> attributes

You can also use the compilecorp wrapper script as usual – it will detect the VIRTUAL attribute and automatically use mkvirt instead of encodevert.
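
If compilation is driven from a script, the mkvirt options shown in the usage text can be assembled as an argument list. The Python sketch below only builds that list and does not run anything; the function name mkvirt_command is hypothetical, and “virtual_english” is assumed here to be the value expected for the CORPUS argument.

```python
def mkvirt_command(corpus, skip_dynamic=False, attributes=None):
    """Build an argument list following: mkvirt [-d] [-a ATTRLIST] CORPUS."""
    command = ["mkvirt"]
    if skip_dynamic:
        command.append("-d")  # skip creating dynamic attributes
    if attributes:
        # ATTRLIST is comma delimited and may contain <struct>.<attr> names
        command += ["-a", ",".join(attributes)]
    command.append(corpus)
    return command


# Assumption: "virtual_english" identifies the virtual corpus configuration.
command = mkvirt_command("virtual_english", attributes=["word", "lemma"])
print(command)
```

On a machine where mkvirt is installed, the resulting list could be passed to subprocess.run(command, check=True).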