Sketch Engine includes advanced support for parallel corpora. In Sketch Engine, a parallel corpus is represented as two (or more) independent corpora. To mark corpora as aligned (i.e. working as a parallel corpus), a special aligning structure is required in each of them. Starting with Manatee version 2.67 (the version of the system), m:n alignment (including m:0) is supported, and the name of the aligning structure is defined in the ALIGNSTRUCT corpus attribute, which defaults to align. In previous versions, the alignment had to be strictly 1:1, i.e. the same number of align tags was needed in each of the aligned corpora, and the name of the aligning structure was fixed to align. The two situations (m:n vs. 1:1) may, however, still be treated differently:
m:n mapping (Manatee 2.67 and higher)
To use m:n mapping, you need to provide a mapping definition file for each pair of corpora. The file consists of two tab-separated columns, each containing one of the following:
- two non-negative integers A,B separated by a comma; denoting a range of aligning structure IDs from A (inclusive) to B (inclusive).
- one non-negative integer A; denoting a single structure ID A
- -1; denoting an empty alignment
- two non-negative integers A:B separated by a colon; denoting a range of sentences which are aligned 1:1 with the corresponding range in the second column.
By structure ID we mean the zero-based position of a structure in the order in which the structures appeared in the source vertical file and were indexed. To obtain the structure ID from, for example, a unique attribute of the structure, use the str2id() method of the PosAttr object, as illustrated below for the corpus "test", the aligning structure "s" and its attribute "id":
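import manatee
c = manatee.Corpus("test")    # open the corpus "test"
a = c.get_attr("s.id")        # PosAttr for the attribute "id" of the structure "s"
struct_id = a.str2id("")      # replace "" with the attribute value to look up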
A sample mapping file may look as follows:
This file says that the first structure (ID 0) of the first corpus is mapped to the first structure of the second corpus; the second structure (ID 1) of the first corpus is mapped to the second to fourth structures of the second corpus; the fifth structure (ID 4) of the second corpus is not mapped to anything; the third to fifth structures of the first corpus are mapped to the 6th structure of the second corpus; the 6th structure of the first corpus is not aligned to any structure; and the remaining structures are aligned one to one (7th to 7th, 8th to 8th and 9th to 9th).
Also note that all structures in both corpora must be covered by the mapping.
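The coverage requirement can be checked before compilation with a short script. The following is only a sketch: parse_cell, covered_ids and covers_all are hypothetical helpers, not part of Manatee, and SAMPLE is a reconstruction based on the prose description of the sample file above (the last three alignments are assumed to be written as one colon range), not the original listing.

```python
def parse_cell(cell):
    """Turn one mapping cell into (kind, list of structure IDs).

    kind is "empty" for -1, "pairwise" for an A:B range,
    and "group" for a single ID or an A,B range.
    """
    cell = cell.strip()
    if cell == "-1":
        return "empty", []
    for sep, kind in ((",", "group"), (":", "pairwise")):
        if sep in cell:
            first, last = (int(part) for part in cell.split(sep))
            if first < 0 or last < first:
                raise ValueError(f"bad range: {cell!r}")
            return kind, list(range(first, last + 1))
    value = int(cell)
    if value < 0:
        raise ValueError(f"bad structure ID: {cell!r}")
    return "group", [value]


def covered_ids(mapping_text):
    """Collect the structure IDs mentioned in each column, in file order."""
    left, right = [], []
    for line in mapping_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        col1, col2 = line.split("\t")
        kind1, ids1 = parse_cell(col1)
        kind2, ids2 = parse_cell(col2)
        # A colon range is aligned 1:1, so both sides need the same length.
        if "pairwise" in (kind1, kind2) and len(ids1) != len(ids2):
            raise ValueError(f"unequal 1:1 ranges: {line!r}")
        left.extend(ids1)
        right.extend(ids2)
    return left, right


def covers_all(ids, structure_count):
    """True if every ID from 0 to structure_count - 1 is mentioned."""
    return set(ids) == set(range(structure_count))


# Reconstructed sample (an assumption, see above): 9 structures per corpus.
SAMPLE = "0\t0\n1\t1,3\n-1\t4\n2,4\t5\n5\t-1\n6:8\t6:8\n"

left, right = covered_ids(SAMPLE)
print(covers_all(left, 9), covers_all(right, 9))
```

The structure counts passed to covers_all have to come from the corpora themselves; the sketch only checks that no ID is left out and does not verify ordering or duplicates.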
Changes in corpus configuration for m:n mapping
First, you need to set ALIGNSTRUCT to the name of your aligning structure (if it is not "align"), e.g.:
Then you define which corpora are aligned with this corpus:
And finally you provide a mapping definition file for each of these corpora:
ALIGNED and ALIGNDEF must contain the same number of comma-delimited items in the right order (the first item in ALIGNDEF is the definition file with mapping to the first corpus in ALIGNED etc.).
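The pairing rule can be verified mechanically. The sketch below assumes that the three settings appear in the corpus configuration as top-level lines of the form NAME "value"; the corpus names and mapping file names in EXAMPLE are hypothetical placeholders, and read_settings and alignment_pairs are illustrative helpers rather than Manatee functions.

```python
import re

# Hypothetical configuration fragment; all values are placeholders.
EXAMPLE = '''
ALIGNSTRUCT "s"
ALIGNED "example_de,example_fr"
ALIGNDEF "en-de.map,en-fr.map"
'''


def read_settings(config_text):
    """Read top-level `NAME value` or `NAME "value"` lines into a dict."""
    settings = {}
    for line in config_text.splitlines():
        match = re.match(r'^([A-Z]+)\s+"?([^"]*)"?\s*$', line)
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


def split_items(value):
    """Split a comma-delimited value, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def alignment_pairs(settings):
    """Pair each corpus in ALIGNED with its definition file in ALIGNDEF."""
    corpora = split_items(settings.get("ALIGNED", ""))
    def_files = split_items(settings.get("ALIGNDEF", ""))
    if len(corpora) != len(def_files):
        raise ValueError(
            f"ALIGNED has {len(corpora)} items, ALIGNDEF has {len(def_files)}"
        )
    # Order matters: the n-th definition file belongs to the n-th corpus.
    return list(zip(corpora, def_files))


for corpus, def_file in alignment_pairs(read_settings(EXAMPLE)):
    print(corpus, def_file)
```

Pairing by position mirrors the rule stated above; a mismatch in the number of items is reported as an error instead of being silently truncated by zip.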
Compilation of corpora with m:n mapping
If you set ALIGNED and ALIGNDEF properly, compilecorp will compile all necessary indices for you. Alternatively, you may manually compile the index for each pair of aligned corpora by running: