m:n mapping

since manatee 2.67

Sketch Engine includes advanced support for parallel corpora.

In Sketch Engine, parallel corpora work as two (or more) independent corpora. To mark that two corpora are aligned (work as a parallel corpus), a special structure is required in each of the corpora. Starting with Manatee version 2.67, m:n (incl. m:0) alignment is supported.

The name of the alignment structure is defined in the ALIGNSTRUCT corpus attribute and defaults to align. In previous versions, the alignment had to be strictly 1:1, i.e. the same number of align tags was needed in each of the aligned corpora, and the name of the aligning structure was fixed to align. The two situations (m:n vs. 1:1) may, however, still be treated differently:

Data preparation

To use the m:n mapping, a file with mapping definition for each pair of corpora has to be prepared. The file consists of two tab-separated columns, each containing one of the following:

  • two non-negative integers A, B separated by a comma
    denoting a range of the aligning structure IDs from A (inclusive) to B (inclusive).
  • one non-negative integer A
    denoting a single structure ID A
  • -1
    denoting an empty alignment
  • two non-negative integers A:B separated by a colon
    denoting a range of sentences which are aligned 1:1 with the corresponding range in the second column.
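
The four column forms listed above can be told apart mechanically. The following is a minimal sketch, not part of Manatee: the names Entry, parse_entry and parse_line are hypothetical, and the check that a range runs upwards (A not greater than B) is an assumption rather than a documented rule.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    kind: str   # "empty", "single", "range" (A,B) or "one_to_one" (A:B)
    first: int  # first structure ID, or -1 for an empty alignment
    last: int   # last structure ID (inclusive), or -1 for an empty alignment


def parse_entry(text: str) -> Entry:
    """Parse one column of a mapping line into an Entry."""
    text = text.strip()
    if text == "-1":
        return Entry("empty", -1, -1)
    for separator, kind in ((",", "range"), (":", "one_to_one")):
        if separator in text:
            first, last = (int(part) for part in text.split(separator))
            # Assumption: a range starts at a non-negative ID and runs upwards.
            if first < 0 or last < first:
                raise ValueError(f"invalid range: {text!r}")
            return Entry(kind, first, last)
    value = int(text)
    if value < 0:
        raise ValueError(f"invalid structure ID: {text!r}")
    return Entry("single", value, value)


def parse_line(line: str) -> Tuple[Entry, Entry]:
    """Split a tab-separated mapping line into its two parsed columns."""
    left, right = line.rstrip("\n").split("\t")
    return parse_entry(left), parse_entry(right)
```

For example, parse_line("1\t1,3") yields a "single" entry for ID 1 in the first column and a "range" entry covering IDs 1 to 3 in the second.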

Structure ID refers to the order, counting from 0, in which structures appeared in the source vertical file and were indexed. To get the structure ID from, e.g., a unique attribute of the structure, use the str2id() method of the PosAttr object, as illustrated below for the corpus “test”, the aligning structure “s” and its attribute “id”:

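import manatee

c = manatee.Corpus("test")   # open the corpus "test"
a = c.get_attr("s.id")       # attribute "id" of the structure "s"
# Look up the structure ID for the attribute value passed as the argument.
struct_id = a.str2id("")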

A sample mapping file may look as follows:

0    0
1    1,3
-1   4
2,4  5
5    -1
6:8  6:8

This file says that the first structure (ID 0) of the first corpus is mapped to the first structure of the second corpus; the second structure (ID 1) of the first corpus is mapped to the second to fourth structures of the second corpus; the fifth structure (ID 4) of the second corpus is not mapped to anything; the third to fifth structures of the first corpus are mapped to the sixth structure of the second corpus; the sixth structure of the first corpus is not aligned to any structure; and the seventh, eighth and ninth structures of the first corpus are mapped 1:1 to the seventh, eighth and ninth structures of the second corpus.

Also note that all structures in both corpora must be covered by the mapping.
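
The coverage requirement can be checked before compilation. The sketch below is an illustration only: the function names are hypothetical, and the number of structures in each corpus is assumed to be known and passed in by the caller. It expands each column entry into the IDs it covers and reports any ID that never appears.

```python
from typing import Iterable, List


def ids_in_entry(text: str) -> List[int]:
    """Return the structure IDs covered by one column entry."""
    text = text.strip()
    if text == "-1":            # empty alignment covers nothing
        return []
    for separator in (",", ":"):
        if separator in text:   # A,B and A:B both span A..B inclusive
            first, last = (int(part) for part in text.split(separator))
            return list(range(first, last + 1))
    return [int(text)]          # single structure ID


def uncovered_ids(lines: Iterable[str], column: int, structure_count: int) -> List[int]:
    """List the IDs below structure_count that the given column never mentions."""
    seen = set()
    for line in lines:
        cells = line.rstrip("\n").split("\t")
        seen.update(ids_in_entry(cells[column]))
    return [i for i in range(structure_count) if i not in seen]


# The sample mapping file shown above, with tab-separated columns.
SAMPLE = ["0\t0", "1\t1,3", "-1\t4", "2,4\t5", "5\t-1", "6:8\t6:8"]
missing_first = uncovered_ids(SAMPLE, 0, 9)
missing_second = uncovered_ids(SAMPLE, 1, 9)
```

With nine structures in each corpus, both lists are empty for the sample file, which means the mapping covers every structure.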

Changes in corpus configuration for m:n mapping

First, you need to set ALIGNSTRUCT to the name of your aligning structure (if it is not “align”), e.g.:

ALIGNSTRUCT "s"

Then you define which corpora are aligned with this corpus:

ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"

And finally, you provide a mapping definition file for each of these corpora:

ALIGNDEF "/path/to/mapping/file/for/aligned_corpus_id_1,/path/to/mapping/file/for/aligned_corpus_id_2"

ALIGNED and ALIGNDEF must contain the same number of comma-delimited items in the right order (the first item in ALIGNDEF is the definition file with mapping to the first corpus in ALIGNED etc.).
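
A quick consistency check of the two values might look like the sketch below. It is not part of Manatee or compilecorp; the function name is hypothetical, and it assumes that neither corpus names nor file paths contain commas.

```python
from typing import Dict


def pair_aligned_with_definitions(aligned: str, aligndef: str) -> Dict[str, str]:
    """Pair each corpus in ALIGNED with the file at the same position in ALIGNDEF."""
    corpora = [item.strip() for item in aligned.split(",")]
    files = [item.strip() for item in aligndef.split(",")]
    if len(corpora) != len(files):
        raise ValueError(
            f"ALIGNED has {len(corpora)} items but ALIGNDEF has {len(files)}"
        )
    # Order matters: the first file belongs to the first corpus, and so on.
    return dict(zip(corpora, files))


pairs = pair_aligned_with_definitions(
    "aligned_corpus_id_1,aligned_corpus_id_2",
    "/path/to/mapping/file/for/aligned_corpus_id_1,"
    "/path/to/mapping/file/for/aligned_corpus_id_2",
)
```

A mismatch in the number of items raises a ValueError, which is easier to diagnose than a failed compilation.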

Compilation of corpora with m:n mapping

If you set ALIGNED and ALIGNDEF properly, compilecorp will compile all necessary indices for you. Alternatively, you may manually compile the index for each pair of aligned corpora by running:

mkalign <DEFINITION_FILE_CORPUS1-TO-CORPUS2> <PATH_FOR_THE_MAPPING_FILE>

where <PATH_FOR_THE_MAPPING_FILE> is the location where the new index will be built; it should be set to <CORPUS1_DATA_PATH>/align.<CORPUS2_NAME>.
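
When several corpus pairs need a manual build, the two mkalign arguments can be assembled programmatically. The helper below is a hypothetical sketch: it only builds the argument list in the form given above and does not run anything, and the paths in the example are made up.

```python
import posixpath
from typing import List


def mkalign_command(definition_file: str, corpus1_data_path: str,
                    corpus2_name: str) -> List[str]:
    """Build the mkalign argument list for one pair of aligned corpora."""
    # The index is built at <CORPUS1_DATA_PATH>/align.<CORPUS2_NAME>.
    index_path = posixpath.join(corpus1_data_path, "align." + corpus2_name)
    return ["mkalign", definition_file, index_path]


# Hypothetical paths and corpus name, for illustration only.
command = mkalign_command("/maps/en-de.txt", "/corpora/en/data", "de")
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where mkalign is installed.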

Helper scripts

The following helper scripts are attached to this page and you may find them useful when creating the alignment definition files:

  • calign.py

Takes a sample XML format (you may want to modify this part of the script), two encoded corpora and the name of the mapping structure attribute. It looks up the structure IDs in both corpora according to the attribute values in the XML file and produces an alignment definition file. The output should be processed by fixgaps.py and compressrng.py.

  • transalign.py

Takes two alignment definition files L2-L1 and L3-L1 and computes a new one L2-L3. The output should be processed by fixgaps.py and compressrng.py.

  • fixgaps.py

Inserts empty alignments into an existing alignment file where gaps are found.

  • compressrng.py

Compresses consecutive empty alignments into one range. May significantly reduce the size of an alignment definition file.
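
To make the idea of gap filling concrete, here is one possible reading of it as a sketch. This is not the attached fixgaps.py and may differ from what that script does. It assumes the rows are sorted by structure ID on both sides, represents each side as either None (empty alignment) or an inclusive (first, last) pair, and does not handle gaps after the last row, since that would require the total structure counts.

```python
from typing import List, Optional, Tuple

Side = Optional[Tuple[int, int]]   # None = empty alignment, else (first, last)
Row = Tuple[Side, Side]


def fill_gaps(rows: List[Row]) -> List[Row]:
    """Insert empty alignments for structure IDs skipped on either side."""
    result: List[Row] = []
    next_left = next_right = 0      # next ID expected in each column
    for left, right in rows:
        if left is not None:
            # IDs skipped in the first corpus are aligned to nothing.
            for missing in range(next_left, left[0]):
                result.append(((missing, missing), None))
            next_left = left[1] + 1
        if right is not None:
            # IDs skipped in the second corpus are aligned to nothing.
            for missing in range(next_right, right[0]):
                result.append((None, (missing, missing)))
            next_right = right[1] + 1
        result.append((left, right))
    return result


rows = [((0, 0), (0, 0)), ((2, 2), (1, 1)), ((3, 3), (4, 4))]
filled = fill_gaps(rows)
```

In this example, ID 1 of the first corpus and IDs 2 and 3 of the second corpus are missing from the input, so three empty alignments are inserted before the rows that follow them.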

The usual pipeline is:

calign.py <CORPUS_L1> <CORPUS_L2> <MAPPING_STRUCTATTR> <MAPPING_FILE_L1-L2> | ./fixgaps.py | ./compressrng.py

or

transalign.py <MAPPING_L2-L1> <MAPPING_L3-L1> | ./fixgaps.py | ./compressrng.py