Sketch Engine includes advanced support for parallel corpora. In Sketch Engine, a parallel corpus works as two (or more) independent corpora. To mark that the corpora are aligned (i.e. that they work as a parallel corpus), a special structure is required in each of them. Starting with Manatee version 2.67 (the system version), m:n (including m:0) alignment is supported, and the name of the aligning structure is defined in the ALIGNSTRUCT corpus attribute, which defaults to align. In previous versions, the alignment had to be strictly 1:1, i.e. the same number of align tags was needed in each of the aligned corpora, and the name of the aligning structure was fixed to align. The two situations (m:n vs. 1:1) may still be treated differently:
m:n mapping (Manatee 2.67 and higher)
To be able to use m:n mapping, you need to provide a file with mapping definition for each pair of corpora. The file consists of two tab-separated columns, each containing one of the following:
- two non-negative integers A, B separated by a comma; denoting a range of the aligning structure IDs from A (inclusive) to B (inclusive).
- one non-negative integer A; denoting a single structure ID A
- -1; denoting an empty alignment
- two non-negative integers A:B separated by a colon; denoting a range of sentences which are aligned 1:1 with the corresponding range in the second column.
By structure ID, we refer here to the order in which structures appeared in the source vertical file and have been indexed. To get the structure ID e.g. from a unique attribute of the structure, the str2id() method of the PosAttr object can be used, as is illustrated below for the corpus "test", aligning structure "s" and its attribute "id":
import manatee
c = manatee.Corpus("test")
a = c.get_attr("s.id")
# pass the value of the "id" attribute; the result is the structure ID
struct_id = a.str2id("")
A sample mapping file may look as follows:
This file says that the first structure (ID 0) of the first corpus is mapped to the first structure of the second corpus; the second structure (ID 1) of the first corpus is mapped to the second to fourth structures of the second corpus; the fifth structure (ID 4) of the second corpus is not mapped to anything; the third to fifth structures of the first corpus are mapped to the sixth structure of the second corpus; the sixth structure of the first corpus is not aligned to any structure; and the remaining structures are mapped 1:1 (7th to 7th, 8th to 8th and 9th to 9th).
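The column formats listed above can be read with a few lines of Python. The following is a minimal sketch, not part of Manatee: the names parse_cell, ids_of and expand_line are made up for illustration, and SAMPLE is a mapping file reconstructed from the description above rather than copied from an original file.

```python
def parse_cell(cell):
    """Classify one column value and return (kind, first_id, last_id)."""
    cell = cell.strip()
    if cell == "-1":
        return ("empty", None, None)          # empty alignment
    if ":" in cell:
        first, last = (int(part) for part in cell.split(":"))
        return ("one_to_one", first, last)    # A:B, aligned 1:1 with the other column
    if "," in cell:
        first, last = (int(part) for part in cell.split(","))
        return ("range", first, last)         # A,B, inclusive range treated as one group
    value = int(cell)
    return ("single", value, value)           # a single structure ID


def ids_of(kind, first, last):
    """List the structure IDs covered by a parsed cell."""
    return [] if kind == "empty" else list(range(first, last + 1))


def expand_line(line):
    """Expand one mapping line into a list of (left_ids, right_ids) pairs."""
    left, right = line.rstrip("\n").split("\t")
    lkind, lfirst, llast = parse_cell(left)
    rkind, rfirst, rlast = parse_cell(right)
    if "one_to_one" in (lkind, rkind):
        if lkind != rkind or llast - lfirst != rlast - rfirst:
            raise ValueError("A:B ranges must appear in both columns with equal length")
        return [([lfirst + i], [rfirst + i]) for i in range(llast - lfirst + 1)]
    return [(ids_of(lkind, lfirst, llast), ids_of(rkind, rfirst, rlast))]


# Reconstructed from the prose description (an assumption, not the original sample).
SAMPLE = "0\t0\n1\t1,3\n-1\t4\n2,4\t5\n5\t-1\n6:8\t6:8\n"

alignments = [pair for line in SAMPLE.splitlines() for pair in expand_line(line)]
for left_ids, right_ids in alignments:
    print(left_ids, "->", right_ids)
```

Expanding the A:B ranges into individual pairs makes it easy to compare the result with the description line by line; a real tool would more likely keep the ranges compact.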
Changes in corpus configuration for m:n mapping
First, you need to set ALIGNSTRUCT to your mapping structure (if it is not the default "align"), e.g.:
Then you define which corpora are aligned with this corpus:
And finally, you provide a mapping definition file for each of these corpora:
ALIGNED and ALIGNDEF must contain the same number of comma-delimited items in the right order (the first item in ALIGNDEF is the definition file with mapping to the first corpus in ALIGNED etc.).
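The pairing rule for ALIGNED and ALIGNDEF can be checked before compilation. The sketch below is only an illustration: the corpus names, file paths and the helper names read_option and pair_aligned_with_defs are hypothetical, and the configuration text assumes the simple NAME "value" form used for ALIGNSTRUCT elsewhere on this page.

```python
import re

# Hypothetical configuration fragment; all names and paths are made up.
CONFIG = '''
ALIGNSTRUCT "s"
ALIGNED "example_de,example_es"
ALIGNDEF "/corpora/aligndef/en_de.txt,/corpora/aligndef/en_es.txt"
'''


def read_option(config_text, name):
    """Return the comma-delimited items of a top-level option, or [] if absent."""
    pattern = r'^[ \t]*%s[ \t]+"?([^"\n]*)"?[ \t]*$' % re.escape(name)
    match = re.search(pattern, config_text, re.MULTILINE)
    if not match:
        return []
    return [item.strip() for item in match.group(1).split(",") if item.strip()]


def pair_aligned_with_defs(config_text):
    """Map each corpus in ALIGNED to the definition file at the same position."""
    aligned = read_option(config_text, "ALIGNED")
    definitions = read_option(config_text, "ALIGNDEF")
    if len(aligned) != len(definitions):
        raise ValueError(
            "ALIGNED has %d items but ALIGNDEF has %d" % (len(aligned), len(definitions))
        )
    return dict(zip(aligned, definitions))


for corpus, definition in pair_aligned_with_defs(CONFIG).items():
    print(corpus, "<-", definition)
```

The check only compares item counts and positions; it cannot tell whether a definition file really belongs to the corpus it is paired with.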
Compilation of corpora with m:n mapping
If you set ALIGNED and ALIGNDEF properly, compilecorp will compile all necessary indices for you. Alternatively, you may manually compile the index for each pair of aligned corpora by running:
where says where the new index is going to be built and should be set to the /align..
The following helper scripts are attached to this page and you may find them useful when creating the alignment definition files:
- calign.py: Takes a sample XML format (you may want to modify this part of the script), two encoded corpora and the name of the mapping structure attribute. It looks up the structure ID in both corpora according to the attribute values in the XML file and produces an alignment definition file. The output should be processed by fixgaps.py and compressrng.py.
- transalign.py: Takes two alignment definition files L2-L1 and L3-L1 and computes a new one L2-L3. The output should be processed by fixgaps.py and compressrng.py.
- fixgaps.py: Inserts empty alignments into an existing alignment file where gaps are found.
- compressrng.py: Compresses subsequent empty alignments into one range. May significantly reduce the size of an alignment definition file.
The usual pipeline is:
calign.py | ./fixgaps.py | ./compressrng.py
transalign.py | ./fixgaps.py | ./compressrng.py
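To make the idea of a gap concrete, the following sketch shows one plausible reading of gap fixing: any structure ID skipped in either column gets an explicit empty alignment. It is not the code of the attached fixgaps.py, whose exact behaviour is not described here; the function name fill_gaps and the in-memory representation (pairs of ID lists, with an empty list standing for -1) are assumptions made for illustration.

```python
def fill_gaps(pairs):
    """Insert empty alignments for structure IDs skipped in either column.

    `pairs` is a list of (left_ids, right_ids) tuples in file order, where
    each element is a list of consecutive structure IDs and [] means -1.
    """
    fixed = []
    next_left = 0    # first left-hand ID not yet covered
    next_right = 0   # first right-hand ID not yet covered
    for left_ids, right_ids in pairs:
        if left_ids:
            for missing in range(next_left, left_ids[0]):
                fixed.append(([missing], []))      # left structure aligned to nothing
            next_left = left_ids[-1] + 1
        if right_ids:
            for missing in range(next_right, right_ids[0]):
                fixed.append(([], [missing]))      # right structure aligned to nothing
            next_right = right_ids[-1] + 1
        fixed.append((left_ids, right_ids))
    return fixed


# Left ID 1 and right ID 3 are never mentioned in this hypothetical input.
example = [([0], [0]), ([2], [1, 2]), ([3], [4])]
for left_ids, right_ids in fill_gaps(example):
    print(left_ids or -1, right_ids or -1)
```

Runs of such inserted empty alignments are what a later compression step could merge into ranges.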
1:1 mapping is a type of alignment where all connected corpora have the same number of aligned structures. For example, if the sentence is the aligned structure, all corpora must have the same number of sentences.
A small example of two source vertical files that are suitable for processing as parallel corpora with the same number of sentences:
After that, compile the corpus and set ALIGNSTRUCT to "s". For more details, see the next section.
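Because 1:1 mapping depends on equal structure counts, it can help to count the aligning structures in each vertical file before compiling. The sketch below assumes the usual vertical layout with one token per line and structure tags such as <s> on their own lines; the function name count_structures and the tokenised sample texts are illustrative assumptions.

```python
import re


def count_structures(vertical_text, struct="s"):
    """Count opening tags of `struct` (with or without attributes) in vertical text."""
    opening = re.compile(r"<%s(\s[^>]*)?>$" % re.escape(struct))
    return sum(1 for line in vertical_text.splitlines() if opening.match(line.strip()))


# Hypothetical vertical data: one token per line, two sentences per language.
ENGLISH = "<s>\nI\nhave\na\nnice\ndog\n.\n</s>\n<s>\nIt\nruns\na\nlot\n.\n</s>\n"
GERMAN = "<s>\nIch\nhabe\neinen\nschönen\nHund\n.\n</s>\n<s>\nEs\nläuft\nsehr\nviel\n.\n</s>\n"

counts = {"english": count_structures(ENGLISH), "german": count_structures(GERMAN)}
print(counts)
if len(set(counts.values())) != 1:
    raise SystemExit("structure counts differ, 1:1 alignment is not possible")
```

Equal counts are necessary but not sufficient: the check cannot detect segments that are in a different order in the two files.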
Changes in corpus configuration for 1:1 mapping
Two new lines need to be added to the corpus configuration file of each of the aligned corpora. The first one is the declaration of the align structure:
From manatee 2.67 onwards you can just set ALIGNSTRUCT to an existing structure:
The second line is the list of IDs of all corpora that are aligned with the corpus:
With this setting, the Sketch Engine will find the aligned corpora and will be able to display parallel results.
Defining 1:1 parallel corpora using the web interface
Parallel corpora with 1:1 mapping can also be set up using the web interface. Follow these instructions:
1. After logging in to Sketch Engine, create two (or more) corpora containing the structure align (or another structure suitable for the alignment).
2. Set the alignment: Click on the corpus name (on the "My corpora" page), select "Configure corpus" in the sidebar, use the ALIGNED field to select the corpora to align with and save the form.
3. If the alignment structure is not align, alter the configuration of the corpus: Switch the corpus to expert mode, select "Configure corpus" in the sidebar, add ALIGNSTRUCT "structure" (use the actual structure name) and save the form.
4. Repeat steps 2 and 3 for all aligned corpora in the set.
All steps are done automatically if the corpora are created by uploading source data in a TMX file.
Parallel corpora in English, German and Spanish will be uploaded. The corpora will be aligned using structure align in the source data.
1. Create three corpora, one in each language.
2. Each corpus may consist of multiple files. Make sure the alphabetical order of the corresponding files is the same in all corpora, i.e. the first English file must correspond to the first German file and the first Spanish file, the second file to the second files, etc. Therefore, it is safe to prefix file names with a number to avoid aligning unrelated segments.
3. Make sure the source data contain the structure align to mark segments. No segment can be omitted. The order of segments matters: it must be the same in all aligned corpora. The structure must be added to the files before uploading them to the corpus.
You can also use alignment software such as hunalign. Manual correction of the output might be necessary.
English – 01_dog.txt
I have a nice dog.
It runs a lot.
German – 01_Hund.txt
Ich habe einen schönen Hund.
Es läuft sehr viel.
Spanish – 01_perro.txt
Tengo un buen perro.
4. Upload the source files into the corpora.
5. The corresponding align segments in data from all corpora will be automatically connected: the first together, the second together, etc.
6. Set the alignment – align each corpus to all other corpora in the set.
7. Recompile all three corpora.
8. Open any of the corpora in the corpus manager.
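The file-ordering rule in step 2 can be checked before uploading. The sketch below sorts the file names of each corpus and verifies that files in the same position share the same numeric prefix. The function name pair_files and the 02_ file names are hypothetical; note also that Python's sorted() orders by code point, which may differ from the ordering used by the web interface when names mix upper and lower case, so identical numeric prefixes are the safer guarantee.

```python
def pair_files(*file_lists):
    """Pair up files across corpora by sorted order and check their prefixes."""
    ordered = [sorted(names) for names in file_lists]
    if len({len(names) for names in ordered}) != 1:
        raise ValueError("corpora contain different numbers of files")
    rows = list(zip(*ordered))
    for row in rows:
        # The prefix is everything before the first underscore, e.g. "01".
        prefixes = {name.split("_", 1)[0] for name in row}
        if len(prefixes) != 1:
            raise ValueError("mismatched files in one row: %s" % ", ".join(row))
    return rows


english = ["02_cat.txt", "01_dog.txt"]
german = ["01_Hund.txt", "02_Katze.txt"]
spanish = ["01_perro.txt", "02_gato.txt"]

for row in pair_files(english, german, spanish):
    print(" | ".join(row))
```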
Download: helper script for parallel corpora