In a nutshell

  • First, create two monolingual corpora (see previous sections).
  • Then align the two corpora. Note that both corpora must contain a structure that defines the alignment. This step is described in the '1:1 mapping' section.
    • 1:1 mapping means that all aligned corpora have the same number of aligned structures (e.g. sentences).
  • Finally, when you open the aligned corpus, the corpus manager will also offer to query the other aligned corpus (provided both corpora are correctly uploaded, aligned and compiled).
  • If you encounter problems, or if your data do not contain the required structure, upload your corpora anyway (or send us a sample of your data) and our support team will help you.

Sketch Engine includes advanced support for parallel corpora. In Sketch Engine, parallel corpora work as two (or more) independent corpora. To mark that two corpora are aligned (i.e. work as a parallel corpus), a special structure is required in each of the corpora. Starting with Manatee version 2.67 (the version of the underlying system), m:n (including m:0) alignment is supported; the name of the aligning structure is defined in the ALIGNSTRUCT corpus attribute and defaults to align. In previous versions, the alignment had to be strictly 1:1 and the name of the aligning structure was fixed to align, i.e. each of the aligned corpora needed the same number of align tags. The two situations (m:n vs. 1:1) are still treated differently, as described below:

m:n mapping (Manatee 2.67 and higher)

Data preparation

To be able to use m:n mapping, you need to provide a file with mapping definition for each pair of corpora. The file consists of two tab-separated columns, each containing one of the following:

  • two non-negative integers A, B separated by a comma; denoting a range of the aligning structure IDs from A (inclusive) to B (inclusive).
  • one non-negative integer A; denoting a single structure ID A
  • -1; denoting an empty alignment
  • two non-negative integers A:B separated by a colon; denoting a range of sentences which are aligned 1:1 with the corresponding range in the second column.

By structure ID, we refer here to the order in which structures appeared in the source vertical file and have been indexed. To get the structure ID e.g. from a unique attribute of the structure, the str2id() method of the PosAttr object can be used, as is illustrated below for the corpus "test", aligning structure "s" and its attribute "id":

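import manatee

c = manatee.Corpus("test")    # open the compiled corpus "test"
a = c.get_attr("s.id")        # attribute "id" of the structure "s"
struct_id = a.str2id("")      # pass the attribute value to look up as the argument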

A sample mapping file may look as follows:

0    0
1    1,3
-1   4
2,4  5
5    -1
6:8  6:8

This file describes the following mapping:

  • the first structure (id 0) of the first corpus is mapped to the first structure of the second corpus;
  • the second structure (id 1) of the first corpus is mapped to the second to fourth structures of the second corpus;
  • the fifth structure (id 4) of the second corpus is not mapped to anything;
  • the third to fifth structures of the first corpus are mapped to the sixth structure of the second corpus;
  • the sixth structure of the first corpus is not aligned to anything;
  • the seventh, eighth and ninth structures are aligned 1:1 (seventh to seventh, eighth to eighth, ninth to ninth).
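The column format described above can be read with a few lines of Python. The following is a minimal sketch for checking a mapping file by eye; it is not part of Manatee or of the helper scripts. The function names parse_side and expand_line are hypothetical, and the sketch splits on any whitespace rather than strictly on tabs.

```python
def parse_side(field):
    """Parse one column of a mapping line.

    Returns (kind, first, last), where kind is "empty", "range" or "pairwise".
    """
    field = field.strip()
    if field == "-1":                      # empty alignment
        return ("empty", None, None)
    if ":" in field:                       # A:B, aligned 1:1 with the other column
        first, last = (int(x) for x in field.split(":"))
        return ("pairwise", first, last)
    if "," in field:                       # A,B, an inclusive range of structure IDs
        first, last = (int(x) for x in field.split(","))
        return ("range", first, last)
    value = int(field)                     # a single structure ID
    return ("range", value, value)


def expand_line(line):
    """Expand one mapping line into a list of (left_ids, right_ids) pairs."""
    left, right = line.split()
    lkind, lfirst, llast = parse_side(left)
    rkind, rfirst, rlast = parse_side(right)
    if lkind == "pairwise" or rkind == "pairwise":
        # A:B must be used in both columns, with ranges of equal length.
        if lkind != rkind or llast - lfirst != rlast - rfirst:
            raise ValueError("mismatched A:B ranges in line: %r" % line)
        return [([lfirst + i], [rfirst + i]) for i in range(llast - lfirst + 1)]
    left_ids = [] if lkind == "empty" else list(range(lfirst, llast + 1))
    right_ids = [] if rkind == "empty" else list(range(rfirst, rlast + 1))
    return [(left_ids, right_ids)]


# The sample mapping file from the text above.
sample = "0\t0\n1\t1,3\n-1\t4\n2,4\t5\n5\t-1\n6:8\t6:8"

alignment = [pair for line in sample.splitlines() for pair in expand_line(line)]
for left_ids, right_ids in alignment:
    print(left_ids, "->", right_ids)
```

Each printed line shows the structure IDs of the first corpus on the left and those of the second corpus on the right; an empty list stands for an empty alignment.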

Changes in corpus configuration for m:n mapping

First, you need to set ALIGNSTRUCT to your mapping structure (if it is not "align"), e.g.:

ALIGNSTRUCT "s"

Then you define which corpora are aligned with this corpus:

ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"

Finally, you provide a mapping definition file for each of these corpora:

ALIGNDEF "/path/to/mapping/file/for/aligned_corpus_id_1,/path/to/mapping/file/for/aligned_corpus_id_2"

ALIGNED and ALIGNDEF must contain the same number of comma-delimited items in the right order (the first item in ALIGNDEF is the definition file with mapping to the first corpus in ALIGNED etc.).
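Since a mismatch between the two lists is easy to introduce by hand, a quick check before compiling may help. The following is a minimal sketch that assumes the configuration is available as plain text with the options written on single lines in the quoted form shown above; the helper names read_option and pair_aligned_with_defs are hypothetical and not part of Sketch Engine.

```python
import re


def read_option(config_text, name):
    """Return the quoted value of option `name` (e.g. ALIGNED), or None."""
    pattern = r"^\s*" + re.escape(name) + r'\s+"([^"]*)"'
    match = re.search(pattern, config_text, re.MULTILINE)
    return match.group(1) if match else None


def pair_aligned_with_defs(config_text):
    """Pair each corpus in ALIGNED with its mapping file in ALIGNDEF, by position."""
    aligned = read_option(config_text, "ALIGNED") or ""
    aligndef = read_option(config_text, "ALIGNDEF") or ""
    corpora = [c for c in aligned.split(",") if c]
    files = [f for f in aligndef.split(",") if f]
    if len(corpora) != len(files):
        raise ValueError(
            "ALIGNED lists %d corpora but ALIGNDEF lists %d files"
            % (len(corpora), len(files))
        )
    return list(zip(corpora, files))


config = '''ALIGNSTRUCT "s"
ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"
ALIGNDEF "/path/to/mapping/file/for/aligned_corpus_id_1,/path/to/mapping/file/for/aligned_corpus_id_2"
'''

pairs = pair_aligned_with_defs(config)
for corpus, mapping_file in pairs:
    print(corpus, "<-", mapping_file)
```

The check only compares the number and order of the items; it does not verify that the listed files exist or that they describe the right corpus.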

Compilation of corpora with m:n mapping

If you set ALIGNED and ALIGNDEF properly, compilecorp will compile all necessary indices for you. Alternatively, you may manually compile the index for each pair of aligned corpora by running:

mkalign

where says where the new index is going to be built and should be set to the /align..

Helper scripts

The following helper scripts are attached to this page and you may find them useful when creating the alignment definition files:

  • calign.py

Takes a sample XML format (you may want to modify this part of the script), two encoded corpora and the name of the mapping structure attribute. It looks up the structure ID in both corpora according to the attribute values in the XML file and produces an alignment definition file. The output should be processed by fixgaps.py and compressrng.py.

  • transalign.py

Takes two alignment definition files L2-L1 and L3-L1 and computes a new one L2-L3. The output should be processed by fixgaps.py and compressrng.py.

  • fixgaps.py

Inserts empty alignment into an existing alignment file where gaps are found.

  • compressrng.py

Compresses subsequent empty alignments into one range. May significantly reduce the size of an alignment definition file.

The usual pipeline is:

calign.py | ./fixgaps.py | ./compressrng.py

or

transalign.py | ./fixgaps.py | ./compressrng.py

1:1 mapping

1:1 mapping is a type of alignment where all connected corpora have the same number of aligned structures. For example, if the aligned structure is a sentence, all corpora must have the same number of sentences.

Data preparation

A small example of two source vertical files that are suitable for processing as parallel corpora with the same number of sentences:

Corpus 1:

<s>
This
is
the
first
sentence
.
</s>
<s>
This
is
the
second
sentence
.
</s>

Corpus 2:

<s>
This
is
the
first
sentence
in
corpus
2
.
</s>
<s>
This
is
the
second
sentence
in
corpus
2
.
</s>

Then set ALIGNSTRUCT to "s" and compile the corpus. For more details, see the next section.
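Because 1:1 mapping requires the same number of aligning structures in every corpus, it can be worth counting them before compiling. The following is a minimal sketch, assuming the aligning structure is "s" and that each opening tag (such as <s>) stands on its own line in the vertical file; the function name count_structures and the inline sample data are illustrative only.

```python
import re


def count_structures(vertical_text, struct="s"):
    """Count opening tags of `struct` (e.g. <s> or <s id="1">) in vertical text."""
    pattern = re.compile(
        r"^<" + re.escape(struct) + r"(\s[^>]*)?>\s*$", re.MULTILINE
    )
    return sum(1 for _ in pattern.finditer(vertical_text))


# Two tiny vertical files, one token or tag per line (hypothetical sample data).
corpus1 = "<s>\nThis\nis\nthe\nfirst\nsentence\n.\n</s>\n<s>\nThis\nis\nthe\nsecond\nsentence\n.\n</s>\n"
corpus2 = "<s>\nThis\nis\nthe\nfirst\nsentence\nin\ncorpus\n2\n.\n</s>\n<s>\nThis\nis\nthe\nsecond\nsentence\nin\ncorpus\n2\n.\n</s>\n"

n1 = count_structures(corpus1)
n2 = count_structures(corpus2)
if n1 != n2:
    print("Not suitable for 1:1 mapping: %d vs. %d structures" % (n1, n2))
else:
    print("Both corpora contain %d aligning structures" % n1)
```

Equal counts are necessary for 1:1 mapping but do not prove that the corresponding structures are actual translations of each other.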

Changes in corpus configuration for 1:1 mapping

Two new lines need to be added to the corpus configuration file of each of the aligned corpora. The first one is the declaration of the align structure:

STRUCTURE   align

From Manatee 2.67 onwards, you can simply set ALIGNSTRUCT to an existing structure:

ALIGNSTRUCT "s"

The second line is the list of IDs of all corpora that are aligned with the corpus:

ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"

With this setting, the Sketch Engine will find the aligned corpora and will be able to display parallel results.

Defining 1:1 parallel corpora using the web interface

Parallel corpora with 1:1 mapping can also be set up using the web interface. Follow these instructions:

  1. After logging in to Sketch Engine, create two (or more) corpora containing the structure align (or another structure suitable for the alignment).
  2. Set the alignment: click on the corpus name (on the "My corpora" page), select "Configure corpus" in the sidebar, use the ALIGNED field to select the corpora to align with and save the form.
  3. If the alignment structure is not align, alter the configuration of the corpus: switch the corpus to expert mode, select "Configure corpus" in the sidebar, add ALIGNSTRUCT "structure" (use the actual structure name) and save the form.
  4. Repeat steps 2 and 3 for all aligned corpora in the set.
  5. All of these steps are performed automatically if the corpora are created by uploading source data in a TMX file.

Example

Parallel corpora in English, German and Spanish will be uploaded. The corpora will be aligned using structure align in the source data.

1. Create three corpora, one in each language.

2. Each corpus may consist of multiple files. Make sure the alphabetical order of the corresponding files is the same in all corpora, i.e. the first English file must correspond to the first German file and the first Spanish file, the second English file to the second German and Spanish files, etc. It is therefore safest to prefix file names with a number to avoid aligning unrelated segments.

files   English      German         Spanish
first   01_dog.txt   01_Hund.txt    01_perro.txt
second  02_care.txt  02_Pflege.txt  02_cuidado.txt
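The file correspondence can be checked locally before uploading. The following is a minimal sketch that pairs the files of each corpus by sorted order and verifies that paired files share the same numeric prefix. It assumes that Python's default string sorting matches the alphabetical order used at upload, which may not hold for all file names; the function names are hypothetical.

```python
import re


def numeric_prefix(name):
    """Return the leading digits of a file name, or None if there are none."""
    match = re.match(r"\d+", name)
    return match.group(0) if match else None


def pair_files(*file_lists):
    """Pair files across corpora by sorted order and check their numeric prefixes."""
    if len({len(files) for files in file_lists}) != 1:
        raise ValueError("the corpora contain different numbers of files")
    rows = list(zip(*(sorted(files) for files in file_lists)))
    for row in rows:
        prefixes = {numeric_prefix(name) for name in row}
        if None in prefixes or len(prefixes) != 1:
            raise ValueError("files do not correspond: %s" % ", ".join(row))
    return rows


english = ["02_care.txt", "01_dog.txt"]
german = ["01_Hund.txt", "02_Pflege.txt"]
spanish = ["01_perro.txt", "02_cuidado.txt"]

rows = pair_files(english, german, spanish)
for row in rows:
    print(" | ".join(row))
```

A numeric prefix only makes the ordering predictable; whether the paired files really contain corresponding text still has to be ensured when preparing the data.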

3. Make sure the source data contain the structure align to mark segments. No segment can be omitted. The order of segments matters: it must be the same in all aligned corpora. The structure must be added to the files before uploading them to the corpus.

You can also use alignment software such as hunalign. Manual correction of the output might be necessary.

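English – 01_dog.txt

<align>
I have a nice dog.
</align>
<align>
It runs a lot.
</align>

German – 01_Hund.txt

<align>
Ich habe einen schönen Hund.
</align>
<align>
Es läuft sehr viel.
</align>

Spanish – 01_perro.txt

<align>
Tengo un buen perro.
</align>
<align>
Corre mucho.
</align>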

4. Upload the source files into the corpora.

5. The corresponding align segments in data from all corpora will be automatically connected: the first together, the second together, etc.

6. Set the alignment – align each corpus to all other corpora in the set.

7. Recompile all three corpora.

8. Open any of the corpora in the corpus manager.

Concordance form: [screenshot]

Concordance result: [screenshot]


Attachment

Download: helper script for parallel corpora