1:1 mapping
(intermediate user)

Besides the basic procedure (which also produces corpora mapped 1:1), parallel corpora can be created from other sources, including vertical files. Sketch Engine supports both 1:1 and m:n mapping. Each language of a parallel corpus can be searched individually as a monolingual corpus or together with one or more aligned corpora (languages).

1:1 mapping

1:1 mapping is a type of alignment in which all connected corpora have exactly the same number of aligned structures, for example sentences or paragraphs.

Data preparation

An alignment structure must be present in the corpus. By default, corpora are aligned by the align structure. A different structure already present in the corpus (e.g. sentence or paragraph) can be set as the alignment structure with the ALIGNSTRUCT corpus attribute.

Here is an example of two source vertical files suitable for processing into parallel corpora. Each contains two sentences. 

Useful tip

For small parallel corpora in particular, it is recommended to convert the data, where possible, into a tabular format such as an Excel spreadsheet and to use the basic procedure. The basic procedure is fully automatic, less time-consuming and, most importantly, avoids many potential user-induced problems.

Corpus 1

 <s>
<align>
This
is
the
first
sentence
.
</align>
</s>
<s>
<align>
This
is
the
second
sentence
.
</align>
</s> 

Corpus 2

 <s>
<align>
This
is
the
first
sentence
in
corpus
2
.
</align>
</s>
<s>
<align>
This
is
the
second
sentence
in
corpus
2
.
</align>
</s> 

A continuous flowing text can also be uploaded provided the structures are present.

corpus 1
 <s><align>This is the first sentence.</align></s><s><align>This is the second sentence.</align></s> 
corpus 2
 <s><align>This is the first sentence in corpus 2.</align></s><s><align>This is the second sentence in corpus 2. </align></s> 

Important!

It is vital that each of the aligned corpora contains exactly the same number of aligned structures and that the structures appear in exactly the same order. The alignment mechanism does not employ any ‘linguistic intelligence’ and aligns segments in the order in which they appear.
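
Because the alignment is purely positional, a quick count before uploading can catch a mismatch early. The following is a minimal sketch and not part of Sketch Engine: the function names `count_structures` and `check_same_counts` are hypothetical, and it assumes the alignment structure appears as plain opening tags such as `<align>`, possibly with attributes.

```python
import re


def count_structures(text, structure="align"):
    """Count opening tags of the given structure in a source text."""
    # Matches <align> or <align attr="..."> but not </align> or <aligned>.
    pattern = re.compile(r"<" + re.escape(structure) + r"(?:\s[^>]*)?>")
    return len(pattern.findall(text))


def check_same_counts(texts, structure="align"):
    """Return (ok, counts); ok is True when every text has the same count."""
    counts = {name: count_structures(text, structure) for name, text in texts.items()}
    return len(set(counts.values())) <= 1, counts


corpus_1 = (
    "<s><align>This is the first sentence.</align></s>"
    "<s><align>This is the second sentence.</align></s>"
)
# Deliberately incomplete: the second sentence is missing.
corpus_2 = "<s><align>This is the first sentence in corpus 2.</align></s>"

ok, counts = check_same_counts({"corpus 1": corpus_1, "corpus 2": corpus_2})
print(ok, counts)
```

A matching count does not prove that the segments correspond to each other, only that the positional pairing will not run out of segments on one side.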

Using the web interface to create a parallel corpus

  1. log in to Sketch Engine
  2. create two (or more) corpora and make sure all of them contain the same alignment structure, e.g. <align>
  3. set the alignment
    select the corpus, click Manage corpus, then Configure corpus in the sidebar, tick all corpora which should be aligned and save
  4. repeat step 3 for all aligned corpora in the set

If the alignment structure is not <align>, edit the corpus configuration like this:

  • select the corpus, click Manage corpus, then turn on the expert mode
  • add the following line into the corpus configuration file
    ALIGNSTRUCT "structure"
    (use the actual structure name) and save the form.

Example

Parallel corpora in English, German and Spanish will be uploaded. The corpora will be aligned using the <align> structure in the source data.

1. Create three corpora, one in each language.

2. If each corpus consists of multiple files, make sure the alphabetical order of the corresponding files is the same in all corpora, i.e. the first English file must correspond to the first German file and the first Spanish file, the second English file to the second German and Spanish files, etc. It may be practical to prefix the file names with a number to avoid aligning incorrect segments.

files    English       German          Spanish
first    01_dog.txt    01_Hund.txt     01_perro.txt
second   02_care.txt   02_Pflege.txt   02_cuidado.txt
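
The file naming convention above can be checked with a short script before uploading. This is a sketch under stated assumptions: the helper names `prefix` and `check_file_order` are hypothetical, every file name is assumed to start with a numeric prefix followed by an underscore, and Python's plain `sorted` is used as a stand-in for the alphabetical order mentioned in step 2.

```python
def prefix(name):
    """Return the part of a file name before the first underscore."""
    return name.split("_", 1)[0]


def check_file_order(corpora):
    """Check that files line up by prefix when each list is sorted.

    corpora maps a language name to its list of file names.
    Returns a list of problems; an empty list means the files pair up.
    """
    problems = []
    sorted_files = {lang: sorted(files) for lang, files in corpora.items()}
    if len({len(files) for files in sorted_files.values()}) > 1:
        problems.append("corpora contain different numbers of files")
    # Compare the files that end up at the same position in every corpus.
    for position, group in enumerate(zip(*sorted_files.values()), start=1):
        if len({prefix(name) for name in group}) > 1:
            problems.append(f"position {position}: prefixes differ in {group}")
    return problems


files = {
    "English": ["02_care.txt", "01_dog.txt"],
    "German": ["01_Hund.txt", "02_Pflege.txt"],
    "Spanish": ["01_perro.txt", "02_cuidado.txt"],
}
print(check_file_order(files))
```

With zero-padded prefixes such as 01 and 02, the sorted order no longer depends on the language-specific part of the name, which is the reason for the numbering advice above.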

3. Make sure the source data contain the align structure to mark segments. No segment can be omitted. The order of segments must be the same in all aligned corpora. The structure must be added to the files before uploading them.

You can also use alignment software such as hunalign. Manual correction of the output might be necessary.

English – 01_dog.txt

 <align>
I have a nice dog.
</align>
<align>
It runs a lot.
</align> 

German – 01_Hund.txt

 <align>
Ich habe einen schönen Hund.
</align>
<align>
Es läuft sehr viel.
</align> 

Spanish – 01_perro.txt

 <align>
Tengo un buen perro.
</align>
<align>
Corre mucho.
</align> 
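
If the segments are already available as a list (for example one segment per line of a text file), the align structure can be added with a few lines of code. The sketch below is only an illustration: `wrap_segments` is a hypothetical helper, and the output layout simply imitates the sample files shown above.

```python
def wrap_segments(segments, structure="align"):
    """Wrap each segment in opening and closing structure tags, one per line."""
    lines = []
    for segment in segments:
        lines.append(f"<{structure}>")
        lines.append(segment.strip())
        lines.append(f"</{structure}>")
    return "\n".join(lines) + "\n"


english = ["I have a nice dog.", "It runs a lot."]
print(wrap_segments(english), end="")
```

Running the same function over the German and Spanish segment lists, kept in the same order, produces files with matching numbers of align structures.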

4. Upload the source files into the corpora.

5. The corresponding align segments in data from all corpora will be automatically connected: the first together, the second together, etc.
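
The positional pairing described in this step can be imitated offline to preview which segments will end up together. This is a sketch only and does not reproduce Sketch Engine internals: the `segments` helper is hypothetical and assumes attribute-free `<align>` tags as in the sample files.

```python
import re


def segments(text, structure="align"):
    """Return the text content of each structure, in document order."""
    pattern = re.compile(rf"<{structure}>(.*?)</{structure}>", re.DOTALL)
    return [match.strip() for match in pattern.findall(text)]


english = "<align>\nI have a nice dog.\n</align>\n<align>\nIt runs a lot.\n</align>\n"
german = "<align>\nIch habe einen schönen Hund.\n</align>\n<align>\nEs läuft sehr viel.\n</align>\n"
spanish = "<align>\nTengo un buen perro.\n</align>\n<align>\nCorre mucho.\n</align>\n"

# Segments are paired purely by position: first with first, second with second.
rows = list(zip(segments(english), segments(german), segments(spanish)))
for row in rows:
    print(row)
```

Reading through the printed rows is a cheap way to spot a shifted or missing segment before the corpora are compiled.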

6. Set the alignment – align each corpus to all other corpora in the set. (Manage corpus – Configure corpus)

7. Recompile all three corpora.

8. Open any of the corpora. The search form will offer the aligned corpora. Select one or more.

[Screenshot: concordance form]

[Screenshot: concordance result]


Attachment

Download: helper script for parallel corpora

Defining aligned corpora via the configuration file

Apart from the user interface, aligned corpora can also be defined via the configuration file. Two new lines must be added to the corpus configuration file of each of the aligned corpora.

Line 1 is the declaration of the align structure:

STRUCTURE align

Since manatee 2.67, an existing structure can be set as the alignment structure using the ALIGNSTRUCT attribute:

ALIGNSTRUCT "s"

Line 2 is the list of IDs of all corpora that are aligned with the corpus:

ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"

With this setting, Sketch Engine will identify the aligned corpora.
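
Since every corpus in the set needs its own pair of lines listing all the other corpora, generating them can reduce copy-and-paste errors. The following is a minimal sketch, not an official tool: `alignment_config` is a hypothetical helper, the corpus IDs `parallel_en`, `parallel_de` and `parallel_es` are made up for illustration, and the emitted lines follow only the directive forms shown above.

```python
def alignment_config(corpus_id, all_ids, alignstruct=None):
    """Build the two alignment lines for one corpus configuration file."""
    # A corpus is aligned with every other corpus in the set, not with itself.
    others = ",".join(other for other in all_ids if other != corpus_id)
    if alignstruct is None:
        first = "STRUCTURE align"
    else:
        # Assumes the named structure already exists in the corpus.
        first = f'ALIGNSTRUCT "{alignstruct}"'
    return first + "\n" + f'ALIGNED "{others}"'


ids = ["parallel_en", "parallel_de", "parallel_es"]  # hypothetical corpus IDs
for corpus_id in ids:
    print(f"# lines for {corpus_id}")
    print(alignment_config(corpus_id, ids))
```

Passing `alignstruct="s"` yields the ALIGNSTRUCT variant for corpora that are aligned by an existing structure rather than by align.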