Parallel corpus from tabular data

(basic user)

The simplest way to create a parallel corpus is to upload data in a tabular format such as a spreadsheet (Excel), TMX, XML, XLIFF or other similar formats.

Spreadsheet format requirements

Spreadsheets must contain language names in the first row and then aligned segments (e.g. sentences) side by side. Each column with data is treated as a different language, so a spreadsheet for 2 languages must contain exactly 2 columns of data. All other columns must be empty!
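
The layout described above can be sketched in Python by writing a small tab-separated file. This is only an illustration: the language names, the sample segments and the helper name write_parallel_tsv are assumptions, and the exact TSV dialect expected by the upload form is not confirmed here.

```python
import csv
import io

def write_parallel_tsv(languages, segments, stream):
    """Write a header row of language names, then one aligned segment per row."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(languages)
    for row in segments:
        # Every row needs exactly one cell per language and no extra columns.
        if len(row) != len(languages):
            raise ValueError(f"expected {len(languages)} cells, got {len(row)}: {row!r}")
        writer.writerow(row)

# Hypothetical sample data, invented for illustration.
languages = ["English", "German"]
segments = [
    ("The dog barks.", "Der Hund bellt."),
    ("Dogs need care.", "Hunde brauchen Pflege."),
]

buffer = io.StringIO()
write_parallel_tsv(languages, segments, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To produce a file for upload, the same function can be called with a file opened with open("corpus.tsv", "w", newline="", encoding="utf-8") in place of the in-memory buffer.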

Follow these steps

  • log in to Sketch Engine
  • click Upload TMX or XLS
    other supported formats: XLIFF, XML, TSV, TAB, xlsx
    (if xlsx does not upload correctly, try opening the file in Excel and save as Excel 97-2003 Workbook)
  • type the corpus name and choose the file
  • on the following screen, check the languages were identified correctly
  • click Create

Each language in the source file will be processed into a separate monolingual corpus and aligned with the corresponding corpus in the other language(s).

Searching

To search the corpus as a parallel corpus, first select the corpus in the language that should appear on the left and then, when setting the search criteria, select the other language(s). Multiple languages can be selected to display a multilingual concordance.

1:1 mapping

(intermediate user)

In addition to the basic procedure (which also produces corpora mapped 1:1), parallel corpora can also be created from other sources including vertical files. Sketch Engine supports both 1:1 and m:n mapping. Each language of a parallel corpus can be searched individually as a monolingual corpus or as aligned to one or more corpora (languages).

1:1 mapping

1:1 mapping is a type of alignment where all aligned corpora have exactly the same number of aligned structures, typically sentences or paragraphs. Each sentence in one corpus has a matching sentence in the other corpus.

Data preparation

It is a requirement that an alignment structure is present in the corpus. By default, the corpora will be aligned by the align structure. A different alignment structure already present in the corpus (e.g. sentence or paragraph) can be set with the ALIGNSTRUCT corpus attribute.

Here is an example of two source vertical files suitable for processing into parallel corpora. Each contains two sentences. 

Corpus 1

Corpus 2

A continuous flowing text can also be uploaded provided the structures are present.

Corpus 1

Corpus 2
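
As a rough illustration of such source data, the sketch below renders tokenised sentences as vertical text (one token per line) with each sentence wrapped in the align structure. The sentences, the helper name to_vertical and the variable names are assumptions made for illustration only.

```python
def to_vertical(sentences, struct="align"):
    """Render tokenised sentences as vertical text, one token per line."""
    lines = []
    for tokens in sentences:
        lines.append(f"<{struct}>")   # opening tag of the alignment structure
        lines.extend(tokens)          # one token per line
        lines.append(f"</{struct}>")  # closing tag
    return "\n".join(lines) + "\n"

# Hypothetical sample sentences, two per corpus.
english = [["The", "dog", "barks", "."], ["It", "is", "loud", "."]]
german = [["Der", "Hund", "bellt", "."], ["Er", "ist", "laut", "."]]

corpus_1 = to_vertical(english)
corpus_2 = to_vertical(german)
print(corpus_1)
print(corpus_2)
```

Both outputs contain the same number of align structures in the same order, which is what 1:1 mapping requires.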

Using the web interface to create a parallel corpus.

  1. log in to Sketch Engine
  2. create two (or more) corpora and make sure all of them contain the same alignment structure, e.g. <align>
  3. set the alignment
    select the corpus, click Manage corpus, then Configure corpus in the sidebar, tick all corpora which should be aligned and save
  4. repeat step 3 for all aligned corpora in the set

If the alignment structure is not <align>, edit the corpus configuration like this:

  • select the corpus, click Manage corpus, then turn on the expert mode
  • add the following line into the corpus configuration file
    ALIGNSTRUCT "structure"
    (use the actual structure name) and save the form.

Example

Parallel corpora in English, German and Spanish will be uploaded. The corpora will be aligned using the structure <align> in the source data.

1. Create three corpora, one in each language.

2. If each corpus consists of multiple files, make sure the alphabetical order of the corresponding files is the same in all corpora, i.e. the first English file must correspond to the first German file and the first Spanish file, the second file to the second files, etc. It may be practical to prefix the file names with a number to avoid aligning incorrect segments.

files   English      German         Spanish
first   01_dog.txt   01_Hund.txt    01_perro.txt
second  02_care.txt  02_Pflege.txt  02_cuidado.txt
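
A quick way to check the file order before uploading is sketched below. It assumes that Python's default string sorting matches the alphabetical order used during upload, which is not confirmed here, and the helper name pair_files is hypothetical.

```python
def pair_files(*file_lists):
    """Sort each corpus's file names and group them by position."""
    sorted_lists = [sorted(names) for names in file_lists]
    lengths = {len(names) for names in sorted_lists}
    if len(lengths) != 1:
        raise ValueError("corpora contain different numbers of files")
    return list(zip(*sorted_lists))

# File names from the table above, deliberately listed out of order.
english = ["02_care.txt", "01_dog.txt"]
german = ["01_Hund.txt", "02_Pflege.txt"]
spanish = ["02_cuidado.txt", "01_perro.txt"]

pairs = pair_files(english, german, spanish)
for group in pairs:
    prefixes = {name.split("_", 1)[0] for name in group}
    # With numeric prefixes, every group should share a single prefix.
    status = "ok" if len(prefixes) == 1 else "MISMATCH"
    print(group, status)
```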

3. Make sure the source data contain the structure align to mark segments. No segment can be omitted. The order of segments must be the same in all aligned corpora. The structure must be added to the files before uploading them.

You can also use alignment software such as hunalign. Manual correction of the output might be necessary.

English – 01_dog.txt

German – 01_Hund.txt

Spanish – 01_perro.txt
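
Because no segment may be omitted, it can help to count the align structures in each file before uploading. The following is a minimal sketch under the assumption that every opening tag sits on its own line, as in vertical files; the sample texts and the helper name count_structures are invented for illustration.

```python
import re

def count_structures(vertical_text, struct="align"):
    """Count opening tags of the given structure, with or without attributes."""
    pattern = re.compile(rf"^<{re.escape(struct)}(?:\s[^>]*)?>\s*$", re.MULTILINE)
    return sum(1 for _ in pattern.finditer(vertical_text))

# Hypothetical file contents: the German text is missing one segment.
english = "<align>\nThe\ndog\nbarks\n.\n</align>\n<align>\nIt\nis\nloud\n.\n</align>\n"
german = "<align>\nDer\nHund\nbellt\n.\n</align>\n"

counts = {
    "English": count_structures(english),
    "German": count_structures(german),
}
consistent = len(set(counts.values())) == 1
print(counts, "consistent" if consistent else "segment counts differ")
```

Equal counts do not prove that the segments correspond, but unequal counts reliably signal a problem for 1:1 mapping.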

4. Upload the source files into the corpora.

5. The corresponding align segments in data from all corpora will be automatically connected: the first together, the second together, etc.

6. Set the alignment – align each corpus to all other corpora in the set. (Manage corpus - Configure corpus)

7. Recompile all three corpora.

8. Open any of the corpora. The search form will offer the aligned corpora. Select one or more.

[Screenshot: concordance form]

[Screenshot: concordance result]

Attachment

Download: helper script for parallel corpora

Defining aligned corpora via the configuration file

Apart from the user interface, aligned corpora can also be defined via the configuration file. Two new lines must be added to the corpus configuration file of each of the aligned corpora.

Line 1 is the declaration of the align structure:

STRUCTURE align

Since Manatee 2.67, an existing structure can be set as the alignment structure using the ALIGNSTRUCT attribute:

ALIGNSTRUCT "s"

Line 2 is the list of IDs of all corpora that are aligned with the corpus:

ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"

With this setting, Sketch Engine will identify the aligned corpora.
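
When several corpora are aligned with each other, the two lines have to be written for every corpus in the set. The sketch below generates them, assuming each corpus lists all other corpora in ALIGNED; the corpus IDs and the helper name alignment_config are hypothetical.

```python
def alignment_config(corpus_ids, struct="align"):
    """Return, per corpus, the configuration lines naming all other corpora."""
    config = {}
    for corpus_id in corpus_ids:
        others = [other for other in corpus_ids if other != corpus_id]
        joined = ",".join(others)
        if struct == "align":
            first_line = "STRUCTURE align"           # declare the default structure
        else:
            first_line = f'ALIGNSTRUCT "{struct}"'   # reuse an existing structure
        config[corpus_id] = [first_line, f'ALIGNED "{joined}"']
    return config

# Hypothetical corpus IDs for an English, German and Spanish set.
config = alignment_config(["dog_en", "dog_de", "dog_es"])
for corpus_id, lines in config.items():
    print(corpus_id)
    for line in lines:
        print("  " + line)
```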

m:n mapping

since manatee 2.67

Sketch Engine includes advanced support for parallel corpora. In previous versions, the alignment had to be strictly 1:1 and the name of the aligning structure was fixed to align, i.e. the same number of align tags was needed in each of the aligned corpora. Starting with Manatee version 2.67, m:n (incl. m:0) alignment is supported. The name of the alignment structure is defined in the ALIGNSTRUCT corpus attribute and defaults to align.

Data preparation

To use the m:n mapping, a file with mapping definition for each pair of corpora has to be prepared. The file consists of two tab-separated columns, each containing one of the following:

  • two non-negative integers A, B separated by a comma
    denoting a range of the aligning structure IDs from A (inclusive) to B (inclusive).
  • one non-negative integer A
    denoting a single structure ID A
  • -1
    denoting an empty alignment
  • two non-negative integers A:B separated by a colon
    denoting a range of sentences which are aligned 1:1 with the corresponding range in the second column.

Structure ID refers to the order in which structures appeared in the source vertical file and have been indexed. To get the structure ID, e.g. from a unique attribute of the structure, the str2id() method of the PosAttr object can be used, as is illustrated below for the corpus "test", aligning structure "s" and its attribute "id":

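import manatee

corpus = manatee.Corpus("test")      # open the corpus "test"
attr = corpus.get_attr("s.id")       # attribute "id" of the structure "s"
# pass the attribute value of the structure whose ID is needed
struct_id = attr.str2id("")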

A sample mapping file may look as follows:

0    0
1    1,3
-1   4
2,4  5
5    -1
6:8  6:8

This file says that:

  • the first structure (id 0) of the first corpus is mapped to the first structure of the second corpus
  • the second structure (id 1) of the first corpus is mapped to the second to fourth structures (ids 1 to 3) of the second corpus
  • the fifth structure (id 4) of the second corpus is not mapped to anything
  • the third to fifth structures (ids 2 to 4) of the first corpus are mapped to the sixth structure (id 5) of the second corpus
  • the sixth structure (id 5) of the first corpus is not aligned to anything
  • the seventh to ninth structures (ids 6 to 8) are aligned 1:1: 7th to 7th, 8th to 8th and 9th to 9th

Also note that all structures in both corpora must be covered by the mapping.
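
The format and the coverage requirement can be checked with a short script before compiling. This is a minimal sketch based only on the format described above, not the parser Manatee itself uses; the function names are hypothetical and columns are split on any whitespace for simplicity.

```python
def parse_cell(cell):
    """Return the list of structure IDs named by one mapping cell."""
    cell = cell.strip()
    if cell == "-1":
        return []                      # empty alignment
    for sep in (",", ":"):
        if sep in cell:
            start, end = (int(part) for part in cell.split(sep))
            return list(range(start, end + 1))   # both ends inclusive
    return [int(cell)]                 # single structure ID

def parse_mapping(text):
    """Parse mapping lines into (left_ids, right_ids) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        left, right = line.split()
        if ":" in left or ":" in right:
            # A colon range aligns its members 1:1 with the other column.
            left_ids, right_ids = parse_cell(left), parse_cell(right)
            if len(left_ids) != len(right_ids):
                raise ValueError(f"1:1 ranges differ in length: {line!r}")
            pairs.extend(([a], [b]) for a, b in zip(left_ids, right_ids))
        else:
            pairs.append((parse_cell(left), parse_cell(right)))
    return pairs

def uncovered(pairs, side, total):
    """Return structure IDs below `total` that no mapping line mentions."""
    seen = {i for pair in pairs for i in pair[side]}
    return sorted(set(range(total)) - seen)

# The sample mapping file from above; both corpora have 9 structures.
sample = "0\t0\n1\t1,3\n-1\t4\n2,4\t5\n5\t-1\n6:8\t6:8\n"
pairs = parse_mapping(sample)
print(pairs)
print("uncovered in corpus 1:", uncovered(pairs, 0, 9))
print("uncovered in corpus 2:", uncovered(pairs, 1, 9))
```

An empty list from uncovered() for both sides means every structure is mentioned by the mapping. The total number of structures has to be supplied by the caller.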

Changes in corpus configuration for m:n mapping

First, you need to set ALIGNSTRUCT to your mapping structure (if it is not "align"), e.g.:

ALIGNSTRUCT "s"

Then you define which corpora are aligned with this corpus:

ALIGNED "aligned_corpus_id_1,aligned_corpus_id_2"

Finally, you provide a mapping definition file for each of the aligned corpora:

ALIGNDEF "/path/to/mapping/file/for/aligned_corpus_id_1,/path/to/mapping/file/for/aligned_corpus_id_2"

ALIGNED and ALIGNDEF must contain the same number of comma-delimited items in the right order (the first item in ALIGNDEF is the definition file with mapping to the first corpus in ALIGNED etc.).
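
The count and order requirement can be verified by pairing the two values by position, as in the following sketch. The helper name pair_aligned_with_defs is hypothetical and the values are the placeholders used above.

```python
def pair_aligned_with_defs(aligned, aligndef):
    """Pair each aligned corpus ID with its mapping file, by position."""
    corpus_ids = [item.strip() for item in aligned.split(",")]
    def_files = [item.strip() for item in aligndef.split(",")]
    if len(corpus_ids) != len(def_files):
        raise ValueError(
            f"ALIGNED has {len(corpus_ids)} items but ALIGNDEF has {len(def_files)}"
        )
    return dict(zip(corpus_ids, def_files))

aligned = "aligned_corpus_id_1,aligned_corpus_id_2"
aligndef = (
    "/path/to/mapping/file/for/aligned_corpus_id_1,"
    "/path/to/mapping/file/for/aligned_corpus_id_2"
)
pairing = pair_aligned_with_defs(aligned, aligndef)
for corpus_id, def_file in pairing.items():
    print(corpus_id, "->", def_file)
```

Printing the pairs makes it easy to spot a definition file listed against the wrong corpus, which a simple count check would not catch.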

Compilation of corpora with m:n mapping

If you set ALIGNED and ALIGNDEF properly, compilecorp will compile all necessary indices for you. Alternatively, you may manually compile the index for each pair of aligned corpora by running:

where

says where the new index is going to be built and should be set to the

Helper scripts

These helper scripts can be useful when creating the alignment definition files.