Bilingual term extraction is an extension of term extraction. It is available through the word list link in the main menu.

Data requirements

Parallel texts aligned at the paragraph or sentence level are needed. Upload your translation memory as a TMX file and Sketch Engine will process it automatically and convert it into aligned corpora. Aligned texts in more than two languages can also be uploaded.
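To check what a TMX file contains before uploading it, the aligned segments can be read with a few lines of Python. This is only a minimal sketch for inspecting your own data, not part of Sketch Engine: the function name read_tmx_segments and the sample content are made up for illustration, and the code assumes a standard TMX layout in which each tu element holds one tuv element per language.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute belongs to the XML namespace; ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_segments(tmx_text):
    """Return one dict per translation unit, mapping language code -> segment text.

    Hypothetical helper for illustration; assumes <tu>/<tuv>/<seg> nesting as in TMX 1.4.
    """
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions used a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup elements.
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units


# Invented sample data, not taken from any real translation memory.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The contract enters into force.</seg></tuv>
      <tuv xml:lang="es"><seg>El contrato entra en vigor.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Member states shall apply the rules.</seg></tuv>
      <tuv xml:lang="es"><seg>Los Estados miembros aplicarán las normas.</seg></tuv>
    </tu>
  </body>
</tmx>"""

units = read_tmx_segments(SAMPLE_TMX)
for unit in units:
    print(unit)
```

Each dictionary corresponds to one aligned segment. A unit with more than two keys indicates a multilingual TMX file.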

Details about the extraction process

The bilingual extraction of terms is a two-step process. First, terms are extracted within each language and lists of candidate terms are produced. In the second step, Sketch Engine looks for candidate pairs which tend to appear in the same aligned segments. The resulting list of candidate pairs (terms in two languages) is then presented to users.
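The second step can be illustrated with a small sketch that counts how often a term from one language and a term from the other appear in the same aligned segment. This is a simplified illustration under stated assumptions, not Sketch Engine's actual implementation: it assumes the first step has already produced the candidate terms found in each segment, and the function name count_candidate_pairs and the sample data are hypothetical.

```python
from collections import Counter
from itertools import product


def count_candidate_pairs(aligned_segments):
    """Count co-occurrences of candidate terms across aligned segments.

    aligned_segments: iterable of (source_terms, target_terms) tuples, one per
    aligned segment. Hypothetical input format chosen for this illustration.
    """
    pair_freq = Counter()
    for source_terms, target_terms in aligned_segments:
        # Use sets so that each pair is counted at most once per segment.
        for pair in product(set(source_terms), set(target_terms)):
            pair_freq[pair] += 1
    return pair_freq


# Invented sample: candidate terms already extracted per segment (step one).
segments = [
    (["fishing quota", "member state"], ["cuota de pesca", "estado miembro"]),
    (["fishing quota"], ["cuota de pesca"]),
    (["member state"], ["estado miembro"]),
]

pairs = count_candidate_pairs(segments)
for (source_term, target_term), freq in pairs.most_common():
    print(freq, source_term, target_term)
```

Pairs that are true translations of each other tend to accumulate higher counts than accidental pairings, which is why co-occurrence in aligned segments is a useful signal.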

Extracting terminology step by step

  • Click Home, then Upload TMX, and upload your file.
  • Your uploaded TMX file will appear as two separate corpora, one corpus for each language.
  • On the corpus selection screen, start by selecting either corpus (language).
  • Click Word list in the left menu.
  • The other language (or a list of languages in the case of a multilingual TMX) will be listed at the end of the main menu.
  • Clicking a language will launch the bilingual term extraction process.
  • Once the terms are extracted, use the Save as TBX or Save as TXT item in the left menu to export them for further editing and/or import into a CAT tool or a terminology management system.


For an example, see the DGT English-Spanish results (the link opens the result screen in the Sketch Engine interface and works only for registered users).


Notes on sorting

From experience with data of this size (DGT has 74 million tokens), sorting candidates by co-occurrence frequency yields better results. The sorting can be changed in the interface by clicking the column headers. The results are a good starting point when preparing a translation term base from scratch.

Another example is a small English-French corpus of UNICEF-related texts. Here, sorting the extracted terms by logDice (a co-occurrence statistic) works better than in the previous example. Registered users can view the whole example.
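For reference, logDice is commonly defined as 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of the two items and f_x and f_y are their individual frequencies. The sketch below computes it for a candidate pair. It assumes the frequencies are counts of aligned segments, which is an assumption made for this illustration; how Sketch Engine counts them internally is not stated here. The function name log_dice is hypothetical.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """Compute logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: number of aligned segments containing both terms (assumed unit of counting)
    f_x, f_y: number of segments containing each term on its own side
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Two terms that always occur together reach the theoretical maximum of 14.
perfect_pair = log_dice(8, 8, 8)

# Two terms that share only a quarter of their occurrences score two points lower.
weak_pair = log_dice(2, 8, 8)

print(perfect_pair, weak_pair)
```

Because the score depends on the ratio of frequencies rather than on their absolute size, it can rank a rare but consistent pair above a frequent but loosely associated one, unlike sorting by raw co-occurrence frequency.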



Bilingual Terminology Extraction in Sketch Engine. Vít Baisa, Barbora Ulipová, and Michal Cukr. In Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, the Czech Republic, December 2015, pp. 61–67.