Keywords and terms

The Keywords and terms tool extracts the terminology used in your corpus. Even a translation memory or a term base can easily be imported into Sketch Engine and processed as a corpus for term extraction.

A translation memory uploaded as TMX can be used for bilingual term extraction, which returns a bilingual term list with matching translations.

Term extraction is also available in preloaded corpora via the Word list feature.

Why is it useful?

A readily available term base is vital for translators, interpreters and terminologists.

It is vital for translators to build their term bases to maintain consistency in all of their translations.

Interpreters can easily generate a list of terminology from a domain they are not familiar with and prepare for their next interpreting assignment. This is possible even if the interpreter does not have any materials from the domain or subject area at their disposal. Sketch Engine will find relevant texts on the internet, create a specialised corpus and extract the terminology.

Terminologists who maintain term bases for their translation agencies, corporate clients or their companies can easily extract terminology from the available text resources such as documentation or user manuals. Bilingual lists of terms can be extracted from aligned bilingual material such as a translation memory.

How does it work?

Sketch Engine evaluates lexical units in your corpus, translation memory or term base and decides whether they are specific to the topic of the corpus or whether they are general words or expressions found in texts from many subject areas. Sketch Engine then produces a list of keywords (single-word expressions) and terms (multi-word expressions) with links to a concordance for each item and links to relevant Wikipedia articles. Finally, both keywords and terms can be exported as CSV or TBX for import into your CAT tool.

How to extract terminology

The following procedure can only be used with user corpora. Start by creating your specialised corpus from the web, from your translation memory or by uploading files.

With your corpus ready, click Home (1) and locate the corpus in the My own section (2). Click the wrench button (3) to manage the corpus.

Access your user corpora for term extraction

Extract one-word and multi-word terms from a corpus

Click the Keywords and terms option in the left menu.

The process starts immediately and usually takes a few seconds. It might take a few minutes for very large corpora.

The output will show a list of keywords and terms together with tick boxes, Wikipedia links and frequency counts. Click the frequency to view examples of the terminology in context.

Download

You can download the result of terminology extraction in TBX (TermBase eXchange) or CSV format and import it into your CAT tool or other software such as Excel, Google Sheets or Calc (OpenOffice).

What do the black and green terms mean?

Your results may show some terms and keywords in green to flag up those that were used as seed words when downloading relevant texts from the web using the WebBootCaT tool. This is useful to know if you want to use your result screen as a source for seed words for expanding your corpus using another WebBootCaT procedure.

Term links to Wikipedia

Every word has a link to the five most relevant pages on Wikipedia.


Extraction options

The extraction options are available only after the first extraction has been made. Click the Change extraction options link above the extracted terms to access the settings.

Change term extraction options

(1) this reference corpus for single-word terms serves as an example of general text to compare the specialised user corpus to. The corpus selected by default is the recommended option (for experts: the reference corpus must have the same term grammar as your corpus).

(2) determines to what extent different word forms of the same word should be merged together. Available options: lc, lemma_lc, lemma, word form

(3) use a higher number to include more high-frequency, and therefore more general, words in the list (see more info about simple math)

(4) leaving this unchecked is recommended; if checked, it excludes high-frequency grammar words from terms, but terminology often contains phrases that include these words, so it might be counter-productive

(5) excludes tokens containing characters that are neither letters nor digits, e.g. exclamation marks; leaving this unchecked is recommended, otherwise product designations might be excluded

(6) when checked, tokens consisting of digits only are excluded, e.g. phone numbers

(7) sets the minimum frequency for a term to be included, only change if the extraction produces unwanted results

(8) limits the number of extracted keywords, useful especially if (3) is set to a higher value

(9) analogous to (1) but for multi-word terms; only change the pre-set corpus if you have a good reason for doing so

(10) limits the number of extracted terms (multi-word), useful especially if (3) is set to a higher value
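
The effect of options (5) to (8) and (10) can be pictured with a short Python sketch. This is only an illustration of the filtering logic described above, not Sketch Engine's implementation: the function name apply_filters, its parameters and the sample candidates are all hypothetical, and the input is assumed to be a list that has already been ranked.

```python
def apply_filters(ranked, min_freq=1, alnum_only=False,
                  drop_numbers=False, max_items=None):
    """Filter (token, frequency) pairs that are already sorted by score.

    All parameter names are hypothetical stand-ins for the form options:
    min_freq ~ (7), alnum_only ~ (5), drop_numbers ~ (6), max_items ~ (8)/(10).
    """
    kept = []
    for token, freq in ranked:
        if freq < min_freq:
            continue  # below the minimum frequency
        if alnum_only and not token.isalnum():
            continue  # contains a character that is neither letter nor digit
        if drop_numbers and token.isdigit():
            continue  # consists of digits only, e.g. a phone number
        kept.append(token)
    # Cap the list length only when a limit was given.
    return kept if max_items is None else kept[:max_items]


# Made-up candidates, listed in ranked order.
candidates = [("router", 12), ("x-200", 5), ("5551234", 3), ("wow!", 1)]

print(apply_filters(candidates, alnum_only=True))    # drops "x-200" and "wow!"
print(apply_filters(candidates, drop_numbers=True))  # drops "5551234"
```

Note how the alnum_only filter removes the product-like designation "x-200", which is the reason option (5) is best left unchecked.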

Extraction option

You can tune the result of terminology extraction in the Change extraction options form at the bottom of the same page (below the result). It is possible to choose a reference corpus (it must have the same term grammar) for keywords and for terms. The size of the reference corpus influences the time needed for processing. The corpus attribute specifies the attribute used for searching (word, lemma, …); the default setting is the word in lowercase (ABC and abc are treated the same). We recommend that you use this default option. Simple math accentuates low-frequency keywords if set to a low value, whereas a higher value gives you higher-frequency keywords. The default value is 1; see more info about simple math.


How to import CSV into your software

How do I import a CSV file into Excel/Calc so that the data is separated into columns?

Open your CSV file in Excel, click the Data tab and then click Text to Columns. The Text Import Wizard opens; choose Delimited, click Next and select the delimiter, which in this case is Tab. The data preview below should show the data separated into columns. Finally, click the Finish button. See the Text to Columns Wizard video tutorial or Import a text file by connecting (both on the official Office website). If you use LibreOffice, see the Importing CSV Files and Text to Columns pages.
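
If you prefer to process the export in a script rather than a spreadsheet, the same tab-delimited file can be read with Python's standard csv module. The following is a minimal sketch: the function name read_tab_delimited and the sample columns are made up for illustration, and the real export may have a different column layout.

```python
import csv
import io


def read_tab_delimited(handle):
    """Return the non-empty rows of a tab-delimited file as lists of strings."""
    return [row for row in csv.reader(handle, delimiter="\t") if row]


# Hypothetical sample standing in for an exported file; the actual
# column layout of the export is an assumption here.
sample = "item\tfrequency\nterm base\t42\ntranslation memory\t17\n"

rows = read_tab_delimited(io.StringIO(sample))
for row in rows:
    print(row)
```

To read a real file, open it with open(path, newline="", encoding="utf-8") and pass the file object to read_tab_delimited instead of the StringIO sample.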

Details about term extraction

Keywords and terms are sorted by score, which depends on the keyness score (see the simple math).
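
As an illustration of how the simple math setting shifts the ranking, here is a minimal Python sketch. It assumes the commonly described form of the simple math score, (frequency per million in your corpus + N) / (frequency per million in the reference corpus + N); the function names and all counts below are made up, so treat this as a sketch rather than the exact computation performed by Sketch Engine.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Simple math score under the assumed formula; n is the simple math value."""
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    return (fpm_focus + n) / (fpm_ref + n)


FOCUS_SIZE = 100_000        # hypothetical size of the user corpus
REF_SIZE = 1_000_000_000    # hypothetical size of the reference corpus

# Hypothetical counts: (count in user corpus, count in reference corpus)
rare_word = (1, 0)
common_word = (2_000, 15_000_000)

for n in (1, 100):
    rare = keyness(rare_word[0], FOCUS_SIZE, rare_word[1], REF_SIZE, n)
    common = keyness(common_word[0], FOCUS_SIZE, common_word[1], REF_SIZE, n)
    print(f"N={n}: rare={rare:.2f} common={common:.2f}")
```

With these made-up counts, the rare word outranks the common one when N is 1, and the order flips when N is 100, which matches the described behaviour of low and high values.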

Terms are displayed as a gender lemma (the form of the lemma that respects the gender of the head word). See the bibliography below for more information and for the algorithm behind our term extraction.

Bibliography

Adam Kilgarriff (2013). Terminology finding, parallel corpora and bilingual word sketches in the Sketch Engine. In Proceedings ASLIB 35th Translating and the Computer Conference, London, May 2013.

Adam Kilgarriff, Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý and Vít Suchomel (2014). Finding Terms in Corpora for Many Languages with the Sketch Engine. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Sweden, April 2014, pp. 53–56.