Keywords and terms – using Sketch Engine as a term extractor

Sketch Engine doubles as a sophisticated term extractor supporting dozens of languages and outperforming other term extracting tools by using linguistic criteria for term identification. The result contains hardly any noise and does not require manual cleaning. The screenshots on this page are presented as they come out of the system.

To extract terms, use the Keywords and terms feature, which works with translation memories (TMX), a term base or any other document that can be imported into Sketch Engine and automatically processed into a corpus ready for term extraction.

A translation memory (TMX) can also be used with bilingual term extraction which returns a bilingual term list with matching translations.

Term extraction is also available for preloaded corpora via the Word list feature.

Why is it useful?

A readily available term base is vital for translators, interpreters and terminologists to maintain consistency in all of their translations. The TMX upload and TBX export make this easy.

Interpreters can easily generate a list of terminology from a domain they are not familiar with and prepare for their next interpreting job. This is possible even if the interpreter does not have any materials from the domain or subject area at their disposal. Sketch Engine will find relevant texts on the internet, create a specialised corpus and extract terminology.

Terminologists who maintain term bases for their translation agencies, corporate clients or their companies can easily extract terms from the available text resources such as documentation or user manuals. Bilingual lists of terms can be extracted from aligned bilingual material such as a translation memory.

How does it work?

Sketch Engine evaluates words and phrases in your user corpus, translation memory or a term base and decides whether they are specific to the topic of the corpus or whether they are general words or expressions found in texts from many subject areas. Sketch Engine will then produce a list of keywords (single word expressions) and terms (multi-word expressions) with links to a concordance for each item and links to relevant Wikipedia articles. Finally, both keywords and terms can be exported as CSV or TBX for import back into your CAT tool.

How to extract terms

The following procedure can only be used with user corpora. Start by creating your specialized corpus from the web. To create a corpus from your translation memory (tmx), upload the files.

With your corpus ready, click Home (1) and locate the corpus in the My own section (2). Click the wrench button (3) to manage the corpus.


Extract one-word and multi-word terms from a corpus

Click the Keywords and terms option in the left menu.

The term extraction process starts immediately and usually takes a few seconds. It might take a few minutes for very large translation memories or texts.

The output will show a list of keywords and terms together with Wikipedia links and frequency counts. Click the frequency to view examples of the terminology in context.

Download

The term extraction result can be downloaded as a term base in the TBX (TermBase eXchange) format or as CSV and imported back into a CAT tool or other software such as Excel, Google Sheets, Calc (OpenOffice) or a specialized terminology management system.
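For users who want to post-process a downloaded term base with a script, the following is a minimal Python sketch of listing the terms in a TBX document. The sample XML is invented and uses the common termEntry / langSet / tig / term nesting; the file exported by Sketch Engine may contain additional elements (such as a header) or XML namespaces, in which case the element names below would need adjusting.

```python
import xml.etree.ElementTree as ET

# Key under which ElementTree stores the predefined xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample; a real export may be structured differently.
sample = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="t1">
      <langSet xml:lang="en"><tig><term>neural network</term></tig></langSet>
    </termEntry>
    <termEntry id="t2">
      <langSet xml:lang="en"><tig><term>loss function</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""

def list_terms(tbx_text):
    """Return (language, term) pairs found in a TBX string."""
    root = ET.fromstring(tbx_text)
    pairs = []
    for lang_set in root.iter("langSet"):
        lang = lang_set.get(XML_LANG)
        for term in lang_set.iter("term"):
            pairs.append((lang, term.text))
    return pairs

terms = list_terms(sample)
for lang, term in terms:
    print(lang, term)
```

To try this on a downloaded file, read its contents into a string first and pass that string to list_terms.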

What do the black and green terms mean?

Your results may show some terms and keywords in green. Green flags the items that were used as seed words when downloading relevant texts from the web using the WebBootCaT tool; all other items are shown in black. This is useful to know if you want to use your result screen as a source of seed words for expanding your corpus with another WebBootCaT run.

Term links to Wikipedia

Every word has a link to the five most relevant pages on Wikipedia.


Extraction options

The extraction options are available only after the first extraction has been made. Click the Change extraction options link above the extracted terms to access the settings.

Change term extraction options

(1) the reference corpus for single-word terms; it serves as an example of general text to compare the specialised user corpus to. The corpus selected by default is the recommended option (for experts: the reference corpus must have the same term grammar as your corpus).

(2) determines to what extent different word forms of the same word should be merged together. Available options: lc, lemma_lc, lemma, word form

(3) use a higher number to include more high-frequency, and therefore more general, words in the list (see more info about simple math)

(4) leaving this unchecked is recommended; if checked, it excludes high-frequency grammar words from terms. Terminology often contains phrases including these words, so checking it might be counter-productive

(5) excludes tokens containing characters other than letters and digits, e.g. exclamation marks; the recommended setting is unchecked, otherwise designations of products might be excluded

(6) when checked, tokens consisting of digits only are excluded, e.g. phone numbers

(7) sets the minimum frequency for a term to be included, only change if the extraction produces unwanted results

(8) limits the number of extracted keywords, useful especially if (3) is set to a higher value

(9) analogous to (1) but for multi-word terms; only change the pre-set corpus if you have a good reason for doing so

(10) limits the number of extracted terms (multi-word), useful especially if (3) is set to a higher value
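To make the effect of the filtering options more concrete, here is a small Python sketch loosely modelled on options (2), (5), (6) and (7). It is an illustration only, not Sketch Engine's implementation: the function name, the parameter names and the token list are all invented, and real extraction works on a processed corpus rather than a plain list of strings.

```python
from collections import Counter

def candidate_counts(tokens, lowercase=True, only_alphanumeric=False,
                     drop_numbers=True, min_freq=1):
    """Count tokens after applying the selected filters."""
    counts = Counter()
    for tok in tokens:
        if lowercase:                      # like (2): merge "Router" and "router"
            tok = tok.lower()
        if only_alphanumeric and not tok.isalnum():
            continue                       # like (5): drop tokens such as "RX-200"
        if drop_numbers and tok.isdigit():
            continue                       # like (6): drop digit-only tokens
        counts[tok] += 1
    # like (7): keep only items that reach the minimum frequency
    return {t: c for t, c in counts.items() if c >= min_freq}

# Invented token list for illustration.
tokens = ["Router", "router", "RX-200", "RX-200", "2024", "router", "cable"]

print(candidate_counts(tokens, min_freq=2))
print(candidate_counts(tokens, only_alphanumeric=True, min_freq=2))
```

In the second call the product designation RX-200 disappears from the result, which is the kind of loss the recommendation for option (5) is meant to avoid.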

Extraction option

You may tune the result of extracting terminology in the Change extraction options form at the bottom of the same page (below the result). It is possible to choose a reference corpus (it must have the same term grammar) for keywords and terms. The size of the reference corpus influences the processing time. Corpus attribute specifies the attribute used for searching (word, lemma, …); the default setting is the word in lowercase form (ABC and abc are treated the same). We recommend using this default option. Simple math accentuates low-frequency keywords if set to a low value, whereas a higher value gives you higher-frequency keywords. The default value is 1; see more info about simple math.


How to import CSV into your software

How to import a CSV file into Excel/Calc so that the data is separated into columns?

Open your CSV file in Excel, click the Data tab and then click Text to Columns. The Text Import Wizard opens: choose Delimited, click Next and select the delimiter, which in this case is Tab. The data preview below should show the data separated into columns. Finally, click the Finish button. See the Text to Columns Wizard video tutorial or Import a text file by connecting (both on the official Office website). If you use LibreOffice, see the Importing CSV Files and Text to Columns pages.
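If you prefer to process the export with a script instead of a spreadsheet, the same tab-delimited layout can be read with Python's csv module. This is a minimal sketch: the column headings and values in the sample are invented, and a real export may have different columns or extra lines at the top.

```python
import csv
import io

# Invented sample standing in for an exported file.
sample = "term\tfrequency\tscore\nneural network\t120\t85.4\nloss function\t64\t51.2\n"

def read_terms(handle):
    """Read tab-delimited rows into a list of dictionaries keyed by column heading."""
    return list(csv.DictReader(handle, delimiter="\t"))

rows = read_terms(io.StringIO(sample))
for row in rows:
    print(row["term"], int(row["frequency"]))
```

For a real file, pass an open file object instead of the in-memory sample, for example one opened with newline="" and encoding="utf-8".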

Details about term extraction

Keywords and terms are sorted by score, which is based on the keyness score (see simple math).

Terms are displayed in their gender lemma (the form of the lemma that respects the gender of the head word). See the bibliography below for more information and for the algorithm behind our term extraction.
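As a rough illustration of how the simple math parameter shifts the ranking, the sketch below assumes the simple maths formula: frequencies are normalised per million tokens in the focus and the reference corpus, the parameter N is added to both, and the two are divided. The function name and all counts are invented, and the final score used for sorting may involve more than this single ratio.

```python
def keyness(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Simple maths keyness: (fpm_focus + n) / (fpm_ref + n)."""
    fpm_focus = focus_count * 1_000_000 / focus_size  # per-million frequency, focus corpus
    fpm_ref = ref_count * 1_000_000 / ref_size        # per-million frequency, reference corpus
    return (fpm_focus + n) / (fpm_ref + n)

# Invented counts: a rare domain-specific word and a frequent general word.
rare = dict(focus_count=50, focus_size=1_000_000,
            ref_count=10, ref_size=100_000_000)
common = dict(focus_count=5000, focus_size=1_000_000,
              ref_count=300_000, ref_size=100_000_000)

for n in (1, 1000):
    print(n, round(keyness(**rare, n=n), 2), round(keyness(**common, n=n), 2))
```

With N = 1 the rare word scores far higher than the common one; with N = 1000 the order is reversed, which matches the advice to use a higher number when more general, high-frequency words are wanted.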

Bibliography

Adam Kilgarriff (2013). Terminology finding, parallel corpora and bilingual word sketches in the Sketch Engine. In Proceedings ASLIB 35th Translating and the Computer Conference, London, May 2013.

Adam Kilgarriff, Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý and Vít Suchomel (2014). Finding Terms in Corpora for Many Languages with the Sketch Engine. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Sweden, April 2014, pp. 53–56.