OneClick Terms

A simple term extractor interface giving easy access to terminology extraction functionality.

This is a summary of Sketch Engine features aimed at terminologists.

Features for terminology

  • Term extraction finds term candidates in your own documents or in subject-specific texts, which you can upload or which Sketch Engine can find on the web for you.
  • Bilingual terminology extraction can be performed on a translation memory (TM) that the user uploads. The result is a bilingual list of terms and their translations.
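
To make the idea of a term candidate concrete, here is a minimal sketch in Python. It is not Sketch Engine's implementation: it assumes that a candidate is simply a word that is much more frequent (per million tokens) in a subject-specific text than in a general reference text. The names `tokenize`, `keyness_scores` and the `smoothing` parameter are hypothetical, and the sketch handles single words only, whereas real terms are often multi-word.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a deliberate simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def keyness_scores(focus_text, reference_text, smoothing=1.0):
    """Rank the words of the focus text by how typical they are of it.

    Assumed score: (focus frequency per million + smoothing) divided by
    (reference frequency per million + smoothing). This is an illustrative
    choice, not necessarily the score Sketch Engine uses.
    """
    focus = Counter(tokenize(focus_text))
    reference = Counter(tokenize(reference_text))
    focus_total = sum(focus.values())
    reference_total = sum(reference.values()) or 1  # avoid division by zero

    scores = {}
    for word, count in focus.items():
        focus_per_million = count / focus_total * 1_000_000
        reference_per_million = reference[word] / reference_total * 1_000_000
        scores[word] = (focus_per_million + smoothing) / (reference_per_million + smoothing)

    # Highest score first: these are the strongest term candidates.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    focus = "the stent was placed in the artery and the stent held"
    reference = "the cat sat on the mat and the dog sat in the sun"
    for word, score in keyness_scores(focus, reference)[:3]:
        print(word, round(score, 2))
```

With the toy texts above, a domain word such as "stent" ranks first, while a function word such as "the" scores below 1 because it is about as common in the reference text as in the focus text.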

Terminology tasks can be aided with:

  • usage checking – the concordance finds examples of a word or phrase in context, sourced from domain-specific corpora which Sketch Engine can automatically create for you.
  • word sketch – highlights the typical collocations, contexts and word combinations. Use general corpora for information about non-specialized language or subject-specific corpora for professional language.
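
As an illustration of what a concordance shows, the following Python sketch lists each occurrence of a word or phrase together with a fixed number of characters of context on either side (a keyword-in-context view). It is a simplified stand-in, not Sketch Engine's concordancer; the function name `concordance` and the `width` parameter are assumptions made for this example.

```python
import re


def concordance(text, query, width=30):
    """Return (left context, match, right context) for every hit of `query`.

    Matching is case-insensitive and restricted to whole words; `width` is
    the number of context characters kept on each side of the hit.
    """
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", flags=re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


if __name__ == "__main__":
    sample = "The stent was placed. A second stent followed."
    for left, keyword, right in concordance(sample, "stent", width=15):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>15} [{keyword}] {right}")
```

Aligning the keyword in a column, as the usage example does, makes recurring patterns to the left and right of the word easier to spot.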

Automated building of a subject-specific database of texts

Sketch Engine has a built-in tool that allows the user to create a database of subject-specific texts to check how the specialized language is used by real speakers of the language. There are three ways to create such a database (corpus):

  • upload any material the user has access to
  • have Sketch Engine look up and download relevant texts on the web
  • combine the two options above

It is advisable to work with small corpora (e.g. about 100,000 words) made up of terminology-rich texts because they may give more precise results for domain-specific work. Sketch Engine will automatically find and download relevant texts on the internet for you, and your specialized corpus can be ready within minutes. Typically, it takes about 10 minutes to create a 1,000,000-word corpus. All additional functionality will be available automatically: word sketch, concordance, term extraction, n-grams, word lists, etc. (feature availability depends on the language).

Here are a few examples of domain-specific corpora that can be found in Sketch Engine.

  • CAJA corpus – Corpus of Academic Journal Articles
  • Childes corpora – set of corpora containing a rich variety of computerized transcripts from language learners
  • COMPAS corpus – corpus of news articles related to immigration
  • DOAJ corpus – corpus of academic journal articles
  • Europarl parallel corpus – extracted from the proceedings of the European Parliament in 21 languages
  • e-flux corpus – English art news digests
  • GerManC – historical Corpus of German Newspapers 1650–1800
  • SiBol corpus – corpus of English broadsheet newspapers
  • Timestamped corpora – continuously growing web corpora created from news articles, with new data added each month

References

Adam Kilgarriff, Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý and Vít Suchomel (2014). Finding Terms in Corpora for Many Languages with the Sketch Engine. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Sweden, April 2014, pp. 53–56.

Adam Kilgarriff (2013). Terminology finding, parallel corpora and bilingual word sketches in the Sketch Engine. In Proceedings ASLIB 35th Translating and the Computer Conference, London, May 2013.

Vít Baisa, Barbora Ulipová and Michal Cukr (2015). Bilingual Terminology Extraction in Sketch Engine. In Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, Czech Republic, December 2015, pp. 61–67.

Sandra Young (2016). Using corpora in translation. Available here.

Adam Kilgarriff, Ondřej Herman, Jan Bušta, Pavel Rychlý and Miloš Jakubíček (2015). DIACRAN: a framework for diachronic analysis (presentation). In Corpus Linguistics (CL2015), United Kingdom, July 2015.

Ondřej Herman and Vojtěch Kovář (2013). Methods for Detection of Word Usage over Time. In Seventh Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2013. Brno: Tribun EU, pp. 79–85. ISBN 978-80-263-0520-0.

Ondřej Herman (2013). Automatic methods for detection of word usage in time. Bachelor thesis. Masaryk University, Faculty of Informatics.

In the paper below, Adam Kilgarriff offers an interesting, unusual and well-founded view of terminology.

Adam Kilgarriff (1997). I don’t believe in word senses. In Computers and the Humanities, 31(2), pp. 91–113.

Text corpora in Sketch Engine

Sketch Engine offers 500+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.