A summary of Sketch Engine features aimed at terminologists, including term extraction and related functionality.
Features for terminology and terminography
- Term extraction finds term candidates in your own documents or in subject-specific texts, which the user can upload or which Sketch Engine can find on the web.
- Bilingual terminology extraction can be performed on a translation memory (TM) that the user uploads. The result is a bilingual list of terms and their translations.
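As a rough illustration of the idea behind term extraction, the sketch below ranks words by how much more frequent they are in a subject-specific text than in a general reference text. This is a generic keyness calculation under stated assumptions, not necessarily the method Sketch Engine applies; the function name `keyness_scores`, the `smoothing` parameter and the toy texts are all hypothetical.

```python
from collections import Counter


def keyness_scores(focus_tokens, reference_tokens, smoothing=1.0):
    """Rank words by relative frequency in the focus text vs. the reference text.

    Hypothetical helper: frequencies are normalized per million tokens and a
    smoothing constant is added so that words absent from the reference text
    do not cause division by zero.
    """
    focus_counts = Counter(focus_tokens)
    ref_counts = Counter(reference_tokens)
    focus_size = len(focus_tokens)
    ref_size = len(reference_tokens)

    scores = {}
    for word, count in focus_counts.items():
        focus_fpm = count / focus_size * 1_000_000
        ref_fpm = ref_counts[word] / ref_size * 1_000_000
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)

    # Highest score first: words most typical of the focus text
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (assumed): a tiny "domain" text and a tiny "general" text
focus = "the corpus contains concordance lines and the concordance shows collocations".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()

ranked = keyness_scores(focus, reference)
for word, score in ranked[:3]:
    print(word, round(score, 2))
```

Words that are common everywhere, such as "the", score low, while words concentrated in the domain text rise to the top of the candidate list. A real extraction would also handle multi-word units and lemmatization, which this sketch leaves out.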
Terminology tasks can also be aided with:
- usage checking, using concordance searches which find examples of a word or phrase in context, sourced from domain-specific corpora which Sketch Engine can create for you automatically.
- Word Sketch, which highlights typical collocations and word combinations. Use general corpora for information about non-specialized language or subject-specific corpora for professional language.
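To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch. It only illustrates what a concordance line is and does not reflect how Sketch Engine implements its search; the function `concordance`, its `window` parameter and the sample sentence are assumptions made for the example.

```python
def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit.

    Hypothetical helper: matching is case-insensitive and the context is
    limited to `window` tokens on each side.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Sample sentence (assumed) standing in for a domain-specific corpus
text = "The patient received a stent and the stent was checked after surgery"
hits = concordance(text.split(), "stent", window=2)

for left, keyword, right in hits:
    print(f"{left:>15} [{keyword}] {right}")
```

Reading the hits side by side shows which words typically surround the search term, which is the basis for checking how a term is actually used.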
Automated building of a subject-specific database of texts
Sketch Engine has a built-in tool which allows the user to create a database of subject-specific texts which can be used for term extraction or for checking how the specialized language is used by real speakers of the language. There are three ways to create such a database (corpus):
- upload any material the user has access to
- have Sketch Engine look up and download relevant texts on the web
- combine the two options above
It is advisable to work with small corpora (e.g. about 100,000 words) made up of terminology-rich texts because they may give more precise results for domain-specific work. Sketch Engine will automatically find and download relevant texts on the internet for you, and your specialized corpus can be ready within minutes. Typically, it takes about 10 minutes to create a 1,000,000-word corpus. All additional functionality will be available automatically with your corpus: Word Sketch, concordance, term extraction, n-grams, word lists etc. (feature availability depends on the language).
List of domain corpora already available in Sketch Engine
- e-flux corpus – English art news digests
- SiBol corpus – corpus of English broadsheet newspapers
- GerManC – historical Corpus of German Newspapers 1650–1800
- TECU corpora – geodetics web corpora
- RapCor – small corpus of spoken French in rap songs
- CHILDES corpora – a set of corpora containing a rich variety of computerised transcripts from language learners
- Europarl parallel corpus – extracted from the proceedings of the European Parliament in 21 languages
Adam Kilgarriff, Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý and Vít Suchomel (2014). Finding Terms in Corpora for Many Languages with the Sketch Engine. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Sweden, April 2014, pp. 53–56.
Adam Kilgarriff (2013). Terminology finding, parallel corpora and bilingual word sketches in the Sketch Engine. In Proceedings ASLIB 35th Translating and the Computer Conference, London, May 2013.
Vít Baisa, Barbora Ulipová and Michal Cukr (2015). Bilingual Terminology Extraction in Sketch Engine. In Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, Czech Republic, December 2015, pp. 61–67.
Sandra Young (2016). Using corpora in translation. Available on the blog “The Deep End” http://inthedeepend.org/.
Adam Kilgarriff, Ondřej Herman, Jan Bušta, Pavel Rychlý and Miloš Jakubíček (2015). DIACRAN: a framework for diachronic analysis (presentation). In Corpus Linguistics (CL2015), United Kingdom, July 2015.
Ondřej Herman and Vojtěch Kovář (2013). Methods for Detection of Word Usage over Time. In Seventh Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2013. Brno: Tribun EU, pp. 79–85. ISBN 978-80-263-0520-0.
Ondřej Herman (2013). Automatic methods for detection of word usage in time. Bachelor thesis. Masaryk University, Faculty of Informatics.
In the paper below, Adam Kilgarriff offers an interesting, unusual and well-founded view of terminology.
Adam Kilgarriff (1997). I don’t believe in word senses. In Computers and the Humanities, 31(2), pp. 91–113.