Here are resources which you can use to reference Sketch Engine in your papers:

General references

General Reference

logDice statistic

used (since 2008) to compute word sketches

Evaluation of Word Sketches

All statistics used in Sketch Engine

Corpus query language (CQL)

Bibliography of Sketch Engine

2016

  • European Union Language Resources in Sketch Engine
    • BAISA, Vít, Jan MICHELFEIT, Marek MEDVEĎ and Miloš JAKUBÍČEK
    • In the Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 2799–2803, Slovenia, May 2016.
  • Fifty Shades Between Linguistics and Language Engineering
    • JAKUBÍČEK, Miloš, Vít BAISA, Jan BUŠTA, Vojtěch KOVÁŘ, Jan MICHELFEIT, Pavel RYCHLÝ and Vít SUCHOMEL

2015

  • Turkic Language Support in Sketch Engine
    • Vít Baisa and Vít Suchomel
    • In Proceedings of the international conference “Turkic Languages processing: TurkLang 2015”, Russia, September 2015, pp. 214–223
  • Automatic generation of the Estonian Collocations Dictionary database (presentation)
    • Jelena Kallas, Adam Kilgarriff, Kristina Koppel, Elgar Kudritski, Margit Langemets, Jan Michelfeit, Maria Tuulik, Ülle Viks
    • In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. (eds.) Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, August 2015, Herstmonceux Castle, UK, pp. 1–20.
  • Interactive visualization methods for Sketch Engine
    • Lucia Kocincová, Miloš Jakubíček, Vojtěch Kovář and Vít Baisa
    • In Gintaré Grigonyté, Simon Clematide, Andrius Utka, Martin Volk (eds.). Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015. Vilnius, Lithuania: Linköping University Electronic Press, Linköpings universitet, 2015, pp. 17–22
  • Learning Chinese with the Sketch Engine
    • Adam Kilgarriff, Nicole Keng, Simon Smith and Wei Bo
    • In Zou, B., Hoey, M. & Smith, S. (eds.). Corpus Linguistics in Chinese Contexts. Basingstoke: Palgrave, 2015

2014

  • Effective Corpus Virtualization
    • Miloš Jakubíček, Adam Kilgarriff and Pavel Rychlý (2014)
    • In Challenges in the Management of Large Corpora (CMLC-2), May 2014
  • The Sketch Engine: ten years on
    • Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý and Vít Suchomel (2014)
    • In Lexicography: Journal of ASIALEX, volume 1, issue 1, pp. 7–36
  • arTenTen: Arabic Corpus and Word Sketches
    • Tressy Arts, Yonatan Belinkov, Nizar Habash, Adam Kilgarriff and Vít Suchomel (2014)
    • In Journal of King Saud University – Computer and Information Sciences, volume 26, issue 4, December 2014, pp. 381–395
  • Hindi Word Sketches
    • Anil Krishna Eragani, Varun Kuchibhotla, Dipti Sharma, Siva Reddy and Adam Kilgarriff (2014)
    • In Proceedings of the Conference on Natural Language Processing (ICON-11), Goa, India, December 2014, pp. 118–125
  • Text Tokenisation Using unitok
    • Vít Suchomel, Jan Michelfeit and Jan Pomikálek (2014)
    • In Proceedings of the Eighth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2014, Czech Republic, December 2014, pp. 71–75
  • Bilingual Word Sketches: the translate Button
    • Vít Baisa, Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář and Pavel Rychlý
    • In Proceedings of the 16th EURALEX International Congress. 15–19 July 2014, Bolzano, Italy, pp. 505–513

2013

  • The TenTen Corpus Family
    • Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý and Vít Suchomel (2013)
    • In Proceedings of the 7th International Corpus Linguistics Conference CL 2013, the United Kingdom, July 2013, pp. 125–127
  • Web Spam
    • Adam Kilgarriff and Vít Suchomel (2013)
    • In Proceedings of the 8th Web as Corpus Workshop (WAC-8), the United Kingdom, July 2013, pp. 46–52
  • arTenTen: a new, vast corpus for Arabic
    • Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth and Vít Suchomel (2013)
    • In Proceedings of WACL’2 Second Workshop on Arabic Corpus Linguistics, the United Kingdom, July 2013, pp. 20
  • Intrinsic Methods for Comparison of Corpora
    • Vít Baisa and Vít Suchomel (2013)
    • In Proceedings of the Seventh Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2013, Czech Republic, December 2013, pp. 51–58
  • 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング
    • (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen)
    • Irena Srdanović, Vít Suchomel, Toshinobu Ogiso and Adam Kilgarriff (2013)
    • 『「第3回コーパス日本語学ワークショップ」予稿集』国立国語研究所 言語資源研究系・コーパス開発センター (In Proceedings of the 3rd Japanese corpus linguistics workshop, Department of Corpus Studies, Center for Corpus Development, NINJAL), pp. 229–238

2012

  • Word Sense Induction for Novel Sense Detection
    • Jey Han Lau, Paul Cook, Diana McCarthy, David Newman and Timothy Baldwin (2012)
    • In 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), France, April 2012, pp. 591–601
  • Getting to know your corpus
    • Adam Kilgarriff (2012)
    • In Proceedings of The 15th International Conference on Text, Speech and Dialogue (TSD), Petr Sojka, Aleš Horák, Ivan Kopeček and Karel Pala (eds.), Czech Republic, September 2012, pp. 3–15
  • Detecting Spam in Web Corpora
    • Vít Baisa and Vít Suchomel (2012)
    • In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 69–76
  • Recent Czech Web Corpora
    • Vít Suchomel (2012)
    • In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 77–83
  • Word Sketches for Turkish
    • Bharat Ram Ambati, Siva Reddy and Adam Kilgarriff (2012)
    • In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 2945–2950
  • Setting up for corpus lexicography
    • Adam Kilgarriff, Jan Pomikálek, Miloš Jakubíček and Pete Whitelock (2012)
    • In Proceedings of the 15th EURALEX International Congress, Norway, August 2012, pp. 31–55
  • Corpus Tools for Lexicographers
    • Adam Kilgarriff and Iztok Kosem (2012)
    • In Electronic Lexicography, Sylviane Granger and Magali Paquot (eds.), Oxford University Press, October 2012, pp. 31–55
  • Vietnamese Word Sketches
    • Adam Kilgarriff and Phuong Le-Hong (2012)
    • In Workshop on Vietnamese Language and Speech Processing (IEEE-RIVF 9), Vietnam, February 2012, pp. 1–4
  • Building A Thesaurus Using LDA-Frames
    • Jiří Materna (2012)
    • In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 97–103

2011

  • Comparable Corpora BootCaT
    • Adam Kilgarriff, Avinesh PVS and Jan Pomikálek (2011)
    • In Proceedings of eLEX 2011, Slovenia, November 2011, pp. 122–128
  • GDEX for Slovene
    • Iztok Kosem, Miloš Husák and Diana McCarthy (2011)
    • In Proceedings of eLEX 2011, Slovenia, November 2011, pp. 151–159
  • Large Web Corpora for Indian Languages
    • Adam Kilgarriff and Girish Duvuru (2011)
    • In Proceedings of International Conference on Information Systems for Indian Languages (ICISIL), India, 2011, pp. 312–313
  • Polish Word Sketches
    • Adam Radziszewski, Adam Kilgarriff and Robert Lew (2011)
    • In Proceedings of the 5th Language & Technology Conference (LTC), Poland, November 2011, pp. 237–242
  • Japanese Word Sketches: Advances and Problems
    • Irena Srdanović, Naomi Ida, Chikako Shigemori Bučar, Adam Kilgarriff and Vojtěch Kovář (2011)
    • In Acta Linguistica Asiatica, University of Ljubljana, Slovenia, 2011, pp. 63–82

2010

  • Helping Our Own
    • Robert Dale and Adam Kilgarriff (2010)
    • In International Natural Language Generation Conference, Dublin, Ireland
  • Studying Word Sketches for Russian
    • Maria Khokhlova and Victor Zakharov (2010)
    • In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Malta, May 2010, pp. 3491–3494
  • A Case Study in Word Sketches – Czech Verb vidět “see”
    • Karel Pala and Pavel Rychlý (2010)
    • In A Way with Words: Recent Advances in Lexical Theory and Analysis. A Festschrift for Patrick Hanks. Ed. by Gilles-Maurice de Schryver, Menha Publishers, 2010, pp. 187–198
  • Google The Verb
    • Adam Kilgarriff (2010)
    • In Language Resources and Evaluation Journal, 44 (3), pp. 281–290
  • Tickbox Lexicography
    • Adam Kilgarriff, Vojtěch Kovář and Pavel Rychlý (2010)
    • In eLexicography in the 21st century: New challenges, new applications, Presses universitaires de Louvain, Brussels, 2010, pp. 411–418
  • Semi-automatic Dictionary Drafting
    • Adam Kilgarriff and Pavel Rychlý (2010)
    • In A Way with Words: Recent Advances in Lexical Theory and Analysis, Uganda: Menha Publishers Ltd., 2010, pp. 299–312
  • Corpora by Web Services
    • Adam Kilgarriff (2010)
    • In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Malta, May 2010
  • The RoWaC Corpus and Romanian Word Sketches
    • Monica Macoveiciuc and Adam Kilgarriff (2010)
    • In Multilinguality and Interoperability in Language Processing with Emphasis on Romanian, edited by Dan Tufis and Corina Forascu. Romanian Academy, pp. 151–168.
  • A Quantitative Evaluation of Word Sketches
    • Adam Kilgarriff, Vojtěch Kovář, Simon Krek, Irena Srdanović and Carole Tiberius (2010)
    • In Proceedings of the 14th EURALEX International Congress. The Netherlands, July 2010, pp. 372–379

2009

  • Scaling to Billion-plus Word Corpora
    • Jan Pomikálek, Pavel Rychlý and Adam Kilgarriff
    • In Advances in Computational Linguistics, Instituto Politécnico Nacional, volume 41, Mexico, 2009, pp. 3–13
  • Simple maths for keywords
    • Adam Kilgarriff
    • In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.
  • Putting the corpus into the dictionary (first published in 2005 as Linking Dictionary and Corpus)
    • Adam Kilgarriff (2009)
    • In V.B.Y. Ooi, A. Pakir, I.S. Talib and P.K. Tan (eds.). Perspectives in Lexicography: Asia and Beyond, Israel: K Dictionaries, 2009, pp. 239–247
  • Extracting distant collocations of adverbs and modality forms using web corpus and query system
    • Irena Srdanović, Bor Hodošček, Andrej Bekeš and Kikuko Nishina (2009)
    • 「ウェブコーパスと検索システムを利用した推量副詞とモダリティ形式の遠隔共起抽出と日本語教育への応用」『自然言語処理』(Extracting distant collocations of adverbs and modality forms using web corpus and query system, Journal of Natural Language Processing), 16/4, pp. 29–46
  • Towards creation of lexical syllabus based on corpora – on suppositional adverbs and clause-final modality collocations
    • Irena Srdanović, Andrej Bekeš and Kikuko Nishina (2009)
    • 「コーパスに基づいた語彙シラバス作成に向けて―推量的副詞と文末モダリティの共起を中心にして―」『日本語教育』142号 (Towards creation of lexical syllabus based on corpora – on suppositional adverbs and clause-final modality collocations, Journal of Japanese Language Education, 142, pp. 69–79)
  • Czech Word Sketch Relations with Full Syntax Parser
    • Aleš Horák, Pavel Rychlý and Adam Kilgarriff
    • In After Half a Century of Slavonic Natural Language Processing. Czech Republic, Brno: Masaryk University, 2009, pp. 101–112. ISBN 978-80-7399-815-8.
  • Classifying corpora based on adverbs distribution
    • Irena Srdanović, Bor Hodošček, Andrej Bekeš and Kikuko Nishina (2009)
    • In International Quantitative Linguistics Conference (Qualico), Austria, September 2009

2008

  • A Lexicographer-Friendly Association Score
    • Pavel Rychlý (2008)
    • In Second Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2008. Brno, Masaryk University, 2008, pp. 6–9. ISBN 978-80-210-4741-9.
  • Cleaneval: a Competition for Cleaning Web Pages
    • Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff
    • In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco, May 2008, pp. 638–643
  • A web corpus and word sketches for Japanese
    • Irena Srdanović, Tomaž Erjavec and Adam Kilgarriff (2008)
    • A web corpus and word-sketches for Japanese『自然言語処理』(Journal of Natural Language Processing) 15/2, 137–159. (reprinted in Information and Media Technologies 3/3, 2008, pp. 529–551)

2007

  • Manatee/bonito – a modular corpus manager
    • Pavel Rychlý (2007)
    • In First Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2007. Brno: Masaryk University, 2007, pp. 65–70. ISBN 978-80-210-4471-5

2006

  • Slovene Word Sketches
    • Simon Krek and Adam Kilgarriff (2006)
    • In Proceedings of the 5th Slovenian/First International Languages Technology Conference, Slovenia, October 2006

2005 and earlier

  • Chinese word sketches
    • Adam Kilgarriff, Chu-Ren Huang, Pavel Rychlý, Simon Smith and David Tugwell (2005)
    • In Proc. Asialex, Singapore, June 2005
  • The sketch engine
    • Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell (2004)
    • In Proceedings of the 11th EURALEX International Congress. France, July 2004, pp. 105–116 (reprinted in Lexicology: Critical Concepts in Linguistics, P. W. Hanks (ed.), Routledge, 2007)
  • Linguistic Search Engine
    • Adam Kilgarriff (2003)
    • In Proceedings of Workshop on Shallow Processing of Large Corpora, SProLaC03, the United Kingdom, pp. 53–58.

If you have published a paper related to Sketch Engine, please send us the details and, if possible, a link to the document (email: support@sketchengine.co.uk).

Adam Kilgarriff’s bibliography

Theses related to Sketch Engine

Lucia Kocincová (2015). Interactive visualization methods for Sketch Engine. Master's thesis. Masaryk University, Faculty of Informatics.

Abstract: Visualization is undoubtedly one of the most desired methods for displaying data, especially when dealing with so-called big data. Visualization can uncover unnoticed and hidden relationships within the data and, in addition, enables users to understand and interpret the data with less effort. This thesis focuses on interactive visualizations generated from corpus data. First, it introduces the state-of-the-art tools for corpus visualization and a corpus management system named Sketch Engine, for which numerous design concepts were created. Then four of them (corpora overview, thesaurus, word sketch and word sketch difference) were implemented as an online application, mainly using the Data-Driven Documents library. Finally, these visualizations were evaluated through user testing, which revealed that the implemented concepts were not only graphically very appealing but also helpful. Therefore, the interactive visualizations will be incorporated into the Sketch Engine online interface in the near future.

Matouš Ejem (2015). English learner corpora [in Czech]. Bachelor thesis. Masaryk University, Faculty of Arts.

Abstract: Learner corpora conjoin second language acquisition research, foreign language teaching and corpus linguistics. In this work I present available English learner corpora.

Lucie Kaplanová (2015). Collection of linguistically motivated examples of CQL [in Czech]. Bachelor thesis. Masaryk University, Faculty of Arts.

Abstract: This bachelor thesis deals with CQL (Corpus Query Language), a query language for corpora. It explains the use of individual operators, attributes, and structures that can be used in a CQL search. The thesis also includes a set of linguistically oriented CQL queries for Czech and English.

Monika Močiariková (2015). Methods for Automatic Acquisition of Dictionary Definitions [in Slovak]. Bachelor thesis. Masaryk University, Faculty of Arts.

Abstract: The thesis attempts to explain the term “definition” and why it is difficult to say whether a given sentence is a definition or not. It also describes the Sketch Engine system and the CQL language. The practical part is dedicated to the design, implementation and evaluation of queries for automatic definition search.

Dominika Talianová (2014). Corpus Data Visualization. Bachelor thesis. Masaryk University, Faculty of Informatics.

Abstract: This thesis focuses on corpus data represented in graphical form. More specifically, it consists of a survey of visualization tools and a website created to hold visualizations based on two features of Sketch Engine, namely Word Sketch and Sketch-diff. These visualizations represent collocations and their salience in connection to different lemmas. The data essential for these visualizations are processed in JSON format using JavaScript and its D3 library, and are provided by the Natural Language Processing Centre at Masaryk University, Faculty of Informatics in Brno.

Radoslav Rábara (2014). Concurrent programming in searching text corpora [in Slovak]. Bachelor thesis. Masaryk University, Faculty of Informatics.

Abstract: The aim of this thesis is to study approaches used in concurrent processing and to apply them to the evaluation of queries in the Manatee system. The work includes a detailed evaluation of query processing speed with varying numbers of available cores, as well as a comparison of code length between the old and the new implementation.

Ondřej Herman (2013). Automatic methods for detection of word usage in time. Bachelor thesis. Masaryk University, Faculty of Informatics.

Abstract: From a natural language corpus, word usage data over time can be extracted. To detect and quantify change in this data, automatic procedures can be employed. In this work, the theory of ordinary and robust regression methods is discussed and applied to real-world data with great success. A Python implementation is included. Smoothing of time series and detection of seasonality are examined, but ultimately this path does not seem to give satisfactory results for the data explored.

Miloš Husák (2008). Automatic Retrieval of Good Dictionary Examples. Bachelor thesis. Masaryk University, Faculty of Informatics.

Abstract: This thesis proposes and implements an algorithm for evaluating sentences with respect to their understandability and informativeness. It can be embedded into a variety of applications, such as corpus querying tools or automated dictionaries. The proposed algorithm is highly customizable, since it employs a variety of criteria approximating the similarity of sentences to good dictionary examples. It was optimized using machine learning algorithms according to a set of manually labelled concordances. The algorithm is usable in practical applications; however, it is still being developed.