Use the following resources to cite Sketch Engine in your papers:

General references

  • Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel. The Sketch Engine: ten years on. Lexicography, 1: 7-36, 2014.
    @article{kilgarriff2014sketch,
      title={The Sketch Engine: ten years on},
      author={Kilgarriff, Adam and Baisa, Vít and Bušta, Jan and Jakubíček, Miloš and Kovář, Vojtěch and Michelfeit, Jan and Rychlý, Pavel and Suchomel, Vít},
      journal={Lexicography},
      volume={1},
      year={2014},
      pages={7--36},
      publisher={Springer}
    }
  • Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, David Tugwell. ITRI-04-08 The Sketch Engine. Information Technology, 2004.
    @article{kilgarriff2004itri,
      title={{ITRI}-04-08 The Sketch Engine},
      author={Kilgarriff, Adam and Rychlý, Pavel and Smrž, Pavel and Tugwell, David},
      journal={Information Technology},
      year={2004}
    }
  • Please also mention the following web address: http://www.sketchengine.co.uk

logDice statistic

used (since 2008) to compute word sketches

  • Pavel Rychlý. A Lexicographer-Friendly Association Score. Proc. 2nd Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN, 2: 6-9, 2008.
    @article{rychly2008lexicographer,
      title={A Lexicographer-Friendly Association Score},
      author={Rychlý, Pavel},
      journal={Proc. 2nd Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN},
      year={2008},
      pages={6--9},
      publisher={Masaryk University}
    }

Evaluation of Word Sketches

  • Adam Kilgarriff, Vojtěch Kovář, Simon Krek, Irena Srdanović, Carole Tiberius. A quantitative evaluation of word sketches. Proceedings of the 14th EURALEX International Congress: 372-79, 2010.
    @article{kilgarriff2010quantitative,
      title={A quantitative evaluation of word sketches},
      author={Kilgarriff, Adam and Kovář, Vojtěch and Krek, Simon and Srdanović, Irena and Tiberius, Carole},
      journal={Proceedings of the 14th EURALEX International Congress},
      year={2010},
      pages={372--79},
      publisher={Fryske Akademy-Afûk}
    }

All statistics used in Sketch Engine

Corpus query language (CQL)

  • Miloš Jakubíček, Adam Kilgarriff, Diana McCarthy, Pavel Rychlý. Fast Syntactic Searching in Very Large Corpora for Many Languages. PACLIC: 741-47, 2010.
    @article{jakubicek2010fast,
      title={Fast Syntactic Searching in Very Large Corpora for Many Languages},
      author={Jakubíček, Miloš and Kilgarriff, Adam and McCarthy, Diana and Rychlý, Pavel},
      journal={PACLIC},
      year={2010},
      pages={741--47},
      publisher={Tohoku University}
    }

Bibliography of Sketch Engine

2017

  • The advent of post-editing lexicography
    • Miloš Jakubíček
    • In Kernerman Dictionary News, 25, July 2017, pp. 14–15
  • Walking the tightrope between linguistics and language engineering
    • Miloš Jakubíček, Vít Baisa, Jan Bušta, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý and Vít Suchomel

2016

  • An Exploratory Analysis of ScienceBlog
    • Caterina Allais
    • In L’Analisi Linguistica e Letteraria, Facoltà di Scienze Linguistiche e Letterature straniere Università Cattolica del Sacro Cuore, Milano, December 2016, pp. 161–170
  • Annotated Amharic Corpora
    • Pavel Rychlý, Vít Suchomel
    • In Petr Sojka, Aleš Horák, Ivan Kopeček, Karel Pala. Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings, pp. 295-302, DOI 10.1007/978-3-319-45510-5_34
  • European Union Language Resources in Sketch Engine
    • Vít Baisa, Jan Michelfeit, Marek Medveď and Miloš Jakubíček
    • In the Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 2799–2803, Slovenia, May 2016.
  • Finding Definitions in Large Corpora with Sketch Engine
    • Vojtěch Kovář, Monika Močiariková, Pavel Rychlý
    • In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. pp. 391–394
  • RuSkELL: Online Language Learning Tool for Russian Language
    • Valentina Apresjan, Vít Baisa, Olga Buivulova, Olga Kultepina
    • In Tinatin Margalitadze, George Meladze. Proceedings of the XVII EURALEX International congress. Tbilisi: Ivane Javakhishvili Tbilisi State University, 2016. pp. 292–299

2015

  • Turkic Language Support in Sketch Engine
    • Vít Baisa and Vít Suchomel
    • In Proceedings of the international conference “Turkic Languages processing: TurkLang 2015”, Russia, September 2015, pp. 214–223
  • Automatic generation of the Estonian Collocations Dictionary database (presentation)
    • Jelena Kallas, Adam Kilgarriff, Kristina Koppel, Elgar Kudritski, Margit Langemets, Jan Michelfeit, Maria Tuulik, Ülle Viks
    • In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. (eds.) Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, August 2015, Herstmonceux Castle, UK, pp. 1–20.
  • Interactive visualization methods for Sketch Engine
    • Lucia Kocincová, Miloš Jakubíček, Vojtěch Kovář and Vít Baisa
    • In Gintarė Grigonytė, Simon Clematide, Andrius Utka, Martin Volk (eds.). Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015. Vilnius, Lithuania: Linköping University Electronic Press, Linköpings universitet, 2015, pp. 17–22
  • Learning Chinese with the Sketch Engine
    • Adam Kilgarriff, Nicole Keng, Simon Smith and Wei Bo
    • In Zou, B., Hoey, M. & Smith, S. (eds.). Corpus Linguistics in Chinese Contexts. Basingstoke: Palgrave, 2015

2014

  • Effective Corpus Virtualization
    • Miloš Jakubíček, Adam Kilgarriff and Pavel Rychlý (2014)
    • In Challenges in the Management of Large Corpora (CMLC-2), May 2014
  • The Sketch Engine: ten years on
    • Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý and Vít Suchomel (2014)
    • In Lexicography: Journal of ASIALEX, volume 1, issue 1, pp. 7–36
  • arTenTen: Arabic Corpus and Word Sketches
    • Tressy Arts, Yonatan Belinkov, Nizar Habash, Adam Kilgarriff and Vít Suchomel (2014)
    • In Journal of King Saud University – Computer and Information Sciences, volume 26, issue 4, December 2014, pp. 381–395
  • Hindi Word Sketches
    • Anil Krishna Eragani, Varun Kuchibhotla, Dipti Sharma, Siva Reddy and Adam Kilgarriff (2014)
    • In Proceedings of the Conference on Natural Language Processing (ICON-11), Goa, India, December 2014, pp. 11818–125
  • Text Tokenisation Using unitok
    • Vít Suchomel, Jan Michelfeit and Jan Pomikálek (2014)
    • In Proceedings of the Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2014, Czech Republic, December 2014, pp. 71–75
  • Bilingual Word Sketches: the translate Button
    • Vít Baisa, Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář and Pavel Rychlý
    • In Proceedings of the 16th EURALEX International Congress. 15–19 July 2014, Bolzano, Italy, pp. 505–513

2013

  • The TenTen Corpus Family
    • Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý and Vít Suchomel (2013)
    • In Proceedings of the 7th International Corpus Linguistics Conference CL 2013, the United Kingdom, July 2013, pp. 125–127
  • Web Spam
    • Adam Kilgarriff and Vít Suchomel (2013)
    • In Proceedings of the 8th Web as Corpus Workshop (WAC-8), the United Kingdom, July 2013, pp. 46–52
  • arTenTen: a new, vast corpus for Arabic
    • Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth and Vít Suchomel (2013)
    • In Proceedings of WACL’2 Second Workshop on Arabic Corpus Linguistics, the United Kingdom, July 2013, p. 20
  • Intrinsic Methods for Comparison of Corpora
    • Vít Baisa and Vít Suchomel (2013)
    • In Proceedings of the Seventh Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2013, Czech Republic, December 2013, pp. 51–58
  • 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング
    • (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen)
    • Irena Srdanović, Vít Suchomel, Toshinobu Ogiso and Adam Kilgarriff (2013)
    • 『「第3回コーパス日本語学ワークショップ」予稿集』国立国語研究所 言語資源研究系・コーパス開発センター (In Proceedings of the 3rd Japanese corpus linguistics workshop, Department of Corpus Studies, Center for Corpus Development, NINJAL), pp. 229–238

2012

  • Word Sense Induction for Novel Sense Detection
    • Jey Han Lau, Paul Cook, Diana McCarthy, David Newman and Timothy Baldwin (2012)
    • In 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), France, April 2012, pp. 591–601
  • Getting to know your corpus
    • Adam Kilgarriff (2012)
    • In Proceedings of The 15th International Conference on Text, Speech and Dialogue (TSD), Petr Sojka, Aleš Horák, Ivan Kopeček and Karel Pala (eds.), Czech Republic, September 2012, pp. 3–15
  • Detecting Spam in Web Corpora
    • Vít Baisa and Vít Suchomel (2012)
    • In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2012, Czech Republic, December 2012, pp. 69–76
  • Recent Czech Web Corpora
    • Vít Suchomel (2012)
    • In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2012, Czech Republic, December 2012, pp. 77–83
  • Finding Multiwords of More Than Two Words
    • Adam Kilgarriff, Pavel Rychlý, Vojtěch Kovář and Vít Baisa (2012)
    • In Proceedings of the 15th EURALEX International Congress, Norway, August 2012, pp. 693–700
  • Word Sketches for Turkish
    • Bharat Ram Ambati, Siva Reddy and Adam Kilgarriff (2012)
    • In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 2945–2950
  • Setting up for corpus lexicography
    • Adam Kilgarriff, Jan Pomikálek, Miloš Jakubíček and Pete Whitelock (2012)
    • In Proceedings of the 15th EURALEX International Congress, Norway, August 2012, pp. 31–55
  • Corpus Tools for Lexicographers
    • Adam Kilgarriff and Iztok Kosem (2012)
    • In Electronic Lexicography, Sylviane Granger and Magali Paquot (eds.), Oxford University Press, October 2012, pp. 31–55
  • Vietnamese Word Sketches
    • Adam Kilgarriff and Phuong Le-Hong (2012)
    • In Workshop on Vietnamese Language and Speech Processing (IEEE-RIVF 9), Vietnam, February 2012, pp. 1–4
  • Building A Thesaurus Using LDA-Frames
    • Jiří Materna (2012)
    • In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2012, Czech Republic, December 2012, pp. 97–103

2011

  • Comparable Corpora BootCaT
    • Adam Kilgarriff, Avinesh PVS and Jan Pomikálek (2011)
    • In Proceedings of eLEX 2011, Slovenia, November 2011, pp. 122–128
  • GDEX for Slovene
    • Iztok Kosem, Miloš Husák and Diana McCarthy (2011)
    • In Proceedings of eLEX 2011, Slovenia, November 2011, pp. 151–159
  • Large Web Corpora for Indian Languages
    • Adam Kilgarriff and Girish Duvuru (2011)
    • In Proceedings of International Conference on Information Systems for Indian Languages (ICISIL), India, 2011, pp. 312–313
  • Polish Word Sketches
    • Adam Radziszewski, Adam Kilgarriff and Robert Lew (2011)
    • In Proceedings of the 5th Language & Technology Conference (LTC), Poland, November 2011, pp. 237–242

2010

  • Helping Our Own
    • Robert Dale and Adam Kilgarriff (2010)
    • In International Natural Language Generation Conference, Dublin, Ireland
  • Studying Word Sketches for Russian
    • Maria Khokhlova and Victor Zakharov (2010)
    • In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Malta, May 2010, pp. 3491–3494
  • A Case Study in Word Sketches – Czech Verb vidět “see”
    • Karel Pala and Pavel Rychlý (2010)
    • In A Way with Words: Recent Advances in Lexical Theory and Analysis. A Festschrift for Patrick Hanks. Ed. by Gilles-Maurice de Schryver, Menha Publishers, 2010, pp. 187–198
  • Google The Verb
    • Adam Kilgarriff (2010)
    • In Language Resources and Evaluation Journal, 44 (3), pp. 281–290
  • Tickbox Lexicography
    • Adam Kilgarriff, Vojtěch Kovář and Pavel Rychlý (2010)
    • In eLexicography in the 21st century: New challenges, new applications, Presses universitaires de Louvain, Brussels, 2010, pp. 411–418
  • Semi-automatic dictionary drafting
    • Adam Kilgarriff and Pavel Rychlý (2010)
    • In A Way with Words: Recent Advances in Lexical Theory and Analysis, Uganda: Menha Publishers Ltd., 2010, pp. 299–312
  • Corpora by Web Services
    • Adam Kilgarriff (2010)
    • In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Malta, May 2010
  • A corpus factory for many languages
    • Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010)
    • In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010
  • The RoWaC Corpus and Romanian Word Sketches
    • Monica Macoveiciuc and Adam Kilgarriff (2010)
    • In Multilinguality and Interoperability in Language Processing with Emphasis on Romanian Edited by Dan Tufis and Corina Forascu. Romanian Academy, pp. 151–168.

2009

  • Scaling to Billion-plus Word Corpora
    • Jan Pomikálek, Pavel Rychlý and Adam Kilgarriff
    • In Advances in Computational Linguistics, Instituto Politécnico Nacional, volume 41, Mexico, 2009, pp. 3–13
  • Simple maths for keywords
    • Adam Kilgarriff
    • In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.
  • Putting the corpus into the dictionary (first published in 2005 as Linking Dictionary and Corpus)
    • Adam Kilgarriff (2009)
    • In V.B.Y. Ooi, A. Pakir, I.S. Talib and P.K. Tan (eds.). Perspectives in Lexicography: Asia and Beyond, K Dictionaries, Israel, 2009, pp. 239–247
  • Extracting distant collocations of adverbs and modality forms using web corpus and query system
    • Irena Srdanović, Bor Hodošček, Andrej Bekeš and Kikuko Nishina (2009)
    • 「ウェブコーパスと検索システムを利用した推量副詞とモダリティ形式の遠隔共起抽出と日本語教育への応用」『自然言語処理』(Extracting distant collocations of adverbs and modality forms using web corpus and query system, Journal of Natural Language Processing), 16/4, pp. 29–46
  • Towards creation of lexical syllabus based on corpora – on suppositional adverbs and clause-final modality collocations
    • Irena Srdanović, Andrej Bekeš and Kikuko Nishina (2009)
    • 「コーパスに基づいた語彙シラバス作成に向けて―推量的副詞と文末モダリティの共起を中心にして―」『日本語教育』142号 (Towards creation of lexical syllabus based on corpora – on suppositional adverbs and clause-final modality collocations, Journal of Japanese Language Education, 142, pp. 69–79)
  • Czech Word Sketch Relations with Full Syntax Parser
    • Aleš Horák, Pavel Rychlý and Adam Kilgarriff
    • In After Half a Century of Slavonic Natural Language Processing. Czech Republic, Brno: Masaryk University, 2009, pp. 101–112. ISBN 978-80-7399-815-8.
  • Classifying corpora based on adverbs distribution
    • Irena Srdanović, Bor Hodošček, Andrej Bekeš and Kikuko Nishina (2009)
    • In International Quantitative Linguistics Conference (Qualico), Austria, September 2009

2008

  • A Lexicographer-Friendly Association Score
    • Pavel Rychlý (2008)
    • In Second Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2008. Brno, Masaryk University, 2008, pp. 6–9. ISBN 978-80-210-4741-9.
  • Cleaneval: a Competition for Cleaning Web Pages
    • Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff
    • In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco, May 2008, pp. 638–643
  • A web corpus and word sketches for Japanese
    • Irena Srdanović, Tomaž Erjavec and Adam Kilgarriff (2008)
    • A web corpus and word-sketches for Japanese『自然言語処理』(Journal of Natural Language Processing) 15/2, 137–159. (reprinted in Information and Media Technologies 3/3, 2008, pp. 529–551)

2007

  • Manatee/bonito – a modular corpus manager
    • Pavel Rychlý (2007)
    • In First Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2007. Brno: Masaryk University, 2007, pp. 65–70. ISBN 978-80-210-4471-5

2006

  • Slovene Word Sketches
    • Simon Krek and Adam Kilgarriff (2006)
    • In Proceedings of the 5th Slovenian/First International Language Technologies Conference, Slovenia, October 2006

2005 and earlier

  • Chinese Word Sketches
    • Adam Kilgarriff, Chu-Ren Huang, Pavel Rychlý, Simon Smith and David Tugwell (2005)
    • In Proc. Asialex, Singapore, June 2005
  • The sketch engine
    • Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell (2004)
    • In Proceedings of the 11th EURALEX International Congress. France, July 2004, pp. 105–116 (reprinted in Lexicology: Critical Concepts in Linguistics, P. W. Hanks (ed.), Routledge, 2007)
  • Linguistic Search Engine
    • Adam Kilgarriff (2003)
    • In Proceedings of Workshop on Shallow Processing of Large Corpora, SProLaC03, the United Kingdom, pp. 53–58.

If you have published a paper related to Sketch Engine, please send us the details and, if possible, a link to the document (email: support@sketchengine.co.uk).

Adam Kilgarriff’s bibliography

Theses related to Sketch Engine

Jiří Kletečka. Wikipedia Learner's Corpus. Master's thesis, Masaryk University, Faculty of Informatics, 2017. (in Czech)

@mastersthesis{kletecka2017wikipedia,
  title={Wikipedia Learner's Corpus},
  author={Kletečka, Jiří},
  school={Masaryk University, Faculty of Informatics},
  year={2017}
}

Abstract: This bachelor’s thesis deals with the automated creation of an error-annotated corpus from the edit history of Wikipedia articles. Such a corpus contains the newest versions of articles, with errors marked on the basis of their editing history. For this purpose, a new tool was designed and implemented. It was then used to create the corpus from a Czech Wikipedia database dump, and the corpus was uploaded to the faculty server for public use through the Sketch Engine interface.

Michal Cukr. Czech corpus of example sentences. Master's thesis, Masaryk University, Faculty of Arts, 2017. (in Czech)

@mastersthesis{cukr2017czech,
  title={Czech corpus of example sentences},
  author={Cukr, Michal},
  school={Masaryk University, Faculty of Arts},
  year={2017}
}

Abstract: The purpose of this work was to create a Czech text corpus of example sentences for SkELL, a dedicated language-learning interface. As source texts, we downloaded websites chosen for selective harvests by the Czech Webarchiv, and Czech Wikipedia including its discussion pages. The third source is a part of the JSI Newsfeed Corpus. The crawled texts were prepared with corpus-processing tools and the final text collection was deduplicated. Afterwards, we performed several rounds of cleaning. The thesis includes some examples from the resulting corpus. The corpus of Czech example sentences is hosted in the university installation of Sketch Engine (https://ske.fi.muni.cz/). Public access to the corpus is via the SkELL interface available at http://cskell.sketchengine.co.uk/run.cgi/skell.

Radoslav Rábara. Parallelization of the corpus manager's time-consuming operations. Master's thesis, Masaryk University, Faculty of Informatics, 2016. (in Czech)

@mastersthesis{rabara2016parallelization,
  title={Parallelization of the corpus manager's time-consuming operations},
  author={Rábara, Radoslav},
  school={Masaryk University, Faculty of Informatics},
  year={2016}
}

Abstract: The Manatee corpus manager can process large corpora containing billions of words. Some operations on search results from such large corpora can be time-consuming. This thesis presents and describes a system that computes selected operations in parallel. The system is evaluated on a single computer and on a cluster of computers. The evaluation covers scalability and comparisons with the Manatee system and with a MapReduce system that provides a platform for distributed computing.

Lucia Kocincová (2015). Interactive visualization methods for Sketch Engine. Master thesis. Masaryk University, Faculty of Informatics.

Abstract: Visualization is undoubtedly one of the most desired methods for displaying data, especially when dealing with so-called big data. Visualization can uncover unnoticed and hidden relationships within the data and, in addition, enables users to understand and interpret the data with less effort. This thesis focuses on interactive visualizations generated from corpus data. First, it introduces the state-of-the-art tools for corpus visualization and a corpus management system named Sketch Engine, for which numerous design concepts were created. Then four of them (corpora overview, thesaurus, word sketch and word sketch difference) were implemented as an online application, mainly using the Data-Driven Documents library. Finally, these visualizations were evaluated through user testing, which revealed that the implemented concepts were not only graphically very appealing but also helpful. Therefore, the interactive visualizations will be incorporated into the Sketch Engine online interface in the near future.

Matouš Ejem (2015). English learner corpora [in Czech]. Bachelor thesis. Masaryk University, Faculty of Arts.

Abstract: Learner corpora conjoin second language acquisition research, foreign language teaching and corpus linguistics. In this work I present available English learner corpora.

Lucie Kaplanová (2015). Collection of linguistically motivated examples of CQL [in Czech]. Bachelor thesis. Masaryk University, Faculty of Arts.

Abstract: This bachelor thesis deals with a query language for corpora called CQL (Corpus Query Language). It explains the use of the individual operators, attributes, and structures that can be used in a CQL search. The thesis also includes a set of linguistically oriented CQL queries for Czech and English.

Monika Močiariková (2015). Methods for Automatic Acquisition of Dictionary Definitions [in Slovak]. Bachelor thesis. Masaryk University, Faculty of Arts.

Abstract: The thesis attempts to explain the term "definition" and why it is difficult to say whether a given sentence is a definition or not. It also describes the Sketch Engine system and the CQL language. The practical part is dedicated to the design, implementation and evaluation of queries for automatic definition search.

Dominika Talianová (2014). Corpus Data Visualization. Bachelor thesis. Masaryk University, Faculty of Informatics.

Abstract: This thesis focuses on corpus data represented in graphical form. More specifically, it consists of a survey of visualization tools and a website created to hold visualizations based on two features of Sketch Engine, namely Word Sketch and Sketch-diff. These visualizations represent collocations and their salience in connection with different lemmas. The data essential for these visualizations are provided in JSON format by the Natural Language Processing Centre at Masaryk University, Faculty of Informatics in Brno, and are processed with JavaScript and its D3 library.

Radoslav Rábara (2014). Concurrent programming in searching text corpora [in Slovak]. Bachelor thesis. Masaryk University, Faculty of Informatics.

Abstract: The aim of this thesis is to study approaches used in concurrent processing and to apply them to the evaluation of queries in the Manatee system. The work includes a detailed evaluation of query processing speed with various numbers of cores available, as well as a comparison of code length between the old and the new implementation.

Ondřej Herman (2013). Automatic methods for detection of word usage in time. Bachelor thesis. Masaryk University, Faculty of Informatics.

Abstract: From a natural language corpus, word usage data over time can be extracted. To detect and quantify change in this data, automatic procedures can be employed. In this work, the theory of ordinary and robust regression methods is discussed and applied to real-world data with great success. A Python implementation is included. Smoothing of time series and detection of seasonality are examined, but ultimately this path does not seem to give satisfactory results for the data explored.

Miloš Husák (2008). Automatic Retrieval of Good Dictionary Examples. Bachelor thesis. Masaryk University, Faculty of Informatics.

Abstract: This thesis proposes and implements an algorithm for evaluation of sentences with respect to their understandability and informativeness. It can be embedded into a variety of applications, such as corpus querying tools or automated dictionaries. The proposed algorithm is highly customizable, since it employs a variety of criteria approximating the similarity of sentences to good dictionary examples. It was optimized using machine learning algorithms according to a set of manually labelled concordances. The algorithm is usable in practical applications; however, it is still being developed.