Introduction, related links

Registration and first log in

Sketch Engine – building your own corpora

Think of a specialist domain you know about: it could be photography, horse riding, Turkish carpets, … Now think of five specialist terms (probably in your mother tongue) in that domain. They can be one word or multi-word.

Now: on your Sketch Engine home page,

  • click on ‘WebBootCaT’
  • input a name for the corpus you are building
  • select a language
  • enter the domain terms you thought of as ‘seed words’, with a space between them and with multi-word terms surrounded by double quotes
  • click ‘next’, check the URLs, then click OK; your corpus should now compile (wait a minute…)
  • under ‘My Corpora’, check that the corpus is there and click on it
  • How big is the corpus in words? How many files/web pages does it contain?
  • extract keywords
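If you keep your terms in a list, the seed-word formatting described above (spaces between terms, double quotes around multi-word terms) can be sketched in a few lines of Python. This is only an illustration: the function name `format_seed_words` and the photography terms are made up for the example, and nothing here talks to Sketch Engine itself.

```python
def format_seed_words(terms):
    """Join seed terms with spaces, double-quoting any multi-word term."""
    formatted = []
    for term in terms:
        term = " ".join(term.split())  # collapse stray whitespace
        if not term:
            continue  # skip empty entries
        formatted.append(f'"{term}"' if " " in term else term)
    return " ".join(formatted)


# Hypothetical seed terms for a photography corpus.
seeds = ["aperture", "shutter speed", "bokeh", "depth of field", "tripod"]
print(format_seed_words(seeds))
```

The printed string can then be pasted into the ‘seed words’ box.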

Sketch Engine – corpus querying

Sketch Engine has a number of language-analysis functions, but the ones we will mainly be using in this workshop are:

  • Concordancer: full-text, linguistically motivated querying

  • Word Sketch: a one-page summary of a word’s collocational behaviour

For more information about the Sketch Engine, see Kilgarriff, A. et al. (2014) The Sketch Engine: Ten Years On.

Concordance

We will work with the British National Corpus (BNC). When you open a corpus, you go by default to the Concordance search.

The first set of menu items in the left-hand column takes you to other parts of the program, such as the Word Sketch or Thesaurus.

For the simplest type of search, enter your search term in the Simple Query box.

The three hyperlinks below the search box (Query types, Text types, and Context) each provide options for advanced searches.

Query types

If you click on ‘Query types’, you’ll have the option of searching for lemmas, phrases, or word forms, or of entering a CQL query.

Searching for Lemmas

If the corpus is lemmatized, a search for a lemma (e.g. tell) will generate a concordance of all of the related word forms.

Searching for Phrases

Here you can enter any multiword expression, such as a compound noun or preposition (like business school or in preference to) or a longer string, such as you must be joking or weapons of mass destruction.

Searching for Word Forms

This allows you to search for a specific word form, such as burns, and you can optionally specify that you are looking for burns as a verb or burns as a plural noun.  You can make your search case-sensitive by checking the ‘match case’ box: this will enable you to search for Bush rather than bush, or pole but not Pole.
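To see how a lemma search differs from a word-form search, here is a toy sketch in Python. It assumes a tiny hand-tagged token list with simplified tags ("V", "N", …); the `Token` type, the tags, and the function names are invented for this example and are not how Sketch Engine stores or queries a corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # the word form as it appears in the text
    lemma: str  # the base form
    pos: str    # simplified part-of-speech tag (an assumption of this sketch)


# Invented, hand-tagged toy data.
TOKENS = [
    Token("He", "he", "PRON"), Token("tells", "tell", "V"), Token("Bush", "Bush", "N"),
    Token("the", "the", "DET"), Token("bush", "bush", "N"), Token("burns", "burn", "V"),
    Token("and", "and", "CONJ"), Token("she", "she", "PRON"), Token("told", "tell", "V"),
    Token("of", "of", "PREP"), Token("her", "her", "DET"), Token("burns", "burn", "N"),
]


def find_lemma(tokens, lemma):
    """Positions of every word form belonging to the given lemma."""
    return [i for i, t in enumerate(tokens) if t.lemma == lemma]


def find_word_form(tokens, form, pos=None, match_case=False):
    """Positions of one specific word form, optionally restricted by POS and case."""
    hits = []
    for i, t in enumerate(tokens):
        same_form = t.word == form if match_case else t.word.lower() == form.lower()
        if same_form and (pos is None or t.pos == pos):
            hits.append(i)
    return hits


print(find_lemma(TOKENS, "tell"))                         # tells and told
print(find_word_form(TOKENS, "burns", pos="N"))           # burns as a noun only
print(find_word_form(TOKENS, "Bush", match_case=True))    # Bush but not bush
```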

CQL

This is for inputting complex queries using Corpus Query Language. CQL is described in Corpus Querying and Grammar Writing.
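As a rough idea of what CQL queries look like, the sketch below assembles two query strings in Python. A CQL query is a sequence of tokens in square brackets, each with attribute conditions such as `lemma` or `tag`, and `[]{0,3}` stands for a gap of zero to three arbitrary tokens. The helper names `token` and `gap` are made up for this example, and the tag patterns (`V.*`, `N.*`, `AJ.*`) are assumptions: tag names depend on the tagset of the corpus you are querying, so check the corpus documentation before using them.

```python
def token(**attrs):
    """Build one CQL token expression, e.g. [lemma="shake" & tag="V.*"]."""
    if not attrs:
        return "[]"  # an empty token matches any single token
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def gap(min_tokens, max_tokens):
    """Build a gap of between min_tokens and max_tokens arbitrary tokens."""
    return f"[]{{{min_tokens},{max_tokens}}}"


# The verb shake followed within a few tokens by the noun head.
shake_head = " ".join([
    token(lemma="shake", tag="V.*"),
    gap(0, 3),
    token(lemma="head", tag="N.*"),
])

# The verb taste immediately followed by an adjective.
taste_adj = " ".join([token(lemma="taste", tag="V.*"), token(tag="AJ.*")])

print(shake_head)
print(taste_adj)
```

The resulting strings are what you would type into the CQL box.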

Context

Here you can specify the right and/or left context of your search word, within a window of up to ten items on either side of the search word (in practice, you are unlikely to need such a large window). As context, you can specify either a particular word or one or more word classes (POS).

(1) Search for the verb shake followed by the noun head, to find instances such as she shook her head, if you agree shake your head, and shaking their heads in disbelief…

(2) Search for the verb taste followed by any adjective.
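The logic behind a context search can be sketched as follows: find each hit for the search word, then keep it only if a matching word or word class occurs within the chosen window. The token tuples, simplified tags, and function names below are invented for illustration; unlike a real concordancer, this toy version also ignores sentence boundaries.

```python
# Invented toy data: (word form, lemma, simplified POS tag).
TAGGED = [
    ("She", "she", "PRON"), ("shook", "shake", "V"), ("her", "her", "DET"),
    ("head", "head", "N"), ("The", "the", "DET"), ("soup", "soup", "N"),
    ("tastes", "taste", "V"), ("good", "good", "ADJ"), ("He", "he", "PRON"),
    ("shook", "shake", "V"), ("the", "the", "DET"), ("bottle", "bottle", "N"),
]


def has_right_context(tokens, i, window, lemma=None, pos=None):
    """True if a token matching lemma and/or pos occurs within `window` tokens to the right of i."""
    for _word, tok_lemma, tok_pos in tokens[i + 1 : i + 1 + window]:
        if (lemma is None or tok_lemma == lemma) and (pos is None or tok_pos == pos):
            return True
    return False


def search_with_context(tokens, lemma, pos, window, ctx_lemma=None, ctx_pos=None):
    """Positions of lemma/pos hits that satisfy the right-context constraint."""
    return [
        i
        for i, (_word, tok_lemma, tok_pos) in enumerate(tokens)
        if tok_lemma == lemma
        and tok_pos == pos
        and has_right_context(tokens, i, window, ctx_lemma, ctx_pos)
    ]


# (1) shake (verb) with head (noun) within three tokens to the right
print(search_with_context(TAGGED, "shake", "V", 3, ctx_lemma="head", ctx_pos="N"))
# (2) taste (verb) immediately followed by any adjective
print(search_with_context(TAGGED, "taste", "V", 1, ctx_pos="ADJ"))
```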

Text Types

This dialog allows you to limit your search to a specific part of the corpus (such as spoken texts only or texts from one particular time-span). It is fairly self-explanatory. Note that the Text Types that are available vary greatly from corpus to corpus. The British National Corpus has rich text-type information whereas web corpora tend to have very little.

Manipulating concordances

Once you have generated a concordance, there are several options for increasing its usefulness. See the summary at the top of the concordance.

Options in left-hand column

View Options: lets you toggle between standard KWIC concordance view (which appears by default) and full sentence view, and also takes you to a new screen that allows you to change the concordance view in a number of ways: for example, you might want to see the part-of-speech tags as well as the words.
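If you have not met the term before, a KWIC (keyword in context) view simply lines up the node word in a fixed column with a few words of context on either side. A minimal sketch, assuming a plain list of words and made-up names (`kwic_line`, `span`, `width`):

```python
def kwic_line(words, i, span=4, width=30):
    """Format the word at position i with `span` words of context on each side."""
    left = " ".join(words[max(0, i - span) : i])
    right = " ".join(words[i + 1 : i + 1 + span])
    # Right-justify the left context so that the node words line up vertically.
    return f"{left:>{width}}  {words[i]}  {right}"


words = "This question haunted me for half of last year".split()
print(kwic_line(words, 2))
```

Full sentence view, by contrast, would show the whole sentence regardless of where the node word falls in it.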

Sort: you can either click one of the indented options to do a simple sort, or click ‘Sort’ itself to enter the Sort screen, where you can specify a more complex sort procedure. Sorting is often a quick way of revealing patterns: sorting the concordance for haunt to the right shows 9 instances of haunt me for [TIME-PHRASE] in the BNC (e.g. This question haunted me for half of last year). Simple sorts include sorting the concordance to the Right or Left, or sorting it according to the Node word (this would put all the instances of haunt first, then haunted, then haunting, then haunts).
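The three simple sorts can be pictured like this. The sketch assumes each concordance line is stored as a (left context, node, right context) triple; apart from the haunted example quoted above, the lines are invented, and the exact sorting rules Sketch Engine applies may differ from this simplified version.

```python
from typing import NamedTuple


class ConcLine(NamedTuple):
    left: str
    node: str
    right: str


# One line from the text above plus three invented ones.
LINES = [
    ConcLine("This question", "haunted", "me for half of last year"),
    ConcLine("The image will", "haunt", "him for years"),
    ConcLine("A ghost", "haunts", "the old house"),
    ConcLine("It still", "haunts", "me for some reason"),
]


def sort_right(lines):
    """Sort by the words to the right of the node, nearest word first."""
    return sorted(lines, key=lambda l: l.right.lower())


def sort_left(lines):
    """Sort by the words to the left of the node, nearest word first."""
    return sorted(lines, key=lambda l: [w.lower() for w in reversed(l.left.split())])


def sort_node(lines):
    """Sort by the node word form itself."""
    return sorted(lines, key=lambda l: l.node.lower())


for line in sort_right(LINES):
    print(f"{line.left:>15}  {line.node}  {line.right}")
```

After a right sort, the two haunt me for … lines end up next to each other, which is how such patterns become visible.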

Sample: useful if you are looking at a very frequent search item. It allows you to create a random sample of the concordance lines, of any size you specify. If you search for play as a verb and decide that you don’t want to analyse 37,632 lines, use ‘Sample’ to reduce this to a manageable number.
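Conceptually, sampling just picks a random subset of line numbers from the full concordance. A minimal sketch using Python's standard `random` module; the function name, the sample size of 250, and the seed are arbitrary choices for this example, not Sketch Engine settings.

```python
import random


def sample_line_numbers(total_lines, sample_size, seed=None):
    """Pick `sample_size` distinct line numbers at random, returned in corpus order."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    size = min(sample_size, total_lines)
    return sorted(rng.sample(range(total_lines), size))


# Reduce the 37,632 lines for play (verb) to a 250-line sample.
sample = sample_line_numbers(37_632, 250, seed=1)
print(len(sample), sample[:5])
```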

Filter: allows you to specify constraints on the context of your concordance, in order to retrieve a subset of your concordance.

Frequency: clicking Frequency itself allows you to view two types of frequency information regarding your search term:

  • Multilevel frequency distribution shows the frequency of each form of a given lemma.

  • Text Type frequency distribution shows how your search term is distributed through the texts in the corpus. You may find, for example, that a word like police appears significantly more often in newspaper texts than in other text types. This is a potentially useful tool which could show you – for example – that a particular medical term is not restricted to specialized medical discourse.

You can alternatively use the simpler frequency options below Frequency to sort by:

  • Node tags: the PoS tags for all the KWIC word forms (node word types)

  • Node forms: the word forms for all the KWIC word forms

  • Doc IDs: frequency distribution over the document ids

  • Text Types: frequency distribution over all the text types specified for the corpus
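Each of these options amounts to counting the concordance lines by one attribute. The sketch below assumes each hit is a small record with a word form, a PoS tag, a document ID, and a text type; the records, the document IDs, and the field names are all invented for illustration.

```python
from collections import Counter

# Invented hits for the lemma police: (node form, node tag, doc ID, text type).
HITS = [
    ("police", "NN", "A01", "newspaper"),
    ("police", "NN", "A01", "newspaper"),
    ("Police", "NN", "B02", "newspaper"),
    ("policed", "VVD", "C03", "fiction"),
    ("police", "NN", "C03", "fiction"),
]

FIELDS = {"node_form": 0, "node_tag": 1, "doc_id": 2, "text_type": 3}


def frequency_by(hits, field):
    """Count the hits by one attribute, e.g. 'node_form' or 'text_type'."""
    return Counter(hit[FIELDS[field]] for hit in hits)


for field in FIELDS:
    print(field, frequency_by(HITS, field).most_common())
```

Raw counts like these are only comparable across text types of similar size, so in practice you would also look at relative frequencies.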

Collocations: allows you to generate lists of words that co-occur frequently with your node word (its ‘collocates’).  In general, however, the Word Sketch provides a more sophisticated account of collocation.
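Collocates are usually ranked by an association score rather than by raw co-occurrence counts. One such score is logDice, which to my knowledge is the default in Sketch Engine; treat that as an assumption and check the current documentation. It is computed as 14 + log2(2 × f(node, collocate) / (f(node) + f(collocate))). The counts in the sketch below are hypothetical and only illustrate the arithmetic.

```python
import math


def log_dice(pair_freq, node_freq, collocate_freq):
    """logDice = 14 + log2(2 * f(node, collocate) / (f(node) + f(collocate)))."""
    if pair_freq <= 0 or node_freq <= 0 or collocate_freq <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * pair_freq / (node_freq + collocate_freq))


# Hypothetical counts: (co-occurrences, node frequency, collocate frequency).
candidates = {
    "frequent but weak collocate": (200, 5_000, 900_000),
    "rarer but strong collocate": (90, 5_000, 3_000),
}

ranked = sorted(candidates, key=lambda name: log_dice(*candidates[name]), reverse=True)
for name in ranked:
    print(f"{name}: {log_dice(*candidates[name]):.2f}")
```

With these made-up numbers the rarer collocate ranks first, because it co-occurs with the node in a much larger share of its own occurrences.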

ConcDesc: provides a technical description of your query. This is useful for programmers and technical people.

Visualize: shows you the distributional graph (clickable) of the concordance within the corpus.

Finding out about a particular concordance line

If you click on one of the node words, more of its context appears in a pop-up at the bottom of the screen. The pop-up here is showing fuller context for the first concordance line on the page.

To get information about the source-text a particular concordance line comes from, click the document-id code (e.g. J0P) at the left-hand end of the relevant line. This brings up ‘header’ information in the bottom pane.

Word Sketch

A Word Sketch is a corpus-based summary of a word’s grammatical and collocational behaviour.

Choose a lemma and specify its part of speech using the drop-down list. Word Sketches are available for nouns, verbs, and adjectives, but not for other word classes. They also depend on the availability of substantial amounts of data, so if you try to create a Word Sketch for a fairly rare item (e.g. coagulate) you may see a message saying there is no Word Sketch available. In general, you need several hundred instances of a word to make a useful Word Sketch.

In the Word Sketch for the noun challenge, each column shows the words that typically combine with challenge in a particular grammatical relation (or ‘gramrel’). Most of these gramrels are self-explanatory. For example, ‘object_of’ lists – in order of statistical significance rather than raw frequency – the verbs that most typically occupy the verb slot in cases where challenge is the object of a verb. Most of the data is lexicographically relevant, though one might query the adjectival modifier larval: it turns out that ‘larval challenge’ is a technical term used in parasitology, discussed in a BNC document.

You can at any time switch between Concordance mode and Word Sketch mode, and this is a useful way of getting more information about a particular word combination. Thus, if you want to look at examples of the string ‘pose + challenge’, simply click on the number next to ‘pose’ in the object_of list (92) and you will be taken directly to a concordance showing all instances of this combination.

Other features

Thesaurus, Trends, Word Sketch Difference, Corpus Information, Parallel corpora, Word list.