Methods and specific attributes in Sketch Engine

Sketch Engine operates with many methods containing various attributes. In this section, all methods are listed and the attributes specific to each method are described. The “universal attributes” (which can be used with all methods) are described in the “Universal attributes” section. Note that some characters (e.g. space) that can be contained in the attribute values must be escaped.

For more information on escaping, see e.g. http://en.wikipedia.org/wiki/Percent-encoding. The output of these methods is described on the JSON API documentation page.
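As a minimal sketch of the escaping mentioned above, the following uses Python's standard urllib.parse module to percent-encode an attribute value. The sample value is only an illustration; whether a given client library already escapes values for you is something to check for your own setup.

```python
from urllib.parse import quote, unquote

# An illustrative attribute value containing spaces, quotes and brackets.
raw_value = '[lemma="drug"] [tag="N.*"]'

# safe='' makes quote() escape every reserved character, including '/'.
escaped_value = quote(raw_value, safe='')
print(escaped_value)

# Decoding restores the original value unchanged.
restored_value = unquote(escaped_value)
print(restored_value == raw_value)
```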

Universal attributes

There are a few attributes that can be used with any method.

Parameter Type Default Description
corpname string REQUIRED name of the corpus to be processed, in the short form (e.g. ‘bnc’). To query your own corpus (e.g. username john, corpus mycorpus), use the value user/john/mycorpus
usesubcorp string empty name of the subcorpus to be processed. The default (empty) means working with the entire corpus
format string empty the format of the output; an empty value is interpreted as HTML; the only other possible value is json, which produces output in the JSON format
json JSON all input attributes encoded as a string in JSON
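The two ways of passing attributes can be sketched as follows: either each attribute is its own CGI parameter, or all of them are packed into the single json attribute. This is an illustration under assumptions: the attribute values are made up, and urlencode joins parameters with ‘&’ rather than the ‘;’ used in the examples on this page.

```python
import json
from urllib.parse import urlencode, parse_qs

# Hypothetical attribute set for a single call.
attributes = {'corpname': 'user/john/mycorpus', 'format': 'json', 'subcorpora': 1}

# Variant 1: every attribute is passed as its own CGI parameter.
plain_query = urlencode(attributes)

# Variant 2: all attributes are packed into the single json attribute.
json_query = urlencode({'json': json.dumps(attributes)})

# Decoding the second variant gives back the same attribute set.
decoded = json.loads(parse_qs(json_query)['json'][0])

print(plain_query)
print(json_query)
```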

corp_info

Provides detailed information about the corpus, lexicon sizes etc.

Parameter Type Default Description
gramrels boolean 0 get list of grammar relations from the respective word sketch grammar
corpcheck boolean 0 get output from last corpcheck (if available in compilation log)
registry boolean 0 get registry file content and settings from manatee (might differ)
subcorpora boolean 0 get list of subcorpora and their respective sizes (in tokens, words)
struct_attr_stats boolean 0 get structure attributes, and their lexicon sizes

Example

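import requests

# requests needs an absolute URL; this base is the one used in the wsketch example script
base_url = 'https://api.sketchengine.co.uk/bonito/run.cgi'
requests.get(base_url + '/corp_info?corpname=bnc2;gramrels=1;subcorpora=1')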

Example output

{
 "info": "Balanced English corpus ...",
 "encoding": "UTF-8",
 "compiled": "12/07/2016 14:31:51",
 "unicameral": false,
 "alsizes": [],
 "tagsetdoc": "https://...",
 "gramrels": [],
 "structs": [],
 "wposlist": [
 ["adjective", "AJ."],
 ...
 ],
 "lang": "English",
 "name": "British National Corpus (BNC) ...",
 "sizes": {
 "tokencount": "112181015",
 "sentcount": "6052184",
 "wordcount": "96052598",
 "normsum": "96052598",
 "parcount": "1514906",
 "doccount": "4054"
 },
 "subcorpora": [],
 "infohref": "http://...",
 "lposlist": [
 ["adjective", "-j"],
 ...
 ],
 "attributes": [
 {
 "fromattr": "",
 "id_range": 0,
 "dynamic": "",
 "name": "word",
 "label": ""
 },
 ...
 ]
}
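Note that the counts in the example output are strings. A short sketch of reading them, using a trimmed copy of the output above as input (the derived ratio is only an illustration of doing arithmetic on the converted values):

```python
import json

# A trimmed copy of the example output above.
response_text = '''
{
  "name": "British National Corpus (BNC) ...",
  "sizes": {
    "tokencount": "112181015",
    "sentcount": "6052184",
    "wordcount": "96052598"
  }
}
'''

info = json.loads(response_text)

# The counts arrive as strings, so convert them before doing arithmetic.
sizes = {key: int(value) for key, value in info['sizes'].items()}
tokens_per_sentence = sizes['tokencount'] / sizes['sentcount']

print(sizes['tokencount'], round(tokens_per_sentence, 1))
```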

wordlist

This method provides the “word list” and “keywords” functions that are normally available under the “Word List” link in the web interface.

Attributes:

  • keywords – if empty, “word list” is returned, else “keywords” function is used
  • wlattr – corpus attribute that you want to work with. This attribute is required.
  • wlnums – defines the type of frequency figures – possible values are: frq, docf, arf (for word frequency, document frequency and ARF respectively)
  • wlminfreq – minimum frequency in corpus (default 5)
  • wlmaxitems – maximum number of displayed lines (default 100)
  • wlpat – regular expression that specifies the word list pattern (default ‘.*’ – all words). Relevant only in combination with “word list” function.
  • wlicase – “ignore case” mark. Values ‘1’, ‘0’ (default). Relevant only in combination with “word list” function.
  • wlsort – if ‘f’, resulting word list is sorted according to frequency. Else alphabetically according to attribute (default). Relevant only in combination with “word list” function.
  • ref_corpname – corpus name (in the short form, e.g. ‘bnc’) of the reference corpus. Relevant only in combination with “keywords” function. In this case, it is required.
  • ref_usesubcorp – reference subcorpus name. Relevant only in combination with “keywords” function.
  • wlfile – allows sending a file with a whitelist via a POST request
  • wlblacklist – allows sending a file with a blacklist via a POST request

Note: wordlist_form method (that returns the word list input form) is related to this method.
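The difference between the two functions can be sketched with a small helper that assembles the attributes. The helper name wordlist_params is hypothetical, and using 1 as the non-empty keywords value is an assumption; the text above only says that any non-empty value selects the “keywords” function.

```python
def wordlist_params(corpname, wlattr, ref_corpname=None, **extra):
    """Build attributes for a wordlist call (hypothetical helper)."""
    params = {'corpname': corpname, 'wlattr': wlattr, 'format': 'json'}
    if ref_corpname is not None:
        # A non-empty keywords value selects the "keywords" function,
        # which requires a reference corpus.
        params['keywords'] = 1
        params['ref_corpname'] = ref_corpname
    params.update(extra)
    return params

# Plain word list: frequency-sorted words occurring at least 10 times.
word_list = wordlist_params('bnc2', 'word', wlminfreq=10, wlsort='f')

# Keywords: a user corpus compared against a reference corpus.
keywords = wordlist_params('user/john/mycorpus', 'lemma', ref_corpname='bnc2')

print(word_list)
print(keywords)
```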

attr_vals

Parameter Type Default Description
avattr string REQUIRED structure attribute
avpat string empty substring to be searched in RE ‘.*avpat.*’; empty means search for any pattern
avmaxitems integer 0 maximum items to be returned
avfrom integer 0 start from nth item

Example query

run.cgi/attr_vals?corpname=bnc2;avpat=br;avattr=u.who

Example output

{
 "query": "br",
 "suggestions": ["PS4BR", "PS3BR", "PS2BR", "PS1BR"],
 "no_more_values": true
}

corpus_lpos

Returns list of lempos suffixes (LPOSLIST) for a corpus.

Example query

run.cgi/corpus_lpos?corpname=bnc2

Example output

{
 "lposlist": [
 ["noun", "-n"],
 ["verb", "-v"],
 ["adjective", "-j"]
 ]
}
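Because the lpos notation depends on the corpus, this output can be used to look up the suffix for a part of speech before calling other methods. A minimal sketch, working on the example output above:

```python
# The example output above, as a Python structure.
response = {'lposlist': [['noun', '-n'], ['verb', '-v'], ['adjective', '-j']]}

# Map the readable part-of-speech name to the corpus-specific suffix.
lpos_by_name = dict(response['lposlist'])
verb_lpos = lpos_by_name['verb']

print(verb_lpos)
```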

wsketch

Word sketch method for retrieving a survey of a word’s collocational behaviour.

Parameter Type Default Description
lemma string REQUIRED lemma, basic wordform
lpos string auto part of speech in notation ‘-n’, ‘-v’, … but the particular notation depends on a corpus. If the corpus contains “lempos” attribute and lpos attribute is omitted, it is automatically replaced by the most frequent lpos for the specified lemma. Otherwise, it has no effect.
sort_gramrels boolean (integer) 1 sort grammatical relations
minfreq integer, auto auto minimum frequency of a collocate. ‘auto’ is a function of corpus size
minscore float 0.0 minimum salience of a collocate
maxitems integer 25 maximum number of items in a grammatical relation
clustercolls integer 0 cluster collocations
minsim float 0.15 minimum similarity between clustered items, relevant only when clustercolls is set to 1

Note: wsketch_form method (that returns the word sketch input form) is related to this method.

Example script

#!/usr/bin/env python3

import time
import requests

base_url = 'https://api.sketchengine.co.uk/bonito/run.cgi'

data = {
    'corpname': 'bnc2',
    'format': 'json',
    'lpos': '-v',
    # get your API key here: https://the.sketchengine.co.uk/auth/api_access/
    'username': '',
    'api_key': ''
}

for item in ['make', 'ensure']:
    data['lemma'] = item
    d = requests.get(base_url + '/wsketch', params=data).json()
    print('Word sketch data for', item)
    # show the first three grammatical relations and their first three collocates
    for g in d['Gramrels'][:3]:
        print('    ' + g['name'])
        for i in g['Words'][:3]:
            print('        ' + i['word'])
    # beware of FUP, see https://www.sketchengine.co.uk/service-level-agreement/
    time.sleep(5)

Example output

Word sketch data for make
    subject
        decision
        company
        God
    object
        decision
        sense
        use
    usage patterns
        np_pp
        passive
        Sfin
Word sketch data for ensure
    subject
        arbitrage
        draftsman
        tenant
    object
        compliance
        survival
        continuity
    usage patterns
        Sfin
        np_pp
        passive

thes

Thesaurus list.

Parameter Type Default Description
lemma string REQUIRED
lpos see wsketch
maxthesitems integer 60 maximum number of items
clusteritems integer (boolean) 0 see wsketch
minsim see wsketch

Note: thes_form method (that returns the thesaurus input form) is related to this method.

wsdiff

This method provides Sketch difference.

  • lemma – first lemma. This attribute is required.
  • lemma2 – second lemma. This attribute is required.
  • lpos – part of speech in notation ‘-n’, ‘-v’, … (but the particular notation depends on corpus). If the corpus contains the “lempos” attribute, it is required, else it has no effect.
  • sort_gramrels – “sort grammatical relation” mark. Values ‘0’, ‘1’ (default)
  • separate_blocks – “separate blocks” mark. ‘1’ (default) = “common/exclusive blocks”, ‘0’ = “all in 1 block”
  • minfreq – minimum frequency in corpus. Default is ‘auto’ that is a function of corpus size. Other possible values are natural numbers.
  • maxcommon – maximum number of items in a grammatical relation of the common block (default 25)
  • maxexclusive – maximum number of items in a grammatical relation of the exclusive block

Note: wsdiff_form method (that returns the Sketch-Diff input form) is related to this method.

view

This method provides access to concordance lines and all possibilities of sorting, sample selecting and filtering of them. It operates in two modes:

  1. asynchronous (default) – the computation is started in background and the request returns immediately or as soon as the required number of concordance lines is available. The required number of concordance lines is the product of the fromp and pagesize attributes (see below).
  2. synchronous – the request does not return until the whole concordance is computed. To enable this, pass async=0.

The central attribute is q, which contains a list of search queries that are processed incrementally. A list of queries can be passed through the CGI interface as ‘q=item1;q=item2…’; another possibility is to use the JSON interchange format, see the following sections. The first query specifies the basic search query; the following ones specify sorting and filtering options. Because constructing a query is not trivial, it is described here in detail. The content of each q item is a string of the following structure:

<query_sign><query>

where <query_sign> specifies the type of query and is one character from the set {‘q’, ‘a’, ‘r’, ‘s’, ‘n’, ‘p’, ‘w’} (‘q’, ‘a’ and ‘w’ queries can be used as the basic search query, the others behave as filters). The rest of the query, <query>, depends on the <query_sign>, as follows.

Basic search queries:

  • q – is followed by a common CQL query with all its possibilities. Examples:
q[lemma="drug"]
q[lemma="drug"][lemma="test"] within <s>
q[lemma="drug"] [lemma="test"] within <s>

(there is no difference between the last two examples, they just demonstrate that spaces can be used within the CQL query but they are not required)

  • a – the same as q, but it is possible to specify the default attribute. Syntax and example:
a<default_attribute>,<CQL_query>
---------------------------------
alemma,"drug" [tag="N.*"]
  • w – query from Word Sketch. This is used in links from word sketch tables to concordances. The ‘w’ character is followed by a numeric ID that specifies the lines matching a particular word sketch relation. The ID can be taken from the ‘seek’ field in the Word Sketch JSON output (see the next sections). Several comma-delimited IDs can be specified; in that case, the result is their union. Example:
w4816743
w,4816743,4816826
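A minimal sketch of assembling basic search queries and passing a list of q items, assuming Python's urllib.parse for the encoding. The helper names are hypothetical, the second q item (r250, a random sample) is one of the options described in the next list, and urlencode joins parameters with ‘&’ rather than the ‘;’ shown above.

```python
from urllib.parse import urlencode

def cql_query(cql):
    """'q' query: a plain CQL expression."""
    return 'q' + cql

def default_attr_query(attribute, cql):
    """'a' query: CQL evaluated with the given default attribute."""
    return 'a{},{}'.format(attribute, cql)

params = {
    'corpname': 'bnc2',
    'format': 'json',
    # The first item is the basic search query; later items refine it.
    'q': [cql_query('[lemma="drug"][lemma="test"] within <s>'), 'r250'],
}

# doseq=True repeats the q key once per list item and escapes each value.
query_string = urlencode(params, doseq=True)

print(default_attr_query('lemma', '"drug" [tag="N.*"]'))
print(query_string)
```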

Sorting and filtering options:

  • r – selecting a random sample from the concordance. The ‘r’ character is followed by a natural number or percentage that specifies the size (number of lines) of the sample. Examples:
r250
r20%
  • s – sorting the concordance. Syntax:
s<attribute>/<marks><space><sort_range>
s<attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
s<attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
  or
s*<number>

The first three patterns correspond to the sorting options available under the “Sort” menu in the web interface. As patterns 2 and 3 show, multilevel sorting is also available. The last pattern sorts according to GDEX (good dictionary examples) selection; <number> is a natural number meaning “number of lines to be sorted”. Note that the gdex_enabled option must be set to 1 (see below) to use this sorting.
Legend to the first three patterns:

  • <attribute> is the particular corpus attribute used. It can also be a structure attribute, e.g. ‘doc.id’ for sorting according to the document IDs.
  • <marks> can be ‘i’, ‘r’, ‘ir’ or empty (“”) which means “ignore case”, “reverse order”, both of them or none of them
  • <space> is the space character (‘ ’)
  • <sort_range> is either a position or a range.
  • Positions can be referenced as follows:
    • integer number – where 0 is the first token in KWIC, -1 the rightmost token in the left context etc.
    • 1:x – where x is one of the corpus structures (e.g. “doc” or “s” if the corpus has the particular markup). Its meaning is the first token in the structure, except when it is the right boundary of a range – then it is the last token in the structure. Also, other numbers can be used, e.g. -2:x, 3:x, etc. (-1 is the same as 1 with meaning “structure containing KWIC”)
    • a<0 – where ‘a’ stands for a position reference as described in the first two points with meaning “‘a’ positions before/after the first KWIC position” (so this is equivalent to ‘a’)
    • a>0 – where ‘a’ stands for the same position reference with meaning “positions before/after the last KWIC position”
    • in the previous two points, if ‘0’ is substituted with a natural number ‘k’, it means “before/after ‘k’-th collocation” instead of “before/after KWIC”. Collocations are special token groups in the context, that can be added using positive filters (see below)

    Ranges can be referenced as a~b where ‘a’, ‘b’ stand for token identifiers as above. Examples of positions and ranges:

    • -1<0 – rightmost token in the left context
    • 3>0 – third token in right context
    • 0>0 – last token in KWIC
    • 0<0 – first token in KWIC
    • 0<0~0>0 – range of KWIC
    • -1<0~1>0 – range of KWIC with one token from the left context and one from the right context
    • 1:s – first token in the sentence containing KWIC (or its first token)
    • 1:s>0 – first token in the sentence containing KWIC (or its last token)
    • 0<1 – first token of the first-added collocation

Examples:

s*100
sword/ 1>0~3>0
slemma/ 0<0~0>0
sword/i -1
sword/ 0 word/ir -1<0 tag/r -2<0
  • n – negative filter. Syntax:
n<position><space><position><space><selected_token><space><CQL_query>
  • where:
    • <position> stands for a position reference as explained in the “s” section
    • <space> is the space character
    • <selected_token> stands for “selected token”. Values ‘-1’ = last, ‘1’ = first
    • <CQL_query> stands for a query that – if found between the two specified positions – filters out the particular line of the concordance

Examples:

n-5 -1 -1 [lemma="drug"]
n-5 -1 -1 [lc="drug" & tag="J.*"]
  • p – positive filter; similar to the negative filter above. Syntax and example:
p<position><space><position><space><selected_token><space><CQL_query>
-----------------------
p-1 -1 -1 [word="drug"]
  • F – filtering the first occurrences of a query within a structure. Syntax and example:
F<structure>
-----------------------
Fbncdoc
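The sorting and filtering syntax above can be sketched with two small string builders. The helper names sort_query and filter_query are hypothetical; the generated strings follow the patterns given in this section, and the order of the q items is only an illustration.

```python
def sort_query(*levels):
    """Build an 's' item from (attribute, marks, sort_range) triples."""
    return 's' + ' '.join('{}/{} {}'.format(a, m, r) for a, m, r in levels)

def filter_query(sign, first, last, selected_token, cql):
    """Build an 'n' (negative) or 'p' (positive) filter item."""
    if sign not in ('n', 'p'):
        raise ValueError('sign must be "n" or "p"')
    return '{}{} {} {} {}'.format(sign, first, last, selected_token, cql)

# A q list: basic query, negative filter, random sample, multilevel sort.
q = [
    'q[lemma="drug"]',
    filter_query('n', -5, -1, -1, '[lemma="test"]'),
    'r20%',
    sort_query(('word', '', '0'), ('word', 'ir', '-1<0'), ('tag', 'r', '-2<0')),
]

for item in q:
    print(item)
```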

Other attributes of the “view” method:

  • async – if set to 1, the result is processed asynchronously, which means that you obtain an initial part of the result before the complete result is computed; by repeating the same call you may receive a bigger result; once the query is fully processed, you receive finished: 1 in the result. In the majority of cases it is recommended to turn it off (async=0); default 1
  • pagesize – size (number of lines) of the resulting concordance. Default 20
  • fromp – number of the page that is returned. Default 1
  • kwicleftctx – size of the left context in KWIC view. Can be expressed as:
    • <number> – number of tokens
    • <number># – number of characters (note that the ‘#’ character must be escaped in URLs), e.g. ‘40#’ (default value)
    • <number>:<tag> – structural context, e.g. ‘-1:s’ stands for left context of the whole sentence. In the left context, <number> should be negative
  • kwicrightctx – size of the right context, expressed in the same way. <number> should be positive in the case of structural notation
  • viewmode – “KWIC” / “sentence” view mode. Values: ‘kwic’ (default), ‘sen’
  • attrs – comma-delimited list of attributes that are returned for KWIC tokens. The set of available attributes depends on the corpus. Examples of values: ‘word’ (default), ‘word,lemma’, ‘lemma,tag,word’ etc.
  • ctxattrs – comma-delimited list of attributes that are returned for context tokens. Examples of values: ‘word’ (default), ‘word,lemma’, ‘lemma,tag,word’ etc.
  • structs – comma-delimited list of structure tags that are returned/applied. Default: ‘p,g’
  • gdex_enabled – enables the GDEX sorting. Values: 0 (default) = disabled, 1 = enabled
  • refs – comma-delimited list of items returned in the “references” field. Default is ‘#’ that stands for token number or value of option SHORTREF defined in the corpus configuration file. Other possible values are:
<attribute>
=<attribute>
  • where <attribute> is an attribute of one of the corpus structures, e.g. doc.id s.n … The first notation displays the information in name=value format, the second one returns only the value.
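A hedged sketch of working with the asynchronous mode: the same call is repeated until the result reports finished. The function names and the fetch callable are hypothetical (fetch could be a thin wrapper around requests.get that returns the decoded JSON); only the finished field and the fromp and pagesize attributes come from the description above.

```python
import time

def fetch_concordance(fetch, params, poll_seconds=5, max_polls=10):
    """Repeat the same view call until the result reports finished.

    fetch is any callable that takes the attribute dict and returns
    the decoded JSON result as a dict.
    """
    result = fetch(params)
    for _ in range(max_polls):
        if result.get('finished'):
            break
        # Pause between calls to avoid flooding the server.
        time.sleep(poll_seconds)
        result = fetch(params)
    return result

def required_lines(fromp=1, pagesize=20):
    """Lines that must be available before an asynchronous call returns."""
    return fromp * pagesize

print(required_lines(fromp=3, pagesize=20))
```

Passing async=0 instead avoids the polling loop entirely, at the cost of a longer wait for the single response.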

Note: first, reduce, filter, viewattrsx, mlsortx, sortx are methods that return the same output as the “view” method using attributes and values taken from forms provided by the methods first_form, reduce_form, viewattrs, sort. For example, the first method can take the attribute lemma and does not need the attribute q. However, they exist mainly to make work with the graphical interface more comfortable and are not universal. For this reason, they are not described here.

freqs

This method provides access to the frequency statistics.

Attributes:

  • q – query list, the same as for the “view” method. This attribute is required.
  • fcrit – object of the frequency query, i.e. “what do you want the frequency of?” (This attribute is required.) The syntax of its values is very similar to the sorting queries of the “view” method:
<attribute>/<marks><space><sort_range>
<attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
<attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
  • with all being the same as for the sorting options, except that <marks> can only be ‘i’ or empty (“”). Examples of possible values with explanation:
    • tag 0~0>0 – frequency of tags of all KWIC tokens
    • tag 0 – frequency of tags of first KWIC tokens
    • word/ 0 lemma/i -1<0 – (multilevel) frequency of first word in KWIC and last lemma in the left context (with ignored case on)

fcrit can also be a list; if so, the output contains multiple blocks.

  • flimit – frequency limit. Default 0
  • freq_sort – identifier of the column by which the output should be sorted: its number (counted from 0), ‘freq’ (default) for sorting by frequency, or ‘rel’ for sorting by the “Rel[%]” column (if displayed)
  • ml – specifies if the “Rel[%]” column will be displayed. ‘0’ (default) stands for yes, ‘1’ for no. (“ml” stands for “Multi-Level style”)
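A small sketch of building fcrit values and passing several of them as a list. The helper name fcrit and the chosen attribute values are illustrative assumptions; the string format follows the syntax given above.

```python
def fcrit(*levels):
    """Join (attribute, marks, sort_range) triples into one fcrit value."""
    for _attribute, marks, _sort_range in levels:
        if marks not in ('', 'i'):
            raise ValueError("marks can only be 'i' or empty for fcrit")
    return ' '.join('{}/{} {}'.format(a, m, r) for a, m, r in levels)

params = {
    'corpname': 'bnc2',
    'format': 'json',
    'q': ['q[lemma="drug"]'],
    # Two criteria, so the output is expected to contain two blocks.
    'fcrit': [
        fcrit(('word', '', '0'), ('lemma', 'i', '-1<0')),
        fcrit(('tag', '', '0~0>0')),
    ],
    'flimit': 5,
}

print(params['fcrit'])
```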

Note: freqml, freqtt are methods that return the same output as the “freqs” method using attributes and values taken from forms provided by the freq method. The situation is similar to that of the “view” method, so these methods are only mentioned here and not described in detail.

collx

This method provides collocation candidates computation.

Attributes:

  • q – query list, the same as for the “view” method. This attribute is required.
  • cattr – corpus attribute over which the computation is performed. Default is ‘word’
  • cfromw – search range – “from” – in token index (only integer numbers allowed). Default -5
  • ctow – search range – “to” – similar. Default 5
  • cminfreq – minimum frequency in corpus. Default 5
  • cminbgr – minimum frequency in given range. Default 3
  • cmaxitems – maximum number of displayed lines. Default 50
  • cbgrfns – list of displayed functions in the form: cbgrfns=f1;cbgrfns=f2;… Default [‘t’, ‘m’]
  • csortfn – function according to which the result is sorted. Default ‘f’.

Notation of the functions:

  • t – T-score
  • m – MI
  • 3 – MI3
  • l – log likelihood
  • s – min. sensitivity
  • c – salience
  • f – frequency

Note: coll method (that returns the collocation candidates input form) is related to this method.
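The function codes can be checked before a request is sent. The following sketch assumes a hypothetical helper collx_params; the code table is copied from the notation list above, and the chosen attribute values are only an illustration.

```python
# Function codes from the notation list above.
FUNCTION_NAMES = {
    't': 'T-score', 'm': 'MI', '3': 'MI3', 'l': 'log likelihood',
    's': 'min. sensitivity', 'c': 'salience', 'f': 'frequency',
}

def collx_params(q, cbgrfns=('t', 'm'), csortfn='f', **extra):
    """Assemble collx attributes, rejecting unknown function codes."""
    for code in list(cbgrfns) + [csortfn]:
        if code not in FUNCTION_NAMES:
            raise ValueError('unknown function code: {!r}'.format(code))
    params = {'q': list(q), 'cbgrfns': list(cbgrfns), 'csortfn': csortfn}
    params.update(extra)
    return params

params = collx_params(['q[lemma="drug"]'], cbgrfns=['l', '3'], csortfn='l',
                      corpname='bnc2', format='json', cfromw=-3, ctow=3)

# Readable names of the requested columns.
columns = [FUNCTION_NAMES[code] for code in params['cbgrfns']]
print(columns)
```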

save* methods

This group of methods includes: savecoll, saveconc, savefreq, savethes, savewl, savews. These methods provide plain text or XML output of the system, i.e. of the methods collx, view, freqs, thes, wordlist, wsketch. Each of the save* methods takes the same attributes as its “mother” method. The common attributes of the save* methods are as follows:

  • saveformat – specifies the format of the output. Values: ‘text’ (default), ‘xml’
  • heading – specifies if a simple heading (corpus name, query etc.) will be included. Values: ‘1’, ‘0’ (default)

The saveconc method is associated with a few more attributes:

  • pages – indicates if the whole concordance will be saved (value ‘0’, default), or particular page only (value ‘1’)
  • numbering – indicates if the concordance lines will be numbered. Values ‘1’, ‘0’ (default)
  • align_kwic – indicates if a simple alignment method of KWIC tokens will be used. Values ‘1’, ‘0’ (default). Relevant only in combination with text output
  • maxsavelines – maximum number of saved lines. Default 1000
  • leftctx, rightctx – use these instead of the kwicleftctx, kwicrightctx attributes of the view method

Note: savecoll_form, saveconc_form, savefreq_form, savethes_form, savewl_form, savews_form methods (that return the particular forms) are related to these methods.
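Since each save* method takes the attributes of its “mother” method plus the common ones, the relationship can be sketched as a lookup table and a small helper. The names MOTHER_METHOD and save_params are hypothetical; the pairing follows the order in which the methods are listed above.

```python
# Each save* method and the method whose attributes it shares.
MOTHER_METHOD = {
    'savecoll': 'collx', 'saveconc': 'view', 'savefreq': 'freqs',
    'savethes': 'thes', 'savewl': 'wordlist', 'savews': 'wsketch',
}

def save_params(mother_params, saveformat='text', heading=0):
    """Extend the mother method's attributes with the common save* ones."""
    if saveformat not in ('text', 'xml'):
        raise ValueError("saveformat must be 'text' or 'xml'")
    params = dict(mother_params)
    params.update({'saveformat': saveformat, 'heading': heading})
    return params

view_params = {'corpname': 'bnc2', 'q': ['q[lemma="drug"]']}

# saveconc reuses the view attributes and adds its own extras.
saveconc_params = save_params(view_params, saveformat='xml', heading=1)
saveconc_params.update({'numbering': 1, 'maxsavelines': 500})

print(MOTHER_METHOD['saveconc'], saveconc_params)
```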

subcorp

This method performs creation and deletion of subcorpora.

Attributes:

  • subcname – name of the new subcorpus (or subcorpus being deleted respectively). Default None (no operation with subcorpora).
  • delete – if not empty, the subcorpus is deleted instead of created. Default is empty
  • corpus structural attributes and their values can be used here as attributes and values of the method. The selected values define the span of the subcorpus.

Note: subcorp_form method (that returns the subcorpus input form) is related to this method.