
Methods and specific attributes

This section lists all methods and describes the attributes specific to each of them. The “universal attributes” (those that can be used with any method) are described at the end of this page. Note that some characters (e.g. the space) that may occur in attribute values must be escaped.

For more information on escaping, see e.g. http://en.wikipedia.org/wiki/Percent-encoding. The output of these methods is described on the JSON API documentation page.
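As an illustration of the escaping mentioned above, the following minimal sketch percent-encodes a CQL query with Python's standard urllib.parse.quote. The choice of Python and the variable names are assumptions made for this example only; any percent-encoding routine should give an equivalent result.

```python
from urllib.parse import quote

# A CQL query item containing spaces, quotes, brackets and angle brackets.
raw_query = 'q[lemma="drug"][lemma="test"] within <s>'

# safe="" asks quote() to escape every reserved character, including "/".
encoded = quote(raw_query, safe="")
print(encoded)
```

Letters, digits and the characters "-", ".", "_" and "~" are left untouched; everything else is replaced by a "%XX" sequence.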

wordlist

This method provides the “word list” and “keywords” functions that are available under the “Word List” link in the web interface.

Attributes:

  • keywords – if empty, “word list” is returned, else “keywords” function is used
  • wlattr – corpus attribute that you want to work with. This attribute is required.
  • wlnums – defines the type of frequency figures. Possible values are frq, docf and arf (word frequency, document frequency and ARF, respectively)
  • wlminfreq – minimum frequency in corpus (default 5)
  • wlmaxitems – maximum number of displayed lines (default 100)
  • wlpat – regular expression that specifies the word list pattern (default ‘.*’ – all words). Relevant only in combination with “word list” function.
  • wlicase – “ignore case” mark. Values ‘1’, ‘0’ (default). Relevant only in combination with “word list” function.
  • wlsort – if ‘f’, resulting word list is sorted according to frequency. Else alphabetically according to attribute (default). Relevant only in combination with “word list” function.
  • ref_corpname – corpus name (in the short form, e.g. ‘bnc’) of the reference corpus. Relevant only in combination with “keywords” function. In this case, it is required.
  • ref_usesubcorp – reference subcorpus name. Relevant only in combination with “keywords” function.
  • wlfile – a file containing a whitelist, sent via a POST request
  • wlblacklist – a file containing a blacklist, sent via a POST request

Note: the related wordlist_form method returns the word list input form.
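The rules above (wlattr is always required, ref_corpname is required only with the “keywords” function) can be sketched as a small helper that assembles the attributes before they are sent. The helper name wordlist_params is hypothetical, and joining attributes with “&” is an assumption based on ordinary CGI conventions rather than something stated on this page.

```python
from urllib.parse import quote, urlencode


def wordlist_params(corpname, wlattr, keywords="", ref_corpname=None, **extra):
    """Collect attributes for the wordlist method and check the required ones."""
    if not wlattr:
        raise ValueError("wlattr is required")
    params = {"corpname": corpname, "wlattr": wlattr}
    if keywords:
        # A non-empty 'keywords' value switches to the "keywords" function,
        # which needs a reference corpus.
        if not ref_corpname:
            raise ValueError("ref_corpname is required with keywords")
        params["keywords"] = keywords
        params["ref_corpname"] = ref_corpname
    params.update(extra)
    return params


params = wordlist_params("bnc", "lemma", wlpat="test.*", wlsort="f")
# quote_via=quote percent-encodes spaces as %20 instead of "+".
query_string = urlencode(params, quote_via=quote)
print(query_string)
```

Optional attributes such as wlminfreq or wlmaxitems can be passed as extra keyword arguments in the same way as wlpat and wlsort here.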

wsketch

This method returns the word sketch tables.

Attributes:

  • lemma – lemma. This attribute is required.
  • lpos – part of speech in notation ‘-n’, ‘-v’, … (but the particular notation depends on a corpus). If the corpus contains the “lempos” attribute and lpos attribute is omitted, it is automatically replaced by the most frequent lpos for the specified lemma. Otherwise, it has no effect.
  • sort_gramrels – “sort grammatical relation” mark. Values ‘0’, ‘1’ (default).
  • minfreq – minimum frequency in the corpus. The default is ‘auto’, which is computed as a function of the corpus size. Other possible values are natural numbers.
  • minscore – minimum salience. Default 0.0.
  • maxitems – maximum number of items in a grammatical relation. Default 25.
  • clustercolls – “cluster collocations” mark. Values ‘1’, ‘0’ (default)
  • minsim – minimum similarity between cluster items. Default 0.15. Relevant only when “clustercolls” is set to ‘1’

Note: the related wsketch_form method returns the word sketch input form.

thes

This method returns the thesaurus list.

Attributes:

  • lemma – lemma. This attribute is required.
  • lpos – the same attribute as for the “wsketch” method.
  • maxthesitems – the maximum number of items. Default 60.
  • clusteritems – “cluster items” mark. Values ‘1’, ‘0’ (default)
  • minsim – the minimum similarity between cluster items. Default 0.15. Relevant only when “clusteritems” is set to ‘1’

Note: the related thes_form method returns the thesaurus input form.

wsdiff

This method provides “Sketch-Diff” tables.

Attributes:

  • lemma – first lemma. This attribute is required.
  • lemma2 – second lemma. This attribute is required.
  • lpos – part of speech in notation ‘-n’, ‘-v’, … (but the particular notation depends on corpus). If the corpus contains the “lempos” attribute, it is required, else it has no effect.
  • sort_gramrels – “sort grammatical relation” mark. Values ‘0’, ‘1’ (default)
  • separate_blocks – “separate blocks” mark. ‘1’ (default) = “common/exclusive blocks”, ‘0’ = “all in 1 block”
  • minfreq – minimum frequency in the corpus. The default is ‘auto’, which is computed as a function of the corpus size. Other possible values are natural numbers.
  • maxcommon – maximum number of items in a grammatical relation of the common block (default 25)
  • maxexclusive – maximum number of items in a grammatical relation of the exclusive block

Note: the related wsdiff_form method returns the Sketch-Diff input form.

view

This method provides access to concordance lines, together with all options for sorting, sampling and filtering them.

Queries are now processed asynchronously, so for some (complex) queries you may get an incomplete result that grows in size when you repeat the same query. You can disable this feature with the async attribute (see below).

The basic attribute is q, which contains a list of search queries that are processed incrementally. A list of queries can be passed through the CGI interface as ‘q=item1;q=item2…’; another possibility is to use the JSON interchange format, see the following sections. The first query specifies the basic search query; the following ones specify sorting and filtering options. Constructing a query is not trivial, so it is described here in detail. The content of the q attribute is a string with the following structure:

<query_sign><query>

where <query_sign> specifies the type of query and it is one char from the set {‘q’, ‘a’, ‘r’, ‘s’, ‘n’, ‘p’, ‘w’} (‘q’, ‘a’ and ‘w’ queries can be used as the basic search query, the others behave as filters). The rest of the query depends on the <query_sign>, as follows.

Basic search queries:

  • q – is followed by a common CQL query with all its possibilities. Examples:
q[lemma="drug"]
q[lemma="drug"][lemma="test"] within <s>
q[lemma="drug"][lemma="test"]within<s>

(there is no difference between the last two examples, they just demonstrate that spaces can be used within the CQL query but they are not required)

  • a – the same as q, but the default attribute can be specified. Syntax and example:
    a<default_attribute>,<CQL_query>
    ---------------------------------
    alemma,"drug" [tag="N.*"]
    
  • w – query from Word Sketch. This is used in links from word sketch tables to concordances. The ‘w’ character is followed by a numeric ID that specifies the lines matching a particular word sketch relation. The ID can be taken from the ‘seek’ field in the Word Sketch JSON output (see the next sections). Several comma-delimited IDs can be specified; the result is then their union. Example:
    w4816743
    w,4816743,4816826
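The ‘q’ and ‘a’ query items described above are plain strings, so they can be assembled with ordinary string operations and then joined into the ‘q=item1;q=item2…’ form. The helper names below (cql_item, default_attr_item, encode_q_list) are hypothetical and exist only for this sketch.

```python
from urllib.parse import quote


def cql_item(cql):
    """Basic search item: the 'q' sign followed by a CQL query."""
    return "q" + cql


def default_attr_item(attribute, cql):
    """Basic search item with a default attribute: a<attribute>,<CQL_query>."""
    return f"a{attribute},{cql}"


def encode_q_list(items):
    """Join items as q=item1;q=item2..., percent-encoding each item."""
    return ";".join("q=" + quote(item, safe="") for item in items)


item = default_attr_item("lemma", '"drug" [tag="N.*"]')
print(item)
print(encode_q_list([cql_item('[lemma="drug"]')]))
```

The first item in the list is the basic search query; any further items are applied to its result in order.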
    

Sorting and filtering options:

  • r – selecting a random sample from the concordance. The ‘r’ character is followed by a natural number or percentage that specifies the size (number of lines) of the sample. Examples:
    r250
    r20%
    
  • s – sorting the concordance. Syntax:
    s<attribute>/<marks><space><sort_range>
    s<attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
    s<attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
      or
    s*<number>
    
    

    The first three patterns correspond to the sorting options available under the “Sort” menu in the web interface. As patterns 2 and 3 show, multilevel sorting is also available. The last pattern indicates sorting according to the GDEX (good dictionary examples) selection; <number> is a natural number meaning “number of lines to be sorted”. Note that the gdex_enabled option must be set to 1 (see below) to use this sorting.
    Legend to the first three patterns:

    • <attribute> is the particular corpus attribute used. It can also be a structure attribute, e.g. ‘doc.id’ for sorting according to the document IDs.
    • <marks> can be ‘i’, ‘r’, ‘ir’ or empty (“”) which means “ignore case”, “reverse order”, both of them or none of them
    • <space> is the space character (‘ ‘)
    • <sort_range> is either a position or a range.
    • Positions can be referenced as follows:
      • integer number – where 0 is the first token in KWIC, -1 the rightmost token in the left context etc.
      • 1:x – where x is one of the corpus structures (e.g. “doc” or “s” if the corpus has the particular markup). Its meaning is the first token in the structure, except when it is the right boundary of a range – then it is the last token in the structure. Also, other numbers can be used, e.g. -2:x, 3:x, etc. (-1 is the same as 1 with meaning “structure containing KWIC”)
      • a<0 – where ‘a’ stands for a position reference as described in the first two points with meaning “‘a’ positions before/after the first KWIC position” (so this is equivalent to ‘a’)
      • a>0 – where ‘a’ stands for the same position reference with meaning “positions before/after the last KWIC position”
      • in the previous two points, if ‘0’ is substituted with a natural number ‘k’, it means “before/after ‘k’-th collocation” instead of “before/after KWIC”. Collocations are special token groups in the context, that can be added using positive filters (see below)

      Ranges can be referenced as a~b where ‘a’, ‘b’ stand for token identifiers as above. Examples of positions and ranges:

      • -1<0 – rightmost token in the left context
      • 3>0 – third token in right context
      • 0>0 – last token in KWIC
      • 0<0 – first token in KWIC
      • 0<0~0>0 – range of KWIC
      • -1<0~1>0 – range of KWIC with one token from the left context and one from the right context
      • 1:s – first token in the sentence containing KWIC (or its first token)
      • 1:s>0 – first token in the sentence containing KWIC (or its last token)
      • 0<1 – first token of the first-added collocation

Examples:

s*100
sword/ 1>0~3>0
slemma/ 0<0~0>0
sword/i -1
sword/ 0 word/ir -1<0 tag/r -2<0
  • n – negative filter. Syntax:
    n<position><space><position><space><selected_token><space><CQL_query>
    

    where:

    • <position> stands for position reference as explained in the “s” section
    • <space> is the space character
    • <selected_token> stands for “selected token”. Values ‘-1’ = last, ‘1’ = first
    • <CQL_query> stands for a query that – if found between the two specified positions – filters out the particular line of the concordance

Examples:

n-5 -1 -1 [lemma="drug"]
n-5 -1 -1 [lc="drug" & tag="J.*"]
  • p – positive filter; similar to the negative filter above. Syntax and example:
    p<position><space><position><space><selected_token><space><CQL_query>
    -----------------------
    p-1 -1 -1 [word="drug"]
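The sorting and filtering items above follow fixed patterns, which the following sketch turns into small builder functions. The function names are hypothetical; the generated strings reproduce examples given in this section.

```python
def sort_item(*levels):
    """Build s<attribute>/<marks> <sort_range>, with one to three levels.

    Each level is an (attribute, marks, sort_range) tuple; marks may be
    "", "i", "r" or "ir".
    """
    if not 1 <= len(levels) <= 3:
        raise ValueError("between one and three sort levels are expected")
    for _, marks, _ in levels:
        if marks not in ("", "i", "r", "ir"):
            raise ValueError(f"unknown marks: {marks!r}")
    return "s" + " ".join(f"{attr}/{marks} {rng}" for attr, marks, rng in levels)


def gdex_sort_item(number_of_lines):
    """Build s*<number> (GDEX sorting; requires gdex_enabled=1)."""
    return f"s*{number_of_lines}"


def filter_item(positive, first_pos, second_pos, selected_token, cql):
    """Build a positive ('p') or negative ('n') filter item."""
    sign = "p" if positive else "n"
    return f"{sign}{first_pos} {second_pos} {selected_token} {cql}"


print(sort_item(("word", "", "0"), ("word", "ir", "-1<0"), ("tag", "r", "-2<0")))
print(gdex_sort_item(100))
print(filter_item(False, -5, -1, -1, '[lemma="drug"]'))
print(filter_item(True, -1, -1, -1, '[word="drug"]'))
```

Each returned string is one item of the q list and still has to be percent-encoded before it is placed in a URL.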
    

Other attributes of the “view” method:

  • async – if set to 1, the query is processed asynchronously: you obtain an initial part of the result before the complete result is computed. By repeating the same call you may receive a bigger result; once the query is fully processed, you receive finished: 1 in the result. In most cases it is recommended to turn this off (async=0). Default 1
  • pagesize – size (number of lines) of the resulting concordance. Default 20
  • fromp – number of the page that is returned. Default 1
  • kwicleftctx – size of the left context in KWIC view. Can be expressed as:
    • <number> – number of tokens
    • <number># – number of characters (note that the ‘#’ character must be escaped in URLs), e.g. ‘40#’ (default value)
    • <structure_number>:<tag> – structural context, e.g. ‘-1:s’ stands for left context of the whole sentence. In the left context, <structure_number> should be negative
  • kwicrightctx – size of the right context, similar. <structure_number> should be positive in the case of structural notation
  • viewmode – “KWIC” / “sentence” view mode. Values: ‘kwic’ (default), ‘sen’
  • attrs – comma-delimited list of attributes that are returned for KWIC tokens. The set of available attributes depends on the corpus. Examples of values: ‘word’ (default), ‘word,lemma’, ‘lemma,tag,word’ etc.
  • ctxattrs – comma-delimited list of attributes that are returned for context tokens. Examples of values: ‘word’ (default), ‘word,lemma’, ‘lemma,tag,word’ etc.
  • structs – comma-delimited list of structure tags that are returned/applied. Default: ‘p,g’
  • gdex_enabled – enables the GDEX sorting. Values: 0 (default) = disabled, 1 = enabled
  • refs – comma-delimited list of items returned in the “references” field. Default is ‘#’ that stands for token number or value of option SHORTREF defined in the corpus configuration file. Other possible values are:
    <attribute>
    =<attribute>
    

    where <attribute> is an attribute of one of the corpus structures, e.g. doc.id , s.n … The first notation displays the information in name=value format, the second one returns only the value.

Note: first, reduce, filter, viewattrsx, mlsortx and sortx are methods that return the same output as the “view” method, using attributes and values taken from the forms provided by the methods first_form, reduce_form, viewattrs and sort. For example, the first method can take the attribute lemma and does not need the attribute q. However, these methods exist mainly to make work with the graphical interface more comfortable and are not universal, so they are not described here.
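When async is left at 1, the same call has to be repeated until the result contains finished: 1. A minimal sketch of such a loop follows. The fetch argument is a hypothetical zero-argument callable that performs the request and returns the decoded JSON result as a dict; the “lines” key in the stand-in data is made up for the demonstration, and only “finished” comes from the description above.

```python
import time


def fetch_until_finished(fetch, max_attempts=10, delay=1.0):
    """Repeat the same call until the result reports finished == 1.

    Returns the last result received, which may still be partial if
    max_attempts is reached first.
    """
    result = {}
    for attempt in range(max_attempts):
        result = fetch()
        if result.get("finished") == 1:
            break
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return result


# Stand-in for a real request: the result grows over three calls.
_responses = iter([
    {"finished": 0, "lines": ["a"]},
    {"finished": 0, "lines": ["a", "b"]},
    {"finished": 1, "lines": ["a", "b", "c"]},
])
final = fetch_until_finished(lambda: next(_responses), delay=0)
print(final["finished"], len(final["lines"]))
```

Setting async=0 avoids the loop entirely, at the cost of waiting for the complete result in a single call.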

freqs

This method provides access to the frequency statistics.

Attributes:

  • q – query list, the same as for the “view” method. This attribute is required.
  • fcrit – the object of the frequency query, i.e. “the frequency of what are you looking for?” This attribute is required. The syntax of its values is very similar to the sorting queries of the “view” method:
    <attribute>/<marks><space><sort_range>
    <attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
    <attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range><space><attribute>/<marks><space><sort_range>
    
    

    where everything has the same meaning as in the sorting options, except that <marks> can only be ‘i’ or empty (“”). Examples of possible values with explanation:

    • tag 0~0>0 – frequency of tags of all KWIC tokens
    • tag 0 – frequency of tags of first KWIC tokens
    • word/ 0 lemma/i -1<0 – (multilevel) frequency of first word in KWIC and last lemma in the left context (with ignored case on)

fcrit can also be a list; in that case, the output contains several blocks.

  • flimit – frequency limit. Default 0
  • freq_sort – identifier of the column by which the output is sorted: either the column number (counted from 0), ‘freq’ (default) for sorting by frequency, or ‘rel’ for sorting by the “Rel[%]” column (if displayed)
  • ml – specifies if the “Rel[%]” column will be displayed. ‘0’ (default) stands for yes, ‘1’ for no. (“ml” stands for “Multi-Level style”)

Note: freqml and freqtt are methods that return the same output as the “freqs” method, using attributes and values taken from the forms provided by the method freq. The situation is similar to that of the “view” method, so these methods are only mentioned here and not described in detail.
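The fcrit value described above can be assembled in the same way as a sorting query, with the extra restriction that <marks> may only be ‘i’ or empty. The sketch below uses a hypothetical helper name, fcrit_value, and always writes the “/” separator shown in the syntax patterns.

```python
def fcrit_value(*levels):
    """Build <attribute>/<marks> <sort_range> for one to three levels.

    Each level is an (attribute, marks, sort_range) tuple; marks must be
    "" or "i" for frequency queries.
    """
    if not 1 <= len(levels) <= 3:
        raise ValueError("between one and three levels are expected")
    parts = []
    for attribute, marks, sort_range in levels:
        if marks not in ("", "i"):
            raise ValueError(f"marks must be 'i' or empty, got {marks!r}")
        parts.append(f"{attribute}/{marks} {sort_range}")
    return " ".join(parts)


# Multilevel: first word in KWIC, then last lemma in the left context.
multilevel = fcrit_value(("word", "", "0"), ("lemma", "i", "-1<0"))
print(multilevel)
```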

collx

This method computes collocation candidates.

Attributes:

  • q – query list, the same as for the “view” method. This attribute is required.
  • cattr – corpus attribute over which the computation is performed. Default is ‘word’
  • cfromw – search range – “from” – in token index (only integer numbers allowed). Default -5
  • ctow – search range – “to” – similar. Default 5
  • cminfreq – minimum frequency in corpus. Default 5
  • cminbgr – minimum frequency in given range. Default 3
  • cmaxitems – maximum number of displayed lines. Default 50
  • cbgrfns – list of displayed functions in the form: cbgrfns=f1;cbgrfns=f2;… Default [‘t’, ‘m’]
  • csortfn – function according to which the result is sorted. Default ‘f’.

Notation of the functions:

  • t – T-score
  • m – MI
  • 3 – MI3
  • l – log likelihood
  • s – min. sensitivity
  • c – salience
  • f – frequency

Note: the related coll method returns the collocation candidates input form.
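The cbgrfns attribute is repeated once per function, using the one-character codes listed above. The following sketch checks the codes and produces the ‘cbgrfns=f1;cbgrfns=f2;…’ form; the names COLL_FUNCTIONS and cbgrfns_params are hypothetical.

```python
# One-character function codes as listed in "Notation of the functions".
COLL_FUNCTIONS = {
    "t": "T-score",
    "m": "MI",
    "3": "MI3",
    "l": "log likelihood",
    "s": "min. sensitivity",
    "c": "salience",
    "f": "frequency",
}


def cbgrfns_params(codes):
    """Return cbgrfns=f1;cbgrfns=f2;... for the given function codes."""
    unknown = [code for code in codes if code not in COLL_FUNCTIONS]
    if unknown:
        raise ValueError(f"unknown function codes: {unknown}")
    return ";".join(f"cbgrfns={code}" for code in codes)


print(cbgrfns_params(["t", "m", "3"]))
```

The same table can be used to validate a csortfn value before sending the request.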

save* methods

This group of methods includes savecoll, saveconc, savefreq, savethes, savewl and savews. They provide plain text or XML output of the functions collx, view, freqs, thes, wordlist and wsketch, respectively. Each save* method takes the same attributes as its parent method. The attributes common to all save* methods are as follows:

  • saveformat – specifies the format of the output. Values: ‘text’ (default), ‘xml’
  • heading – specifies if a simple heading (corpus name, query etc.) will be included. Values: ‘1’, ‘0’ (default)

The saveconc method takes a few more attributes:

  • pages – indicates whether the whole concordance is saved (value ‘0’, default) or only a particular page (value ‘1’)
  • numbering – indicates if the concordance lines will be numbered. Values ‘1’, ‘0’ (default)
  • align_kwic – indicates if a simple alignment method of KWIC tokens will be used. Values ‘1’, ‘0’ (default). Relevant only in combination with text output
  • maxsavelines – maximum number of saved lines. Default 1000
  • leftctx, rightctx – use these instead of kwicleftctx, kwicrightctx attributes of the view method

Note: the related savecoll_form, saveconc_form, savefreq_form, savethes_form, savewl_form and savews_form methods return the corresponding input forms.

subcorp

This method creates and deletes subcorpora.

Attributes:

  • subcname – name of the new subcorpus (or subcorpus being deleted respectively). Default None (no operation with subcorpora).
  • delete – if not empty, the subcorpus is deleted instead of created. The default is empty
  • corpus structural attributes and their values can be used here as attributes and values of the method. The selected values define the span of the subcorpus.

Note: the related subcorp_form method returns the subcorpus input form.

Universal attributes

A few attributes can be used with any method:

  • corpname – corpus name (in the short form, e.g. ‘bnc’). This attribute specifies the corpus that will be processed and is required by all methods. To query your own corpus (e.g. username user3048, corpus mylittlecorpus), use corpname="user/user3048/mylittlecorpus".
  • usesubcorp – name of the subcorpus that will be processed. The default is empty (“”), which means working with the entire corpus
  • format – the format of the output. The default is empty, which is interpreted as HTML. The only other value supported so far is ‘json’, which produces output in the JSON format (see below). The ‘json’ option is currently not supported by the “wsdiff” method.
  • json – all input attributes encoded as a string in JSON syntax
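For the json attribute, the input attributes are first serialised to a JSON string, and that string is then percent-encoded like any other attribute value. The sketch below uses Python's standard json and urllib.parse modules. Representing the q list as a JSON array is an assumption made for this example; this page only states that the attributes are encoded “as a string in JSON syntax”.

```python
import json
from urllib.parse import quote

# Input attributes for a "view" call; q is given as a list of query items
# (assumed representation, see the note above).
attributes = {
    "corpname": "bnc",
    "q": ['q[lemma="drug"]', "r250"],
    "pagesize": 50,
}

json_text = json.dumps(attributes)
query_string = "json=" + quote(json_text, safe="")
print(query_string)
```

Encoding the whole JSON string in one step avoids having to escape each attribute value separately.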
