Corpus configuration options

NAME

name of the corpus; defaults to the corpus config filename

ENCODING

corpus encoding

LANGUAGE

language name – it should be capitalised and one of the allowed names, otherwise the system will not be able to automatically detect the right locale and you may experience errors when sorting or regular expression matching of non-ASCII characters.

NOLETTERCASE

optional parameter that switches off case-sensitive matching for languages that do not distinguish upper and lower case, e.g. Arabic, Chinese, Japanese, Korean, Nepali, Telugu, Tamil, …

LOCALE

locale code of the language (and region) used. This value is used in query evaluation (regular expressions) and in sorting concordance lines. The default is the standard POSIX locale (“C”).
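
As a minimal sketch, the options above might appear at the top of a corpus configuration file as follows. The corpus name and the locale string are illustrative assumptions rather than values from a real installation; the quoting follows the other examples in this document.

NAME "Example English Corpus"
ENCODING "UTF-8"
LANGUAGE "English"
LOCALE "en_GB.UTF-8"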

RIGHTTOLEFT

indicates whether the language of the corpus is in the right-to-left script (e.g. Arabic)

ALIGNED

for parallel corpora only: comma-separated list of aligned corpora. All corpora should have a structure defined in ALIGNSTRUCT (“align” in Manatee < 2.67).

ALIGNSTRUCT

(added in manatee 2.67) for parallel corpora only: the name of the mapping structure, i.e. such a structure that is present in both parallel corpora and on which the alignment is performed. Defaults to “align”.

ALIGNDEF

(added in manatee 2.67) for parallel corpora only: comma-separated list of mapping definition files to aligned corpora.
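
A hedged sketch of the three alignment options in the configuration of one corpus of a parallel set. The corpus names and file paths are hypothetical, and it is an assumption that the files in ALIGNDEF are listed in the same order as the corpora in ALIGNED.

ALIGNED "example_de,example_fr"
ALIGNSTRUCT "align"
ALIGNDEF "/corpora/aligndef/example_en_de,/corpora/aligndef/example_en_fr"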

NEWVERSION

for old versions of corpora only: the name of the new version of the corpus

DEFAULTATTR

default attribute for CQL query evaluation. It is also used to map attribute alias “-” in the web API.

MAINTAINER

identification of the person responsible for maintaining the corpus

DOCSTRUCTURE

the structure that should be considered to be a document, defaults to “doc”.

NONWORDRE

a regular expression determining which tokens should not be considered words, defaults to [^[:alpha:]].* – therefore the default definition of a word is [[:alpha:]].*.

WSTRANSLATE

configuration of languages and corpora for bilingual word sketch using the “Translate button”. Syntax:

<delimiter><lang1><delimiter><corpus1><delimiter><lang2><delimiter><corpus2>...

(delimiter being a single character, as in WPOSLIST or LPOSLIST). Example:

WSTRANSLATE ",French,frtenten,German,detenten2_simplews,Polish,pltenten,Spanish,eseutenten11_freeling,Italian,ittenten"

Appropriate dictionaries named <lang1>-<lang2> are expected in pcdict_path (/corpora/pcdicts or as specified in run.cgi).

DIACHRONIC

sets the structure attribute containing a date used for computing trends. Values must be in a numeric format, e.g. 2016/01, with the same delimiters (e.g. /) always in the same positions.
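
A minimal sketch, assuming the value is written as structure.attribute (as in SHORTREF) and that the corpus has a hypothetical “doc” structure with a “date” attribute holding values such as 2016/01:

STRUCTURE doc {
    ATTRIBUTE date
}
DIACHRONIC "doc.date"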


Location features

PATH

full path of the corpus home directory which contains all data files

INFO

arbitrary corpus information like source, size etc. There is no automatic processing of this data. If the value begins with the “@” character the rest is taken as a full path of a file containing INFO data

INFOHREF

link to arbitrary documentation on the web

VERTICAL

full path of the source vertical text, it is used only by “encodevert” program, if the value starts with “|” the rest is treated as a shell command, and the vertical text will be taken from standard output of the command
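
The “@” and “|” prefixes described above can be sketched as follows. All paths are hypothetical; the example assumes the INFO text is kept in a separate file and the vertical text is stored gzip-compressed, so it is read from the standard output of zcat.

PATH "/corpora/manatee/example"
INFO "@/corpora/info/example.txt"
VERTICAL "| zcat /corpora/vert/example.vert.gz"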

WSBASE

path to compiled word sketches data files, defaults to PATH/WSATTR-ws (prefix), use “none” to disable Word Sketch menu items

WSDEF

path to the word sketches grammar definition file

WSHIST

path to the WS highlights definition file for the findX utility (was:histograms)

WSTHES

path to word sketches thesaurus data files, defaults to PATH/WSATTR-thes

TERMBASE

path to compiled term data files, defaults to PATH/terms-ws (prefix)

TERMDEF

path to term grammar definition file

TAGSETDOC

URL of the tagset documentation, so users can quickly refer to it from a button next to the CQL box in the search interface. If absent, the button does not appear in the interface

SUBCDEF

path for the subcorpus definition file. See Subcorpus config documentation. A subcorpus definition file allows you to share subcorpora with all users of the corpus

SUBCBASE

path for global subcorpora, default PATH/subcorp

GDEXDEFAULTCONF

path to the default GDEX configuration for the given corpus (used only if GDEX is installed)


Structures and Attributes

ATTRIBUTE

This provides the definition of a positional attribute. At least one positional attribute should be defined. The first defined attribute is the default one (in most cases it is the word form and the name of this attribute is “word”). The order matters during encoding: the nth ATTRIBUTE in the corpus config file provides a name for the contents of the nth column in the vertical file. Beyond the initial encoding, the order is only used to display the list of attributes in the concordance “View options” form. Some features of SkE require attributes called ‘tag’, ‘lemma’, ‘lempos’, ‘lc’. Attribute names must start with an alphabetic character or underscore, and subsequent characters must be alphanumeric or underscore, i.e. (‘a’..‘z’|‘A’..‘Z’|‘_’)(‘a’..‘z’|‘A’..‘Z’|‘0’..‘9’|‘_’)*

STRUCTURE

This provides the definition of a structural tag. Structures can themselves have attributes (structural attributes as opposed to the positional attributes described above). Structure names must start with an alphabetic character or underscore, and subsequent characters must be alphanumeric or underscore, i.e. the same criteria as for ATTRIBUTE names above.

ATTRIBUTE and STRUCTURE options can be repeated and enriched with an additional information block, for example with:

  • MULTIVALUE

indicates whether the attribute is multi-valued

  • DEFAULTVALUE

default value for this attribute if not present in the source vertical
[since manatee 2.30] if not overridden by this configuration option, the default DEFAULTVALUE is set to “===NONE===”.

  • MULTISEP

defines the multivalue separator; if empty (“”), the value is split into characters

  • HIERARCHICAL

states that the attribute should be treated as hierarchical. Its value is the separator of the fields in the hierarchy (can be any string). For structural attributes (header fields) only.

  • ATTRDOC

optional link to the attribute values documentation. For structural attributes (header fields) only.

  • ATTRDOCLABEL

name for the ATTRDOC link.

  • NUMERIC

indicates that attribute values will be sorted according to their numeric value. For structural attributes (header fields) only.
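
The following sketch combines several of the options above. Attribute and structure names, the separators, and the use of “yes” for boolean options (as in the TRANSQUERY example elsewhere in this document) are assumptions for illustration only.

ATTRIBUTE word
ATTRIBUTE lemma
ATTRIBUTE tag {
    MULTIVALUE yes
    MULTISEP "|"
}
STRUCTURE doc {
    ATTRIBUTE id
    ATTRIBUTE year {
        NUMERIC yes
    }
    ATTRIBUTE topic {
        HIERARCHICAL "::"
        DEFAULTVALUE "unknown"
    }
}

With this layout the vertical file would carry word, lemma and tag in its first, second and third columns, in that order.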

For advanced topics on attributes and structures, see below.


 

Controlling display in concordances

SHORTREF

the attribute of a structure to display as a default reference in the left hand column of a concordance. Defaults to the first attribute of the first structure or “#” (token number) if no attribute of a structure exists. The syntax is SHORTREF “=structure.attribute”, e.g. “=doc.id” for displaying only the value of “doc.id” or SHORTREF “structure.attribute” (without equal sign) for displaying the pair “structure.attribute=value”. There can be multiple links in SHORTREF, e.g. SHORTREF “=bncdoc.id,#,bncdoc.year” has a reference “J0P,#507890,bncdoc.year=1977”.

SIMPLEQUERY

template for the CQL query that is created from the simple query. Defaults to [lc="%s" | lemma_lc="%s"] (if the *lc attributes are present), otherwise [word="%s" | lemma="%s"] if word and lemma are present, and [word="%s"] if only word is present. Any CQL query template can be used. The string “%s” is replaced by the actual content of the simple query field.

STRUCTATTRLIST

comma-separated list of references that will be used to determine the References list in view options. Defaults to all attributes of the structures specified in the config file.

FREQTTATTRS

comma-separated list of structure attributes that will be used for Frequency -> Text types in the concordance view. Defaults to SUBCORPATTRS. New in bonito version 3.53.

FULLREF

comma-separated list of references which will be displayed as a full reference at the bottom of the window when the user clicks on the SHORTREF for a concordance line. Defaults to the value of STRUCTATTRLIST.
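
A short sketch of how the reference options might be combined. The structure and attribute names are hypothetical; with these settings, the left-hand column would show only the value of doc.id, and clicking it would show the three attributes listed in FULLREF.

SHORTREF "=doc.id"
STRUCTATTRLIST "doc.id,doc.author,doc.year"
FULLREF "doc.id,doc.author,doc.year"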

HARDCUT

maximum number of query result lines in query evaluation, default=0 meaning no limit

MAXKWIC

[since Manatee 2.107]
maximum number of positions in the KWIC of a concordance, default=100 (if you want unlimited KWIC use MAXKWIC=0)

MAXCONTEXT

maximum number of positions in context for displaying and saving concordance, default=100 (if you want unlimited context use MAXCONTEXT=0)

MAXDETAIL

maximum number of positions in the detail view (at the bottom of conc view), default=MAXCONTEXT
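
As an illustration, the limits above could be set like this. The numbers are arbitrary examples, and quoting numeric values follows the MAXLISTSIZE example in this document.

HARDCUT "10000"
MAXKWIC "100"
MAXCONTEXT "200"
MAXDETAIL "400"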

STRUCTCTX

display the whole structure in the detail view (at the bottom of conc view)

WRAPDETAIL

name of the structure that will cause line wrap in the detail context window (new in bonito 2.76), default none

Attributes of structures

In an additional information block of a STRUCTURE option there can be arbitrarily many ATTRIBUTE options (with possible additional option blocks), which can include the following:

LABEL

label used in references instead of <STRUCTURE>.<ATTRIBUTE>
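
A minimal sketch with hypothetical names: references would then show “Document ID” instead of “doc.id”.

STRUCTURE doc {
    ATTRIBUTE id {
        LABEL "Document ID"
    }
}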

DISPLAYTAG

if “1” (the default), XML tags such as <s>, <p> are displayed in concordances; set it to “0” to hide the tag. Use the other DISPLAY… options to modify concordance output.

DISPLAYCLASS

a class of included text; can be used to change style of text in a structure, but to do that also requires adding the given class in the cascading style sheet view.css on the server, for example

STRUCTURE g {
     DISPLAYCLASS "bold"
}

could be used to display heading in bold. Default classes available: concred (red text), concgreen (green text).

DISPLAYBEGIN

text displayed in place of the start tag; for example, one can display a quotation mark instead of <q> and </q>

The special value “_EMPTY_” means display nothing and remove the surrounding spaces; it is used for <g/>:

STRUCTURE g {
	DISPLAYTAG 0
	DISPLAYBEGIN "_EMPTY_"
}

[since manatee 2.28] structure attributes can be displayed using the %(attribute_name) syntax, e.g. if you’d like the structure to be marked by the text “STR-” concatenated with the id attribute of structure str, use the following syntax:

STRUCTURE str {
	ATTRIBUTE id
	DISPLAYTAG 0
	DISPLAYBEGIN "STR-%(id)"
}

DISPLAYEND

same as DISPLAYBEGIN, but for the end tag
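
The quotation-mark case mentioned under DISPLAYBEGIN could be sketched as follows. The choice of guillemets is only an illustration and assumes the corpus encoding can represent them.

STRUCTURE q {
	DISPLAYTAG 0
	DISPLAYBEGIN "«"
	DISPLAYEND "»"
}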

MAXLISTSIZE

in text types, if an attribute has more than 22 possible options, an input text field with autolookup is offered to the user rather than a list of checkboxes. MAXLISTSIZE changes this default threshold. Example:

STRUCTURE document {
       ATTRIBUTE id
       ATTRIBUTE domain {
               MAXLISTSIZE "30"
       }
}

NESTED

Enables nested structures. Note that nested structures may not be supported in all functions and in some cases they may cause a mismatch in frequency figures. They are primarily designed to enable nested error annotation in learner corpora. Nesting is limited to the depth of 100 levels and deep (> 10 levels) nesting may have a noticeable negative performance impact. Generally, it should be avoided if possible by defining a structure for each level of nesting, e.g. section, subsection, subsubsection etc.

STRUCTURE err {
    NESTED 1
}

Controlling Text Types (concordance form and subcorpus creation)

SUBCORPATTRS

Comma-separated list of structure attributes displayed in the query form and in frequency by text types, if FREQTTATTRS is not set. It also determines attributes available for creating subcorpora in the user interface. Use “|” instead of a comma to display attributes on the same row in the subcorpus creation form. Use “|*” instead of a comma to put the next attribute right under the previous attribute (rather than on a new line; introduced in bonito 3.90). Example:

SUBCORPATTRS "bncdoc.alltyp|bncdoc.alltim|*bncdoc.id,bncdoc.wridom|bncdoc.wrimed"

The subcorpus form then contains 2 rows:

1: alltyp and alltim+id
2: wridom and wrimed

If SUBCORPATTRS is not defined, all attributes will be shown in the ‘Text Type’ part of the concordance form (usually not the desired outcome)


Word classes and lemmas

WPOSLIST

list of pairs providing a mapping between a user-friendly name for a word class and a regular expression matching the POS tags which are instances of it. The first character of the string is the separator used to separate values in the rest of the string. If specified, users can select items like ‘noun’, ‘verb’ from a menu when specifying right or left context for a concordance search. Example for the TreeTagger English tagset (a modified version of the Penn tagset):

WPOSLIST ",adjective,JJ.?,adverb,RB.?.,conjunction,CC,determiner,DT,noun,N.*,noun singular,NN,noun plural,NNS,preposition,IN,pronoun,PP,verb,V..?|MD"

LPOSLIST

list of pairs providing a mapping between a word class suffix, and a user-friendly name for the word class. Only makes sense when there is a mechanism in place for relating lemmas to lemmas-with-a-word-class-suffix, so that, for example, brush (noun) and brush (verb) can get different word sketches. The first character of the string is a separator used to separate values in the rest of the string.

Example from BNC:

LPOSLIST ",adjective,-j,adverb,-a,conjunction,-c,noun,-n,preposition,-p,pronoun,-d,verb,-v"

WSPOSLIST

LPOSLIST of word sketch POSes. Same format as, and defaults to, LPOSLIST, but LPOSLIST is used after the Lemma box in the Concordance form whereas WSPOSLIST is used in the Word Sketch and Thesaurus forms. This is deprecated since bonito 3.90; the directive *WSPOSLIST in the sketch grammar should be used instead.

WSATTR

attribute name for which word sketches are computed, defaults to “lempos” if the corpus has that attribute, or “lemma” if the corpus has that attribute, or DEFAULTATTR otherwise

WSSTRIP

number of characters to strip from the end of a word in word sketch listings, defaults to 2 if WSATTR is “lempos”, or 0 otherwise

WSDIFFSTEP

minimum difference of WS scores to highlight in different colors

WSMINHITS

minimum frequency for a word sketch candidate; the default value is “0”. It is useful for filtering out nonsignificant relations (e.g. spam, mistakes) and for faster computation of word sketches in large corpora
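
A hedged sketch of the word sketch options above for a corpus that has a “lempos” attribute with two-character suffixes such as “-n” or “-v”. The threshold of 5 is an arbitrary illustrative value.

WSATTR "lempos"
WSSTRIP "2"
WSMINHITS "5"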


Dynamic attributes

Modifying your corpus configuration files to include definitions of the “lc” and “lemma_lc” attributes can significantly increase the speed of various query operations. See the definition of these attributes.

DYNAMIC

if this option exists, the attribute is a dynamic one and the value of this option is the name of the C function which defines the attribute. One case is the dynamic attribute ‘lemma’ where the field given in the vertical file is ‘lempos’, built from lemma + ‘-’ + a letter to indicate word class, so that the lemma “intend” corresponds to the lempos “intend-v”. Then lemma is a dynamic attribute, with the associated function stripping off the last two characters of the lempos. The mechanism is used in the BNC to support querying word sketches, which are word class specific and so are defined for a lempos.

Here is the definition of the ‘lemma’ dynamic attribute. The embedded features used are documented below.

ATTRIBUTE   lemma {
     DYNAMIC striplastn
     DYNLIB  internal
     ARG1    "2"
     FUNTYPE i
     FROMATTR lempos
     TYPE   index
}

DYNLIB

dynamic library containing given function

FUNTYPE

type of given function

  • 0 – no extra argument
  • c – one char extra argument
  • s – one (const char*) extra argument
  • i – one int extra argument
  • cc – two char extra arguments
  • ii – two int extra arguments
  • ss – two (const char*) extra arguments
  • ci – two extra arguments, first char, second int
  • cs, sc, si, ic, is – likewise

ARG1

the first optional fixed parameter

ARG2

the second optional fixed parameter

FROMATTR

the name of the attribute from which the dynamic attribute is created

DYNTYPE

type of the dynamic attribute, possible values are plain, lexicon, index (default) and freq

  • plain – only displaying is enabled
  • lexicon – displaying and counting (frequency distribution) are enabled
  • index – all features including querying are enabled
  • freq – like index, but with frequencies for each attribute value being precompiled. This should be used for cases where lots of source attribute values are mapped to a single target (dynamic) attribute value (e.g. URL to top level domain name) where recomputing frequencies from source attribute may take a long time.

TRANSQUERY

use the transformation function for queries (multivalues not supported). Example:

ATTRIBUTE   lc {
	DYNAMIC  lowercase
	DYNLIB   internal
	ARG1     "C"
	FUNTYPE  s
	FROMATTR word
	TYPE     index
	TRANSQUERY	yes
}

This means that, for the query [lc="Test"], we apply the function “lowercase” to the argument “Test” and search for “test”; without TRANSQUERY, we would search for “Test” and find nothing.
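
By analogy with the “lc” example above, a “lemma_lc” attribute could be sketched as follows. This simply mirrors the options of the “lc” definition with FROMATTR changed to “lemma”; it assumes the corpus has a “lemma” attribute and is not taken from an actual corpus configuration.

ATTRIBUTE   lemma_lc {
	DYNAMIC  lowercase
	DYNLIB   internal
	ARG1     "C"
	FUNTYPE  s
	FROMATTR lemma
	TYPE     index
	TRANSQUERY	yes
}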

Wordcount

The structure attribute “wordcount” represents the number of words in a structure. The value is calculated during compilation, therefore the attribute should not be present in the vertical file.

STRUCTURE doc {
    ATTRIBUTE wordcount
}

Data representation options

In the following, a list of attribute and structure types is given which can be used to change the way the index is built or accessed at runtime. To place them in the configuration file, modify the STRUCTURE or ATTRIBUTE as follows:

ATTRIBUTE/STRUCTURE <attribute_name> {
      TYPE "<type_name>"
}

e.g.

ATTRIBUTE word {
      TYPE "FD_FGD"
}

Attribute types

The names follow the pattern “<REVFILE>_<TEXTFILE>” where <REVFILE> specifies handling of the .rev index file and <TEXTFILE> specifies handling of the .text index file. The lexicon (.lex) is always memory mapped. If <REVFILE> or <TEXTFILE> starts with “M”, it denotes memory mapping, with “F” it denotes file based access without memory mapping. Generally switching between types that differ only in the M/F letters does not require recompilation, switching between any other types always requires recompilation of the corpus.

MD_MD (default for positional attributes)

Both .rev and .text file use the delta-code index and are memory mapped. Cannot be used for .text files exceeding 500 MB.

FD_MD

As above but the .rev file is not memory mapped while the .text file is.

FD_FD

As above but neither the .rev nor the .text index is memory mapped.

FFD_FD

As above, and the .rev.idx index is not memory mapped either.

FD_FBD

TBA

FD_FGD

Type of an attribute which should be used for any corpora with the main binary file (.text) bigger than 500MB (approx 250M tokens, depending on lexicon size). Neither the .rev nor the .text index is memory mapped, the latter one uses giga delta-codes.

MD_MGD

As above except both .rev and .text are memory mapped.

NoMem

As above but none of the .rev, .rev.idx, .text and .text.off indices are memory mapped. In addition, the statistics indices are not memory mapped either.

MD_MI (default for structure attributes)

Both .rev and .text files are memory mapped, but the latter one uses plain integers instead of delta-codes.

FD_MI

As above but the .rev index is not memory mapped.

UNIQUE

Type of an attribute which can be used if the attribute values are unique. The compilation of such an attribute is much faster then and the indices require less space (no reverse index (.rev.*) and .text.* needed)

Structures

Switching among the default (not given) type, file32 and map32 or among file64 and map64 does not require recompilation, any other switching of types requires recompilation of the corpus.

<unspecified> (default)

The .rng file is not memory mapped and not cached. Supports corpora up to 2^31 positions.

file32

As above but the .rng file is not memory mapped but cached.

map32

As above but the .rng file is memory mapped.

file64

Enables the range file (.rng) to address 2^63 corpus positions. It is necessary to use when there are more than 2^31 tokens in the corpus. The .rng file is not memory mapped but cached.

map64

As above but the .rng file is memory mapped. Do not use when the range file is too large so as not to allocate too much system resources (.rng must fit into the system memory).
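
For instance, a very large corpus (more than 2^31 tokens) might declare its document structure as below. The structure name is hypothetical, and the quoting of the type name follows the TYPE "FD_FGD" example above.

STRUCTURE doc {
      TYPE "file64"
}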

