For many kinds of language study, text type is important. If we wish to describe the behaviour of a word, phrase, or grammatical construction, it is always relevant to ask whether it occurs across all varieties of the language, or mostly in one dialect or one domain, or only in informal language. (We follow Biber (1989) in using ‘text type’ as a cover-all term for the many ways in which we might classify one text, or discourse, as being of a different type to another.)

The Sketch Engine supports research into text type distinctions by making it easy for users to constrain searches to particular text types, and by providing analyses of the frequency of a word, phrase or other unit by text type. (See the ‘text type’ options in the main concordance window and, once a concordance is being viewed, ‘text type’ options under the ‘frequency’ function.)

These functions only work well if

  • the documents in the corpus have been classified for text type
  • the corpus has been prepared in a way that makes the text type information accessible to the software.

The basic method is this. The corpus is, we assume, structured as a set of documents. In the vertical file there is a structural unit, let’s call it <doc>, for each document. Text type information is associated with the <doc> element as a series of XML attribute-value pairs. For example, if the text type features are ‘region’ and ‘domain’, and a particular document is Australian and about sport, we might have

<doc region="Aus" domain="sport">
The document content is here.
</doc>

The opening <doc> tag is the document’s header. (Attribute names cannot contain characters other than a-z, A-Z, 0-9 and underscore. Double-quoting values is mandatory. The markup must be valid XML, and all control characters must be replaced by the corresponding entities.) Text type information is sometimes also called ‘header information’. In addition, each feature-value pair specifies a subcorpus consisting of all the documents having that pair, so text type information is sometimes also called ‘subcorpus information’.

There may be any number of feature-value pairs. In contrast to approaches to document headers found in, for example, the Text Encoding Initiative, SkE document headers are flat lists of feature-value pairs, not structured objects.
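The header rules above can be illustrated with a small script. This is only a sketch of one way to generate an opening <doc> tag from a flat set of feature-value pairs; the function name doc_header is hypothetical and not part of the Sketch Engine, and the sketch does not deal with control characters, which would need separate handling.

```python
import re
from xml.sax.saxutils import escape

# Attribute names may contain only a-z, A-Z, 0-9 and underscore.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def doc_header(features):
    """Return an opening <doc> tag for a flat dict of feature-value pairs.

    Hypothetical helper for illustration only.
    """
    parts = []
    for name, value in features.items():
        if not VALID_NAME.fullmatch(name):
            raise ValueError(f"invalid attribute name: {name!r}")
        # Escape &, < and > (done by escape itself) plus the double quote,
        # so that the value can always be wrapped in double quotes.
        safe_value = escape(str(value), {'"': "&quot;"})
        parts.append(f'{name}="{safe_value}"')
    if not parts:
        return "<doc>"
    return "<doc " + " ".join(parts) + ">"


print(doc_header({"region": "Aus", "domain": "sport"}))
```

A name containing a space or a hyphen is rejected with a ValueError, which catches the problem before the vertical file is compiled.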

For the Sketch Engine to make the information available for searching, it needs to know about the features. For how they are specified, see SkE/CorpusConfig.

Recommendations on text type feature design

Usually, when a corpus is being prepared for the Sketch Engine, it already comes with some header information. The simplest thing to do is to format that information as feature-value pairs without further review. This often does not work well. Information may have been included in headers for a number of reasons and will often include copyright status or a log of who did what and when, and will often not be complete or consistent. While there is no harm in including the copyright or log information in headers, it is not likely to be of use for linguistic research.

The person preparing the corpus needs to ask “what subcorpora would the users like to be able to specify, in order to constrain their searches?”

And then “is that information already in the headers, and if it is not explicit, is it implicit?”

The structure and attribute names in the actual data have to correspond to those in the corpus configuration file. Note that the processing is case sensitive.

Not too many subcorpora, and keep them large

Most corpora only support a limited number of linguistically useful subcorpora, and if subcorpora are to be used to constrain searches, a subcorpus must be quite large, or most searches will return no hits. This fits with a user interface consideration: we want to present users with a limited number of options, all of which they understand, on a single screen. For all of these reasons, as a rule of thumb, we suggest that the team preparing the corpus focus on not more than ten features which are likely to be useful for creating subcorpora, with each feature having not more than ten values, and each feature-value pair accounting for at least 2-3% of the whole corpus.
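The rule of thumb above can be checked mechanically before the corpus is compiled. The following is a minimal sketch, not a Sketch Engine tool: the function audit_features and its input format (one pair of token count and feature dictionary per document) are assumptions made for illustration, size is measured in tokens, and the threshold uses the lower end (2%) of the suggested range.

```python
from collections import Counter, defaultdict


def audit_features(docs, max_features=10, max_values=10, min_share=0.02):
    """Return warnings for feature designs that break the rule of thumb.

    docs is a list of (token_count, {feature: value}) pairs; this input
    format is a hypothetical one chosen for the example.
    """
    total = sum(size for size, _ in docs)
    if total == 0:
        return []
    # sizes[feature][value] = number of tokens carrying that pair
    sizes = defaultdict(Counter)
    for size, features in docs:
        for name, value in features.items():
            sizes[name][value] += size
    warnings = []
    if len(sizes) > max_features:
        warnings.append(f"{len(sizes)} features (more than {max_features})")
    for name, counts in sizes.items():
        if len(counts) > max_values:
            warnings.append(f"{name}: {len(counts)} values (more than {max_values})")
        for value, size in counts.items():
            share = size / total
            if share < min_share:
                warnings.append(f"{name}={value}: {share:.1%} of corpus")
    return warnings


docs = [(9900, {"mode": "written"}), (100, {"mode": "spoken"})]
print(audit_features(docs))
```

Each warning points to a feature or value that is a candidate for merging with another one, or for being dropped from the list offered to users.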

For example, for the English component of the NCI (New Corpus for Ireland), the features and their possible values are

genre:     imaginative, informative
mode:      spoken, written
region:    Irish, British, American
ie region: North, South, East, West, u 
     (applies only to Irish English, all else is u(nclassified))
genre2:    arts/culture, business/finance, drama, fiction, govt,
           hard/applied-science, information, leisure, 
           non-fiction, politics, religion/philosophy, 
           social-science, u
medium:    book, conversation, newspaper, official-govt, 
           periodical, unpublished, website, u
           bnc, gigaword, lexmc, limerick-corp, nitcs, web

While the list displays a range of anomalies, it also shows an attempt to bring a range of kinds of material from a range of types of sources (as listed as values of the last feature; we used several existing corpora as input) into a coherent and usable whole.

Implicit information

There is much information available that is implicit. Two examples: if there is date information available, then a feature for decade can be built. If the corpus spans several decades, then this will be a useful feature for exploring language change. The date feature by itself will have too many values, each accounting for too little data, to be useful.

Second: many corpora are built up from a number of newspapers as well as other sources, with the name of the newspaper held somewhere in the header or filename. But the newspaper name is not directly useful to users for building subcorpora. A ‘medium’ feature which takes the value ‘newspaper’ for all newspaper material will be useful and can be inferred from information that is available.

In sum: the people preparing the corpus need to consider what subcorpora will be useful to their users, and then work out how the corresponding features can be built for all or most documents, given the information available in the document headers, filenames, or anywhere else.
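The two examples of implicit information above can be sketched as a small derivation step. Everything specific here is an assumption for illustration: the raw header fields ‘year’ and ‘source’, the newspaper names, and the function derive_features are hypothetical, and the value ‘u’ for unclassified follows the convention of the NCI list.

```python
# Hypothetical newspaper names; a real list would come from the corpus sources.
NEWSPAPERS = {"Example Gazette", "Sample Herald"}


def derive_features(raw_header):
    """Build 'decade' and 'medium' features from raw header fields.

    raw_header is a dict that may hold a 'year' and a 'source' field
    (both field names are assumptions made for this sketch).
    """
    features = {}

    year = raw_header.get("year")
    if year is None:
        features["decade"] = "u"
    else:
        # 1987 -> 1980s: a few large values instead of one value per year
        features["decade"] = f"{int(year) // 10 * 10}s"

    # The newspaper name itself is not useful for subcorpora, but it
    # tells us that the medium is 'newspaper'.
    if raw_header.get("source") in NEWSPAPERS:
        features["medium"] = "newspaper"
    else:
        features["medium"] = "u"
    return features


print(derive_features({"year": "1987", "source": "Example Gazette"}))
```

The derived pairs can then be written into the <doc> header alongside, or instead of, the raw fields they were inferred from.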

Hierarchical headers

Some header fields, especially ones with many possible values, can be structured into a hierarchy. Provided these headers are set up properly in the corpus, the system can handle a hierarchy and present it within Sketch Engine in a user-friendly way, as described below.


How to set up a hierarchical header

The display in the interface is driven by the data, i.e. by the vertical file. Each header field must list all the categories that the document belongs to, in the form of complete “paths” in the hierarchy. For example, for a document that belongs to the classes “sport”, “sport->football”, “business”, “business->football” and “business->football->manager”, the corresponding record will look as follows:

<doc domain="sport,sport::football,business,business::football,business::football::manager">

where ‘,’ is the multivalue separator (can be any character) and ‘::’ is the hierarchy labels separator (can be any string). These two values are specified in the corpus config file.

Also, the hierarchical headers need to be specified as MULTIVALUE in the corpus config file. Example of a hierarchical attribute definition in the corpus config file:

   ATTRIBUTE domain {
      MULTIVALUE  "1"
      MULTISEP ","
      HIERARCHICAL   "::"
   }

Please note that such settings cannot be created by automatic means in the Corpus Architect. You have to edit the corpus config file directly, e.g. by switching the corpus to the expert mode.
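Writing out complete paths by hand is error-prone, so the attribute value for a hierarchical header is best generated. The sketch below assumes, as in the ‘domain’ example above, that every ancestor category is listed along with the category itself; the function name hierarchical_value is hypothetical, and the two separators default to the ones used in the example and must match the corpus config file.

```python
def hierarchical_value(paths, multisep=",", hiersep="::"):
    """Expand category paths so that every ancestor path is listed too.

    paths is a list of the most specific categories of a document,
    e.g. ["sport::football"]. Hypothetical helper for illustration.
    """
    expanded = []
    for path in paths:
        labels = path.split(hiersep)
        # Emit "a", then "a::b", then "a::b::c", skipping duplicates
        for depth in range(1, len(labels) + 1):
            prefix = hiersep.join(labels[:depth])
            if prefix not in expanded:
                expanded.append(prefix)
    return multisep.join(expanded)


value = hierarchical_value(["sport::football", "business::football::manager"])
print(f'<doc domain="{value}">')
```

For the two paths given, this produces the same value as the ‘domain’ attribute in the record shown earlier.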