Adding structures and metadata

Adding structures, structural attributes and values makes it possible to annotate a corpus, that is, to add metadata to it. Document, paragraph and sentence structures are normally added automatically when building a corpus in Sketch Engine, but other structures must be added manually if required. If a corpus is annotated with metadata (text types), any search in Sketch Engine can be limited to specific text types using the text type selector.

If you are new to corpus annotation, you might like to read this blog post first.

Procedure in a nutshell

(If your corpus is in Sketch Engine, first download it.)

  • Open the corpus in a plain text editor or annotation software.
  • Add structures, attributes and values.
  • Upload it to Sketch Engine. Attributes and values will be processed into text types automatically.

Terminology and format

Metadata can only be added to structures (document, sentence, paragraph, noun phrases and others that exist in the corpus or that the user introduces into the data). The structure must surround the text to be annotated.

To annotate a sentence, a sentence structure must mark the beginning and end of the sentence. The annotation (attributes and values) is then added to the opening tag of the structure.

An example of a sentence annotation:

<s direct_speech="yes" type="question">Have you had time to think it over?</s>

An example of a noun phrase annotation:

<s direct_speech="yes" type="affirm">I like <n_phrase type="noun-of-noun" words="5">the colour of your boots</n_phrase>.</s>

s and n_phrase are structure names
Structure names must be enclosed in angle brackets <> and can only use letters a-z, A-Z, numbers 0-9 and underscore (_). A structure name cannot start with a number.

direct_speech, type and words are attribute names
Attribute names can only use letters a-z, A-Z, numbers 0-9 and underscore (_). Attributes can be multivalue or even hierarchical.
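
The naming rules above can be checked before uploading. The following is a minimal sketch in Python, not part of Sketch Engine; the function names are hypothetical, and the attribute check assumes only the character rule stated above, since no leading-digit rule is given for attributes.

```python
import re

# Structure names: letters, digits, underscore; must not start with a digit.
STRUCTURE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Attribute names: letters, digits, underscore (assumption: no further restriction).
ATTRIBUTE_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_structure_name(name: str) -> bool:
    """Return True if the whole name follows the structure naming rule."""
    return STRUCTURE_NAME.fullmatch(name) is not None


def is_valid_attribute_name(name: str) -> bool:
    """Return True if the whole name uses only the allowed characters."""
    return ATTRIBUTE_NAME.fullmatch(name) is not None
```

With these checks, n_phrase and direct_speech are accepted, while a name such as noun-phrase (hyphen) or 2nd (leading digit) is rejected as a structure name.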

yes, question, noun-of-noun and 5 are values
Values must be enclosed in plain (straight) double quotes; curly typographic quotes are not allowed. Values can contain any characters, including accented characters. If a double quote is part of a value, it must be escaped with a backslash: \"
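
When annotation is generated by a script rather than typed by hand, the quoting rule can be applied automatically. The sketch below is only an illustration in Python: escape_value and opening_tag are hypothetical helper names, and backslashes are left untouched because the rules above only mention escaping double quotes.

```python
def escape_value(value: str) -> str:
    """Escape double quotes inside a value with a backslash."""
    return value.replace('"', '\\"')


def opening_tag(structure: str, attributes: dict) -> str:
    """Build an opening tag such as <doc type="spoken">.

    Each attribute is written as name="value" with straight double quotes
    and no spaces around the equals sign.
    """
    parts = [structure]
    for name, value in attributes.items():
        parts.append(f'{name}="{escape_value(str(value))}"')
    return "<" + " ".join(parts) + ">"


if __name__ == "__main__":
    print(opening_tag("s", {"direct_speech": "yes", "type": "question"}))
```

Building tags this way keeps the quoting consistent across a large file, which is hard to guarantee when attributes are typed manually.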

No spaces are allowed around the equals sign between an attribute and its value.

wrong:

<doc type = "spoken">

correct:

<doc type="spoken">
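
A stray space around the equals sign is easy to miss in a large file. The following Python sketch shows one possible way to detect it; the function name is hypothetical. Quoted values are blanked out first so that an equals sign inside a value is not reported.

```python
import re

# A quoted value, allowing backslash-escaped characters such as \" inside it.
QUOTED_VALUE = re.compile(r'"(?:\\.|[^"\\])*"')


def has_space_around_equals(tag: str) -> bool:
    """Return True if an equals sign outside a quoted value has whitespace next to it."""
    without_values = QUOTED_VALUE.sub('""', tag)
    return re.search(r"\s=|=\s", without_values) is not None
```

Applied to the two tags above, the check flags the first one (with spaces) and accepts the second.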

Automatic structures

Documents uploaded to Sketch Engine are automatically surrounded by the document structure. Sentences are automatically recognized and surrounded by the sentence structure. Paragraph structures are only inserted automatically into web pages downloaded by Sketch Engine.

Examples

Annotation with structures but without attributes and values.

<doc>
     <p>
          <s>My Bonnie lies over the ocean</s>
          <s>My Bonnie lies over the sea</s>
     </p>
     <p>
          <s>My Bonnie lies over the ocean</s>
          <s>Oh, bring back my Bonnie to me</s>
     </p>
</doc>

The indentation can be used for the user’s convenience. White space between structures is ignored. The same data in one line will still be processed correctly:

<doc><p><s>My Bonnie lies over the ocean</s><s>My Bonnie lies over the sea</s></p><p><s>My Bonnie lies over the ocean</s><s>Oh, bring back my Bonnie to me</s></p></doc>
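
The equivalence of the two layouts can be illustrated with a short Python sketch that removes white space between structures. This is only an illustration of the rule, not how Sketch Engine processes the data internally, and the function name is hypothetical.

```python
import re


def collapse_between_tags(text: str) -> str:
    """Remove white space that sits between one tag's '>' and the next tag's '<'.

    White space inside the text of a sentence is left as it is.
    """
    return re.sub(r">\s+<", "><", text.strip())


if __name__ == "__main__":
    indented = """<doc>
     <p>
          <s>My Bonnie lies over the ocean</s>
          <s>My Bonnie lies over the sea</s>
     </p>
</doc>"""
    print(collapse_between_tags(indented))
```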

Metadata

Metadata consist of the attribute (the type of metadata, e.g. publication year) and the value (the actual metadatum, e.g. 1968). The attribute name can use letters of the English alphabet, numbers and the underscore (_). The attribute name can be abbreviated, and the corpus can be configured to present the user with a human-friendly name. For example, the corpus can contain an abbreviated attribute name while the configuration file is edited to show it in the interface as Year of publication.

An example of a corpus consisting of 2 files, with structures and structure attributes (metadata).

<doc pub="1970" lang="en">
     <p style="informal">
          <s><pers gender="female">Rebecca</pers> has worked with a full range of clients including <brand sect="automotive">BMW</brand> and <brand sect="air">Airbus</brand>.</s>
          <s> some text </s>
     </p>
     <p style="formal">
          <s>some text </s>
          <s>some text </s>
     </p>
</doc>
<doc pub="1977">
     <p style="informal">
          <s>some text </s>
          <s> some text </s>
     </p>
     <p style="informal">
          <s>some text </s>
          <s>some text </s>
     </p>
</doc>
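
Before uploading, it can be useful to check which attributes and values a file actually contains. The Python sketch below tallies them from the opening tags; it is a rough illustration only. The function name and the "structure.attribute" key format are choices made for this example, and the parser assumes tags written according to the rules described above.

```python
import re
from collections import Counter, defaultdict

# An opening tag: structure name followed by zero or more name="value" pairs.
TAG = re.compile(
    r'<([A-Za-z_][A-Za-z0-9_]*)((?:\s+[A-Za-z0-9_]+="(?:\\.|[^"\\])*")*)\s*>'
)
# A single name="value" pair; the value may contain backslash-escaped characters.
ATTR = re.compile(r'([A-Za-z0-9_]+)="((?:\\.|[^"\\])*)"')


def collect_text_types(corpus: str):
    """Count how often each value occurs for each structure attribute."""
    counts = defaultdict(Counter)
    for tag in TAG.finditer(corpus):
        structure, attr_text = tag.group(1), tag.group(2)
        for name, value in ATTR.findall(attr_text):
            counts[f"{structure}.{name}"][value] += 1
    return counts


if __name__ == "__main__":
    sample = (
        '<doc pub="1970" lang="en"><p style="informal"><s>some text</s></p>'
        '<p style="formal"><s>some text</s></p></doc>'
        '<doc pub="1977"><p style="informal"><s>some text</s></p></doc>'
    )
    for key, values in collect_text_types(sample).items():
        print(key, dict(values))
```

A tally like this makes inconsistencies visible, such as an attribute that is present in one document but missing in another.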

Document annotation tool

The built-in annotation tool allows adding metadata to documents easily.

[Image: metadata annotation]

Other annotation tools

The Sketch Engine interface only allows assigning metadata to documents. To insert or annotate other structures, use a plain text editor or an external annotation tool.

Annotation tools are usually designed for a specific annotation task. General-purpose annotation tools are not easy to find. The UAM Corpus Tool is worth trying.