Corpus structures – parts of a corpus

A corpus is a collection of a very large amount of text that is used, together with a suitable corpus management software such as Sketch Engine, to learn about how language is used. It has become an indispensable tool for all modern linguists and lexicographers.

A text corpus can consist of only one very long line of text. This is, however, impractical for many reasons. Dividing a corpus into smaller parts makes it possible to include/exclude certain parts when searching.

Similarly, it allows the user to count how many files, texts or documents contain a particular word or phrase in order to check how evenly it is distributed across the corpus and judge whether the word is in general use or limited to specific topics only. It might also be useful to know whether a word tends to appear in long sentences (suggesting it might be a formal word) or in very short sentences (suggesting the word tends to be used in informal spoken language). To do this, a corpus has to be equipped with marks or labels indicating the beginnings and ends of such parts. These marks or labels are called structure tags and the parts of a corpus they mark are called structures. The most typical parts are files, paragraphs and sentences.

Annotation tool

The built-in annotation tool allows adding metadata to documents easily.

metadata annotation

While a corpus without structures remains usable in many respects, it is treated as one long continuous line of text. Searching for the word look followed by the word up at the distance of 1 to 3 words from each other will also find instances where one word is a part of one sentence and the other word in the following sentence. This might be unwanted behaviour. It is the structures that will make it possible to take sentence boundaries into account.

Corpus management software generally does not prescribe (and neither does Sketch Engine) what structures should be included in the corpus and what they should look like. It is, however, advisable to include at least the basic set marking the beginning and end of a document, paragraph and sentence. By default, Sketch Engine will try to identify these three structures when uploading content and will supply the corresponding structure tags automatically.

The basic structure of a corpus with one document, two paragraphs and two sentences in each of the paragraphs might look like this:

<doc>
     <p>
          <s>My Bonnie lies over the ocean</s>
          <s>My Bonnie lies over the sea</s>
     </p>
     <p>
          <s>My Bonnie lies over the ocean</s>
          <s>Oh, bring back my Bonnie to me</s>
     </p>
</doc>

The indentation above is used purely for the reader’s convenience. The data can also be uploaded as one line of text and Sketch Engine will still process the structures correctly:

<doc><p><s>My Bonnie lies over the ocean</s><s>My Bonnie lies over the sea</s></p><p><s>My Bonnie lies over the ocean</s><s>Oh, bring back my Bonnie to me</s></p></doc>

The corpus author might decide to use different tags, e.g. document, para or sent or use any other sequence of characters provided they observe the required conventions. In Sketch Engine, the format requires angle brackets and the closing tag must have a slash. This format is generally accepted by other corpus software too.

The author may also decide to include other marks. For a corpus that contains lots of direct speech, it might be useful to mark the beginning and end of the direct speech so that searches and queries can be restricted only to direct speech. Marking the beginning and end of proverbs is definitely going to be useful for a research into their use. It is totally up to the corpus author to decide which structures will be included.

Corpora in Sketch Engine preserve the structures created by corpus authors. For corpora created in Sketch Engine by crawling the web, at least the minimum set (file, paragraph and sentence) is inserted automatically so that users can look up sentences containing or not containing certain phrases for example.

Corpus annotation and metadata

Each structure can, but does not have to, have additional labels giving more specific information about the structure. These are called meta data or structure attributes. For example, the structure might carry information about the year of publication, the genre, the dialect, the style, author, source, simply anything that the author wants to include. 

If the corpus author is unsure as to what to include, it is best to include all available metadata. A corpus without metadata can still be used for many tasks but metadata (information about the text) cannot be used in searches or analysis.

 Meta data (or structure values) are automatically processed by Sketch Engine into text types, making it easy to set search criteria or build subcorpora from texts belonging to the same category.

 An example of a corpus consisting of two documents, each document has two paragraphs and paragraph has two sentences with information about the year of publication of the whole document, the style of each paragraph.

<doc pub="1970" lang="en">
     <p style="informal">
          <s><pers gender="female">Rebecca</pers> has worked with a full range of clients including <brand sect="automotive">BMW</brand> and <brand sect="air">Airbus</brand>.</s>
          <s> some text </s>
     </p>
     <p style="formal">
          <s>some text </s>
          <s>some text </s>
     </p>
</doc>
<doc pub="1977">
     <p style="informal">
          <s>some text </s>
          <s> some text </s>
     </p>
     <p style="informal">
          <s>some text </s>
          <s>some text </s>
     </p>
</doc>

In the example above, the author also enclosed names of people (Rebecca) and brands (BMW and Airbus) in structure tags and enriched the latter with information about the industry sector.

Automatic structures and metadata

Corpora in Sketch Engine preserve the metadata as provided by corpus authors. In addition, these structures will be inserted automatically:

Paragraph and sentence structures will be inserted in both cases.

Metadata have to be added externally using a text editor or a suitable tool, however, corpora created by crawling the web are automatically enriched with the following metadata added to the file structure:

  • url of the downloaded page (e.g. www.bignews.info/international/flight-to-moon.html)
  • website (e.g. cnn.com, va.gov …)
  • top-level domain (e.g. com, org, eu, co.uk, cz…)

User corpora created from the web keep the url of each downloaded page.

When building a corpus from your own files, it is recommended to think about structures and metadata and add them before uploading the data to Sketch Engine. In case of any doubts, the user is invited to request assistance and advice from support@sketchengine.eu

Topic classification
corpus from the web
blog: pos tags

POS tags

OneClick Terms - multi-word term extraction
Screenshot of thesaurus from esTenTen Spanish corpus

Automatic thesaurus