Preparing a Text Corpus for Sketch Engine: Overview

This page describes how to prepare a text corpus for indexing by Manatee, the corpus management system that serves as the database backend of Sketch Engine.

Text corpus from a technical point of view

Informally, a text corpus is usually defined as something close to “any collection of texts in electronic form”. More formally, a corpus source text consists of:

- positions, i.e. individual occurrences of tokens in the texts, where each position has associated attributes such as word, lemma or tag
- structures, i.e. corpus segments (ranges) that span part of the corpus and are defined by their beginning and ending positions; they usually denote documents, paragraphs or sentences
- structure attributes, i.e. attributes of individual structures that hold metadata about them, such as the date of creation or the author

Structures and structure attributes are sometimes referred to as headers or corpus metadata.

The example below illustrates these notions on a sample vertical text, i.e. a text with one token or structure tag per line:

DESCRIPTION                                      CORPUS VERTICAL TEXT

Beginning of structure "doc"
with 2 structure attributes "author" and "year": <doc author="Shakespeare" year="1603">
Beginning of structure "p" for a paragraph:      <p>
Beginning of structure "s" for a sentence:       <s>
Position #0 -- all positions have 3 attributes
separated by a tab character.                    To        to        PREPOSITION
Position #1                                      be        be        VERB
Empty structure "g" denoting a "glue" 
(no space separation) between tokens:            <g/>
Position #2                                      ,         ,         PUNCTUATION
Position #3                                      or        or        CONJUNCTION
Position #4                                      not       not       PARTICLE
Position #5                                      to        to        PREPOSITION
Position #6                                      be        be        VERB
Empty structure "g"                              <g/>
Position #7                                      ,         ,         PUNCTUATION
Position #8                                      that      that      PRONOUN
Position #9                                      is        be        VERB
Position #10                                     the       the       DETERMINER
Position #11                                     question  question  NOUN
Empty structure "g"                              <g/>
Position #12                                     .         .         PUNCTUATION
End of the last structure "s"                    </s>
End of the last structure "p"                    </p>
End of the last structure "doc"                  </doc>
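
To make the three notions concrete, the following Python sketch reads the sample vertical text into positions, structures and structure attributes. It is an illustration only, not part of Manatee: the names `parse_vertical` and `Structure` are hypothetical, the attribute names `word`, `lemma` and `tag` are assumed from the example, and recording each structure as a half-open range of position indices is a choice made for this sketch.

```python
import re
from dataclasses import dataclass, field

# A line is treated as a structure tag if it looks like <name ...>, </name> or <name/>.
TAG_RE = re.compile(r'^<(/?)([A-Za-z_][\w.-]*)((?:\s+[\w.-]+="[^"]*")*)\s*(/?)>$')
ATTR_RE = re.compile(r'([\w.-]+)="([^"]*)"')


@dataclass
class Structure:
    name: str
    begin: int                # index of the first position inside the structure
    end: int = -1             # index one past the last position (sketch convention)
    attrs: dict = field(default_factory=dict)


def parse_vertical(text, attr_names=("word", "lemma", "tag")):
    """Split vertical text into a list of positions and a list of structures."""
    positions, structures, open_by_name = [], [], {}
    for line in text.splitlines():
        if not line:
            continue
        match = TAG_RE.match(line)
        if match is None:
            # A token line: attribute values separated by tab characters.
            positions.append(dict(zip(attr_names, line.split("\t"))))
            continue
        closing, name, raw_attrs, empty = match.groups()
        here = len(positions)
        if closing:
            stack = open_by_name.get(name)
            if not stack:
                raise ValueError(f"</{name}> without a matching <{name}>")
            stack.pop().end = here
        else:
            struct = Structure(name, here, attrs=dict(ATTR_RE.findall(raw_attrs)))
            structures.append(struct)
            if empty:
                struct.end = here          # empty structure such as <g/>
            else:
                open_by_name.setdefault(name, []).append(struct)
    return positions, structures


SAMPLE = "\n".join([
    '<doc author="Shakespeare" year="1603">', "<p>", "<s>",
    "To\tto\tPREPOSITION", "be\tbe\tVERB", "<g/>", ",\t,\tPUNCTUATION",
    "or\tor\tCONJUNCTION", "not\tnot\tPARTICLE", "to\tto\tPREPOSITION",
    "be\tbe\tVERB", "<g/>", ",\t,\tPUNCTUATION", "that\tthat\tPRONOUN",
    "is\tbe\tVERB", "the\tthe\tDETERMINER", "question\tquestion\tNOUN",
    "<g/>", ".\t.\tPUNCTUATION", "</s>", "</p>", "</doc>",
])

positions, structures = parse_vertical(SAMPLE)
print(len(positions), "positions;", len(structures), "structures")
print(structures[0])
```

In this sketch the "doc" structure spans all positions of the sample and carries the two structure attributes, while each "g" structure is an empty range located just before the punctuation token it glues to the preceding word.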

Steps to prepare a text corpus for Sketch Engine

1. Prepare the source data, including both
   - corpus text (positions)
   - corpus headers (structures)
2. Prepare the corpus configuration file
3. (optionally) Prepare the subcorpus configuration file
   This step is needed if you wish to compile subcorpora that can be shared by multiple users.
4. (optionally) Prepare or reuse a word sketch definition file
   This step is needed if you require word sketches or a thesaurus (the thesaurus takes the word sketch database as input).
5. Compile (index) the corpus
6. Verify corpus consistency, integrity and completeness
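
As a rough illustration of checking the source data before compilation, the Python sketch below scans vertical text for two kinds of problems: token lines with an unexpected number of attributes, and structure tags that are closed without being opened or opened without being closed. The function `check_vertical` is hypothetical and not a Sketch Engine or Manatee tool; how the actual compiler treats such input is not covered on this page, so the checks only reflect the conventions visible in the sample above.

```python
import re

# Matches <name ...>, </name> and <name/>; anything else is treated as a token line.
TAG_RE = re.compile(r'^<(/?)([A-Za-z_][\w.-]*)(?:\s[^<>]*?)?(/?)>$')


def check_vertical(lines, expected_columns):
    """Return a list of human-readable problems found in vertical text lines."""
    problems = []
    open_lines = {}  # structure name -> line numbers of still-open tags
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # blank lines are ignored in this sketch
        match = TAG_RE.match(line)
        if match is None:
            found = len(line.split("\t"))
            if found != expected_columns:
                problems.append(
                    f"line {lineno}: expected {expected_columns} attributes, found {found}"
                )
            continue
        closing, name, empty = match.groups()
        if empty:
            continue  # empty structure such as <g/>
        if not closing:
            open_lines.setdefault(name, []).append(lineno)
        elif open_lines.get(name):
            open_lines[name].pop()
        else:
            problems.append(f"line {lineno}: </{name}> has no matching opening tag")
    for name, linenos in open_lines.items():
        for lineno in linenos:
            problems.append(f"line {lineno}: <{name}> is never closed")
    return problems


good = ["<doc>", "<s>", "To\tto\tPREPOSITION", "be\tbe\tVERB", "</s>", "</doc>"]
bad = ["<doc>", "<s>", "To\tto", "</p>", "</doc>"]

print(check_vertical(good, expected_columns=3))
for problem in check_vertical(bad, expected_columns=3):
    print(problem)
```

Open tags are tracked per structure name rather than on a single stack, so the sketch does not assume that structures of different types must nest inside each other.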
