Annotating your corpus

To annotate a corpus means to add information (metadata) about the text. This information can relate to structures (documents, paragraphs, sentences etc.) or individual tokens.
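The two levels of annotation can be pictured as a small data model. The sketch below is illustrative only: the class names `Token` and `Structure`, the attribute names and the sample values are invented for this example and are not part of Sketch Engine.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Token-level annotation: each value belongs to exactly one token."""
    word: str
    tag: str    # part-of-speech tag
    lemma: str


@dataclass
class Structure:
    """Structure-level annotation: metadata shared by a whole span of tokens."""
    name: str                      # e.g. "doc", "p" or "s"
    attributes: dict[str, str]     # e.g. author, year, register
    tokens: list[Token] = field(default_factory=list)


# A hypothetical two-token document with invented metadata.
doc = Structure(
    name="doc",
    attributes={"author": "Jane Doe", "year": "2019"},
    tokens=[Token("Cats", "NNS", "cat"), Token("sleep", "VBP", "sleep")],
)
```

The metadata in `attributes` applies to every token inside the structure, whereas `tag` and `lemma` are stored once per token.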


Annotated segment
  Structures: text of any length, from a single token to the whole corpus
  Tokens: exactly one token

Used for
  Structures:
    year of publication
    source (website, book, newspaper)
    author name
    register (formal, informal)
    type of named entity (politician, actor…)
    and an endless list of other options
  Tokens:
    part-of-speech tags
    lemmas
    any other information that always relates to a single token rather than to several tokens

Automatic vs. manual
  Structures: manual, possibly aided by an external annotation editor such as Brat
  Tokens: automatic, using the taggers and lemmatizers that are part of Sketch Engine. Manual annotation is only necessary if automatic tools for the language are not part of Sketch Engine or (speaking highly hypothetically) if the automatically produced tags and lemmas need to be corrected manually.

Procedure
  Structures: metadata are added before the corpus is uploaded to Sketch Engine, or a user corpus in Sketch Engine can be downloaded, annotated externally and uploaded back to Sketch Engine
  Tokens: tagging and lemmatizing happen when content is uploaded to Sketch Engine or, in the case of corpora from the web, during processing
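
Adding structure metadata before upload can be sketched as follows. This is a minimal example that assumes metadata are written as attributes of an XML-like `<doc>` tag wrapped around each document; the function name `wrap_document`, the attribute names and the sample sentence are hypothetical, so check the attribute conventions your corpus actually needs before relying on it.

```python
from xml.sax.saxutils import quoteattr


def wrap_document(text: str, metadata: dict) -> str:
    """Wrap plain text in a <doc> structure carrying metadata as attributes."""
    # quoteattr adds the surrounding quotes and escapes special characters.
    attrs = " ".join(f"{key}={quoteattr(str(value))}" for key, value in metadata.items())
    # The body text is left untouched; tagging and lemmatizing are not done here.
    return f"<doc {attrs}>\n{text.strip()}\n</doc>\n"


# Invented sample document and metadata values.
annotated = wrap_document(
    "The minister resigned on Monday.",
    {"year": 2019, "source": "newspaper", "register": "formal"},
)
print(annotated)
```

Only structure-level metadata are added in this sketch. Token-level annotation (tags and lemmas) is left to the automatic tools that run when the content is uploaded.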