Preparing Corpus Text | Sketch Engine

If you want to create a corpus with your own part-of-speech tags and lemmas, you need to upload it to Sketch Engine in a special format called a vertical file. Using this format (described below), you can also upload corpus data processed by tools outside Sketch Engine and preserve their PoS tags or lemmatization.

The input format is “vertical” or “word-per-line (WPL)” text, as defined at the University of Stuttgart in the 1990s. Each line contains exactly one token: a word, a number or a punctuation mark. The file is plain text in a selected character encoding, without any formatting. For example, the sentence

Suddenly, however, their posture changed.

is written in vertical text as

Suddenly 
, 
however 
, 
their 
posture 
changed 
.
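
The conversion above can be sketched in a few lines of Python. This is only an illustration: the helper name to_vertical is hypothetical, and the regular expression is a naive tokenizer that treats every run of letters or digits as one token and every punctuation mark as its own token. Real tokenization is language-specific and usually needs a dedicated tool.

```python
import re


def to_vertical(text):
    """Return text as word-per-line (vertical) text using a naive tokenizer."""
    # \w+ matches a run of letters/digits; [^\w\s] matches a single
    # punctuation mark (anything that is neither a word char nor whitespace).
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return "\n".join(tokens)


if __name__ == "__main__":
    print(to_vertical("Suddenly, however, their posture changed."))
```

Note that this sketch does not record where spaces were absent in the original text, so the spacing before punctuation cannot be restored from its output alone.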

If the input text is part-of-speech-tagged and lemmatized, each line carries two additional tab-separated columns, one for the tag and one for the lemma, as here (showing tags from the Penn tagset):

Suddenly	RB	suddenly
<g/> 
,	,	, 
however	RB	however
<g/>
,	,	, 
their	PP$	their 
posture	NN	posture 
changed	VVD	change 
<g/>
.	SENT	.

The “glue” tag <g/> specifies that there should be no space between the token before it and the token after it, as between a word and the following punctuation (in Latin and other Western scripts).
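
To make the effect of the glue tag concrete, here is a minimal Python sketch that rebuilds running text from vertical lines. The function name detokenize is hypothetical, and the code makes simplifying assumptions: the word form is the first tab-separated column, and any other line beginning with "<" is a structural tag to be skipped (so a literal "<" token would be lost).

```python
def detokenize(vertical_lines):
    """Rebuild running text from vertical lines, honouring <g/> glue tags."""
    parts = []
    glue = True  # no space is wanted before the very first token
    for raw in vertical_lines:
        line = raw.strip()
        if not line:
            continue  # ignore empty lines
        if line == "<g/>":
            glue = True  # suppress the space before the next token
            continue
        if line.startswith("<"):
            continue  # assumption: any other "<..." line is a structure tag
        word = line.split("\t")[0].strip()  # first column is the word form
        if not glue:
            parts.append(" ")
        parts.append(word)
        glue = False
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        "their\tPP$\ttheir",
        "posture\tNN\tposture",
        "changed\tVVD\tchange",
        "<g/>",
        ".\tSENT\t.",
    ]
    print(detokenize(sample))
```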

Sometimes an attribute has multiple or disjunctive values, for example, if the POS-tagger was undecided between classifying a word as a noun (NN) or a lexical verb (VV), or if a word is associated with two grammatical relations. This can be encoded using a separator character specified in the corpus configuration file (attributes MULTIVALUE and MULTISEP; see Corpus Configuration File: Overview), here “;”:

brush	NN;VV	brush
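
When preparing or checking such a file with your own scripts, a multi-valued column can be split on the separator. The sketch below is illustrative only: parse_token_line is a hypothetical helper, it assumes exactly three tab-separated columns in the order word, tag, lemma, and it takes the separator as a parameter rather than reading MULTISEP from a configuration file.

```python
def parse_token_line(line, multisep=";"):
    """Split one three-column vertical line; the tag column may be multi-valued."""
    # Assumption: columns are word, tag, lemma, separated by single tabs.
    word, tag, lemma = line.rstrip("\n").split("\t")
    return {"word": word, "tags": tag.split(multisep), "lemma": lemma}


if __name__ == "__main__":
    print(parse_token_line("brush\tNN;VV\tbrush"))
```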

XML tags are used for structural annotation including document, sentence or paragraph boundaries, headlines etc. and can have associated attribute-value pairs. For example:

<doc id="G10" n="32"> 
<head type="min"> 
FEDERAL 
CONSTITUTION 
<g/> 
, 
1789 
</head> 
<p n="1"> 
" 
<g/> 
we 
the  
People
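
Structural lines in an excerpt like the one above can be told apart from token lines with a small classifier. The following Python sketch is an assumption-laden illustration, not Sketch Engine's own parser: the function classify and both regular expressions are hypothetical, and they only accept tags whose attribute values are written in double quotes.

```python
import re

# Matches <name attr="v" ...>, </name> and <name/> when alone on a line.
TAG_RE = re.compile(r'^<(/?)([A-Za-z_][\w-]*)((?:\s+[\w-]+="[^"]*")*)\s*(/?)>$')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def classify(line):
    """Classify one vertical line as an open, close or empty tag, or a token."""
    line = line.strip()
    m = TAG_RE.match(line)
    if m is None:
        # Anything that is not a well-formed tag is treated as a token line.
        return ("token", line.split("\t"))
    closing, name, attrs, selfclosing = m.groups()
    if closing:
        return ("close", name)
    kind = "empty" if selfclosing else "open"
    return (kind, name, dict(ATTR_RE.findall(attrs)))


if __name__ == "__main__":
    for example in ['<doc id="G10" n="32">', "FEDERAL", "<g/>", "</head>"]:
        print(classify(example))
```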

There can be any number of attributes associated with words. While the ‘standard’ ones are lemma and POS-tag, the framework can also be used for stating thesaurus category, grammatical function, and a number of other varieties of markup. Sometimes this markup is most suitably associated with a word, and sometimes with a structure such as a phrase, sentence or paragraph. (The choice of encoding affects which searches can easily be made.) For the special case of text type or ‘header’ information, see Text Types, Headers and Subcorpora.

Annotation tool

The built-in annotation tool allows adding metadata to documents easily.

Corpus annotation and structures

Read our blog post about corpus annotation and structures in corpora.
