(version 1.1)

Tagset

MULTEXT-East Morphosyntactic Specifications, Version 4

Structures

doc – document
p – paragraph
s – sentence

The attributes of the “doc” structure are the following:

url – URL of the document
title – title of the document
domain – the domain derived from the URL
lexicon_coverage – the percentage of tokens found in the Croatian Morphological Lexicon (a very rough estimate of the "standardness" of the document content)
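
As a rough illustration of how these attributes might be read, the sketch below assumes that the vertical file serializes each document as an opening tag of the form <doc url="..." title="..." domain="..." lexicon_coverage="...">, which is the usual convention for vertical files but is not confirmed here. The helper names parse_doc_tag and lexicon_coverage are hypothetical, the sample values are made up, and the coverage function is only one plausible way to compute such a percentage (it lowercases tokens and returns a value on a 0 to 100 scale, both of which are assumptions).

```python
import re

# Assumption: attributes appear as key="value" pairs inside the opening <doc ...> tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_doc_tag(line):
    """Return the attributes of an opening <doc ...> tag as a dict, or None."""
    line = line.strip()
    if not (line.startswith("<doc ") and line.endswith(">")):
        return None
    return dict(ATTR_RE.findall(line))


def lexicon_coverage(tokens, lexicon):
    """Percentage of tokens whose lowercased form is in the lexicon.

    Hypothetical helper: the corpus's actual matching rules are not documented here.
    """
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in lexicon)
    return 100.0 * known / len(tokens)


# Made-up sample header line, for illustration only.
sample = ('<doc url="http://example.hr/a" title="Primjer" '
          'domain="example.hr" lexicon_coverage="87.5">')
attrs = parse_doc_tag(sample)
```

With a parser like this, documents could be filtered on float(attrs["lexicon_coverage"]) to keep only those above a chosen threshold, for example when more standard text is wanted.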

Stats

  • vertical file: 25G (4.7G compressed)