Slovak TenTen corpus.

The corpus has been tagged by the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences. Information about the tagging, including the tagset reference, can be found here (in Slovak).

Apart from the standard word, tag, and lemma attributes, the corpus also contains an extra attribute called amblevel, an integer indicating the level of ambiguity of each word form. Its value is the number of possible POS tags for the given word form (from which the disambiguator selected one).
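To make the definition of amblevel concrete, here is a minimal Python sketch. It is only an illustration, not the tagger's actual implementation: the lexicon `CANDIDATE_TAGS`, the word forms, the tag names, and the helper functions are all hypothetical, and returning 0 for an unknown form is an assumption made for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lexicon: word form -> set of POS tags the form could carry.
# The forms and tag names are placeholders, NOT taken from the real tagset.
CANDIDATE_TAGS: Dict[str, Set[str]] = {
    "form_a": {"TAG1"},
    "form_b": {"TAG1", "TAG2"},
    "form_c": {"TAG1", "TAG2", "TAG3"},
}


def amblevel(word_form: str, lexicon: Dict[str, Set[str]]) -> int:
    """Return the number of possible POS tags for a word form.

    Assumption for this sketch: a form missing from the lexicon gets 0.
    """
    return len(lexicon.get(word_form, set()))


def annotate(tokens: List[str], lexicon: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Pair each token with its ambiguity level."""
    return [(token, amblevel(token, lexicon)) for token in tokens]


def ambiguous_share(tokens: List[str], lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of tokens whose form has more than one possible tag."""
    if not tokens:
        return 0.0
    ambiguous = sum(1 for token in tokens if amblevel(token, lexicon) > 1)
    return ambiguous / len(tokens)


if __name__ == "__main__":
    sample = ["form_a", "form_b", "form_c", "form_b"]
    print(annotate(sample, CANDIDATE_TAGS))
    print(ambiguous_share(sample, CANDIDATE_TAGS))
```

Under this reading, a value of 1 means the word form is unambiguous, while higher values mark forms for which the disambiguator had to choose among several candidate tags.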

Word sketches have been prepared by Vladimír Benko.


v1.0 (13 September 2011)

  • initial version – 876 million tokens