Slovak TenTen corpus.
In addition to the standard word, tag and lemma attributes, the corpus contains an extra attribute called amblevel, an integer indicating the level of ambiguity of each word form. Its value is the number of possible POS tags for the given word form, from which the disambiguator selected one.
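As a rough illustration of how amblevel might be used, the sketch below reads tokens from a tab-separated vertical file and keeps those with more than one possible tag. The column order (word, tag, lemma, amblevel), the sample lines, the placeholder tags and the helper names are all assumptions made for this example, not details taken from the corpus documentation.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str
    amblevel: int  # number of candidate POS tags for the word form


def parse_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from vertical-format lines.

    Assumes one token per line with tab-separated columns in the order
    word, tag, lemma, amblevel. Lines starting with "<" are treated as
    structure markup and skipped (a simplification for this sketch).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a complete token line
        word, tag, lemma, amblevel = fields[:4]
        yield Token(word, tag, lemma, int(amblevel))


def ambiguous_tokens(tokens: Iterable[Token], min_level: int = 2) -> List[Token]:
    """Return tokens whose word form had at least min_level candidate tags."""
    return [t for t in tokens if t.amblevel >= min_level]


# Hypothetical sample: tags are placeholders and amblevel values are invented.
SAMPLE = [
    "<s>",
    "stav\tTAG_A\tstav\t3",
    "je\tTAG_B\tbyť\t1",
    "dobrý\tTAG_C\tdobrý\t2",
    "</s>",
]

if __name__ == "__main__":
    for token in ambiguous_tokens(parse_vertical(SAMPLE)):
        print(token.word, token.amblevel)
```

An amblevel of 1 means the word form had only one possible tag, so filtering on values of 2 or more selects exactly the tokens where the disambiguator had to make a choice.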
Word sketches have been prepared by Vladimír Benko.
v1.0 (13 September 2011)
- initial version – 876 million tokens