This corpus was created using the Corpus Factory method in 2014 and is encoded in UTF-8. It contains 48.6 million words and is lemmatised and tagged.
Vertical provided by Andrius Utka. Tagset documentation follows:
No.  Feature group   Category                           Tag code
1    Part of Speech  Noun                               N
                     Adjective                          A
                     Numeral                            M
                     Pronoun                            P
                     Verb                               V
                     Adverb                             R
                     Interjection                       I
                     Onomatopoeia                       O
                     Particle                           Q
                     Preposition                        S
                     Conjunction                        C
                     Acronym                            Z
                     Abbreviation                       Y
                     Roman numbers                      U
                     Residual                           X
                     Stable phrases                     H
                     Punctuation mark, symbols          T
                     HTML tag                           t
2    Noun types      proper                             p
                     common                             c
3    Verb            main                               m
                     infinitive                         n
                     participle                         p
                     adverbial participle               a
                     half participle                    h
                     adverbial participle2              b
                     indicative mood                    i
                     imperative mood                    m
                     subjunctive mood                   s
4    Numerals        cardinal                           c
                     ordinal                            o
                     multiple                           m
                     collective                         l
5    Definiteness    pronominal                         p
                     non-pronominal                     n
6    Reflexiveness   reflexive                          r
                     non-reflexive                      n
7    Type            active                             a
                     passive                            p
                     necessity                          n
8    Tense           present tense                      p
                     past tense                         a
                     past frequentative case            q
                     future tense                       f
                     simple past                        s
9    Degree          positive                           p
                     comparative                        c
                     superlative                        s
10   Gender          feminine                           f
                     masculine                          m
                     neuter                             n
                     common                             c
11   Number          singular                           s
                     plural                             p
                     dual                               d
12   Case            nominative                         n
                     genitive                           g
                     dative                             d
                     accusative                         a
                     instrumental                       i
                     locative                           l
                     vocative                           v
                     illative                           x
13   Person          1st                                1
                     2nd                                2
                     3rd                                3
14   Positiveness    positive                           p
                     negative                           n
15   Phrases         stable phrases with undefined POS  H
16   Unknown         foreign                            f
                     typos                              t
                     segmentation error                 p
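
The part-of-speech codes in feature group 1 can be held in a small lookup table. The following is a minimal Python sketch, not part of the corpus tooling: the names POS_CODES and pos_name are hypothetical, and it assumes that a tag begins with its part-of-speech code, which the tagset documentation does not state. The codes are case-sensitive, since T (punctuation mark, symbols) and t (HTML tag) differ only in case.

```python
from typing import Optional

# Part-of-speech codes copied from feature group 1 of the tagset documentation.
# Codes are case-sensitive: "T" and "t" denote different categories.
POS_CODES = {
    "N": "Noun",
    "A": "Adjective",
    "M": "Numeral",
    "P": "Pronoun",
    "V": "Verb",
    "R": "Adverb",
    "I": "Interjection",
    "O": "Onomatopoeia",
    "Q": "Particle",
    "S": "Preposition",
    "C": "Conjunction",
    "Z": "Acronym",
    "Y": "Abbreviation",
    "U": "Roman numbers",
    "X": "Residual",
    "H": "Stable phrases",
    "T": "Punctuation mark, symbols",
    "t": "HTML tag",
}


def pos_name(tag: str) -> Optional[str]:
    """Return the part-of-speech name for a tag, or None if the code is not listed.

    Assumption (not confirmed by the documentation): the first character
    of the tag is the part-of-speech code.
    """
    if not tag:
        return None
    return POS_CODES.get(tag[0])


if __name__ == "__main__":
    for code in ("N", "T", "t", "?"):
        print(code, pos_name(code))
```

The remaining feature groups reuse the same letters with different meanings (for example, p is "proper" for nouns but "participle" for verbs), so any fuller decoder would need to know which feature each position of a tag encodes; that layout is not given here.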