(version 2)

This corpus was created using the Corpus Factory method in 2014 and is encoded in UTF-8. It has 48.6 million words and includes lemmatisation and tags.

The vertical file was provided by Andrius Utka. The tagset documentation follows:

No.	Feature group	Category	Tag codes
1	Part of Speech	Noun	N
		Adjective	A
		Numeral	M
		Pronoun	P
		Verb	V
		Adverb	R
		Interjection	I
		Onomatopoeia	O
		Particle	Q
		Preposition	S
		Conjunction	C
		Acronym	Z
		Abbreviation	Y
		Roman numerals	U
		Residual	X
		Stable phrases	H
		Punctuation mark, symbols	T
		HTML tag	t
2	Noun types	proper	p
		common	c
3	Verb	main	m
		infinitive	n
		participle	p
		adverbial participle	a
		half participle	h
		adverbial participle2	b
		indicative mood	i
		imperative mood	m
		subjunctive mood	s
4	Numerals	cardinal	c
		ordinal	o
		multiple	m
		collective	l
5	Definiteness	pronominal	p
		non-pronominal	n
6	Reflexiveness	reflexive	r
		non-reflexive	n
7	Type	active	a
		passive	p
		necessity	n
8	Tense	present tense	p
		past tense	a
		past frequentative tense	q
		future tense	f
		simple past	s
9	Degree	positive	p
		comparative	c
		superlative	s
10	Gender	feminine	f
		masculine	m
		neuter	n
		common	c
11	Number	singular	s
		plural	p
		dual	d
12	Case	nominative	n
		genitive	g
		dative	d
		accusative	a
		instrumental	i
		locative	l
		vocative	v
		illative	x
13	Person	1st	1
		2nd	2
		3rd	3
14	Positiveness	positive	p
		negative	n
15	Phrases	stable phrases with undefined POS	H
16	Unknown	foreign	f
		typos	t
		segmentation error	p
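
The part-of-speech codes in feature group 1 can be used as a simple lookup table when processing the corpus. The sketch below is only an illustration: it assumes that the first character of a tag is the part-of-speech code, which the table suggests but does not state, and the names POS_CODES and pos_of are hypothetical. The remaining feature groups are not decoded, because their position within a tag is not documented here.

```python
# Part-of-speech codes copied from feature group 1 of the tagset table.
POS_CODES = {
    "N": "Noun",
    "A": "Adjective",
    "M": "Numeral",
    "P": "Pronoun",
    "V": "Verb",
    "R": "Adverb",
    "I": "Interjection",
    "O": "Onomatopoeia",
    "Q": "Particle",
    "S": "Preposition",
    "C": "Conjunction",
    "Z": "Acronym",
    "Y": "Abbreviation",
    "U": "Roman numerals",
    "X": "Residual",
    "H": "Stable phrases",
    "T": "Punctuation mark, symbols",
    "t": "HTML tag",  # lower case, distinct from "T"
}


def pos_of(tag):
    """Return the part-of-speech name for a tag, or None if it is unknown.

    ASSUMPTION: the first character of the tag is the part-of-speech code.
    """
    if not tag:
        return None
    return POS_CODES.get(tag[0])
```

For example, pos_of("V") returns "Verb", and a tag that starts with an unlisted character returns None. The lookup is case-sensitive because "T" and "t" are different codes.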