The Polish TenTen web corpus was crawled by the web spider SpiderLing in June 2012. It contains more than 22 million documents and almost 7.8 billion words in total. The corpus is encoded in UTF-8, cleaned, deduplicated with the Onion tool, and lemmatised and tagged by WCRFT (Wrocław CRF Tagger) using the NKJP tagset (the tagset of Narodowy Korpus Języka Polskiego, the National Corpus of Polish).

Tags legend
adjective adj.*
adverb adv.*
conjunction conj.*|comp.*
noun subst.*
preposition prep.*
pronoun ppron.*|siebie.*
verb fin.*|bedzie.*|aglt.*|praet.*|impt.*|imps.*|inf.*|pcon.*|pant.*|ger.*|pact.*|ppas.*
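
The legend above can be read as a lookup from a full tag to a coarse word class. A minimal sketch in Python follows; it assumes each pattern must match the whole tag and that tags are written with colon-separated values (for example `subst:sg:nom:m1`). The names `LEGEND` and `coarse_category` are illustrative only and are not part of the corpus tooling.

```python
import re

# Coarse word classes and their tag patterns, copied from the legend.
LEGEND = {
    "adjective": r"adj.*",
    "adverb": r"adv.*",
    "conjunction": r"conj.*|comp.*",
    "noun": r"subst.*",
    "preposition": r"prep.*",
    "pronoun": r"ppron.*|siebie.*",
    "verb": (r"fin.*|bedzie.*|aglt.*|praet.*|impt.*|imps.*|inf.*"
             r"|pcon.*|pant.*|ger.*|pact.*|ppas.*"),
}

COMPILED = {name: re.compile(pattern) for name, pattern in LEGEND.items()}


def coarse_category(tag):
    """Return the legend category whose pattern matches the whole tag.

    Returns None for tags the legend does not cover (e.g. interp, qub).
    """
    for name, pattern in COMPILED.items():
        if pattern.fullmatch(tag):
            return name
    return None
```

Under these assumptions `coarse_category("subst:sg:nom:m1")` returns `"noun"`, while a tag outside the legend, such as `interp`, returns `None`.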

Tagset

The following tagset is published in the article:

PRZEPIÓRKOWSKI, Adam. A comparison of two morphosyntactic tagsets of Polish. In: Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop. Warsaw, 2009. pp. 138–144.

Attributes

number = sg pl
case = nom gen dat acc inst loc voc
gender = m1 m2 m3 f n
person = pri sec ter
degree = pos com sup
aspect = imperf perf
negation = aff neg
accommodability = congr rec
accentability = akc nakc
post-prepositionality = npraep praep
agglutination = agl nagl
vocalicity = nwok wok
fullstoppedness = pun npun

Part of speech

adja =
adjp =
adjc =
conj =
comp =
interp =
pred =
xxx =
adv = [degree]
imps = aspect
inf = aspect
pant = aspect
pcon = aspect
qub = [vocalicity]
prep = case [vocalicity]
siebie = case
subst = number case gender
depr = number case gender
ger = number case gender aspect negation
ppron12 = number case gender person [accentability]
ppron3 = number case gender person [accentability] [post-prepositionality]
num = number case gender accommodability
numcol = number case gender accommodability
adj = number case gender degree
pact = number case gender aspect negation
ppas = number case gender aspect negation
winien = number gender aspect
praet = number gender aspect [agglutination]
bedzie = number person aspect
fin = number person aspect
impt = number person aspect
aglt = number person aspect vocalicity
brev = fullstoppedness
burk =
interj =
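
As an illustration of how the two lists above fit together, the following Python sketch splits a tag into its class and attribute values. It assumes tags are colon-separated with values in the listed order, and that bracketed attributes may be omitted. The function names are hypothetical, and the `ign` class described just after this list is included so that such tags also parse.

```python
ATTRIBUTE_TEXT = """
number = sg pl
case = nom gen dat acc inst loc voc
gender = m1 m2 m3 f n
person = pri sec ter
degree = pos com sup
aspect = imperf perf
negation = aff neg
accommodability = congr rec
accentability = akc nakc
post-prepositionality = npraep praep
agglutination = agl nagl
vocalicity = nwok wok
fullstoppedness = pun npun
"""

CLASS_TEXT = """
adja =
adjp =
adjc =
conj =
comp =
interp =
pred =
xxx =
adv = [degree]
imps = aspect
inf = aspect
pant = aspect
pcon = aspect
qub = [vocalicity]
prep = case [vocalicity]
siebie = case
subst = number case gender
depr = number case gender
ger = number case gender aspect negation
ppron12 = number case gender person [accentability]
ppron3 = number case gender person [accentability] [post-prepositionality]
num = number case gender accommodability
numcol = number case gender accommodability
adj = number case gender degree
pact = number case gender aspect negation
ppas = number case gender aspect negation
winien = number gender aspect
praet = number gender aspect [agglutination]
bedzie = number person aspect
fin = number person aspect
impt = number person aspect
aglt = number person aspect vocalicity
brev = fullstoppedness
burk =
interj =
ign =
"""


def read_table(text):
    """Turn 'name = a b c' lines into {name: ['a', 'b', 'c']}."""
    table = {}
    for line in text.strip().splitlines():
        name, _, rest = line.partition("=")
        table[name.strip()] = rest.split()
    return table


ATTRIBUTES = {k: set(v) for k, v in read_table(ATTRIBUTE_TEXT).items()}
# Each class maps to (attribute name, is_optional) pairs in order.
CLASSES = {
    pos: [(slot.strip("[]"), slot.startswith("[")) for slot in slots]
    for pos, slots in read_table(CLASS_TEXT).items()
}


def parse_tag(tag, sep=":"):
    """Split a tag into a dict of its class and attribute values."""
    pos, *values = tag.split(sep)
    if pos not in CLASSES:
        raise ValueError(f"unknown class: {pos}")
    parsed = {"pos": pos}
    for name, optional in CLASSES[pos]:
        if values and values[0] in ATTRIBUTES[name]:
            parsed[name] = values.pop(0)
        elif not optional:
            raise ValueError(f"{tag}: missing or invalid {name}")
    if values:
        raise ValueError(f"{tag}: unexpected values {values}")
    return parsed
```

Because no value is shared between two attributes, an optional attribute can be skipped without ambiguity: the next value either belongs to it or it does not.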

## This class should not appear in the results of manual annotation:

ign =

## Non-defeasible constraints:
##
## siebie –> base = siebie
## siebie –> case IN gen dat acc inst loc
## pant –> aspect = perf
## pcon –> aspect = imperf
## pact –> aspect = imperf
## ger –> gender = n
## depr –> number = pl
## depr –> gender = m2
## depr –> case IN nom voc acc
## numcol –> gender IN n m1
## aglt –> aspect = imperf
## bedzie –> aspect = imperf
## impt –> number:person IN sg:sec pl:pri pl:sec
## prep –> case IN nom gen dat acc inst loc

## Defeasible constraints:
##
## ger –> number = sg
## num –> number = pl
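
The constraints above can be checked mechanically once a tag has been split into attributes. The Python sketch below is one possible reading, not the tagger's own logic: it assumes a tag is already available as a dictionary such as `{"pos": "depr", "number": "pl", "case": "nom", "gender": "m2"}`, reports broken non-defeasible constraints as errors and broken defeasible ones as warnings, and leaves out the `siebie` base-form constraint because it concerns the lemma rather than the tag. All names are hypothetical.

```python
# (class, attribute, allowed values), copied from the constraint list.
NON_DEFEASIBLE = [
    ("siebie", "case", {"gen", "dat", "acc", "inst", "loc"}),
    ("pant", "aspect", {"perf"}),
    ("pcon", "aspect", {"imperf"}),
    ("pact", "aspect", {"imperf"}),
    ("ger", "gender", {"n"}),
    ("depr", "number", {"pl"}),
    ("depr", "gender", {"m2"}),
    ("depr", "case", {"nom", "voc", "acc"}),
    ("numcol", "gender", {"n", "m1"}),
    ("aglt", "aspect", {"imperf"}),
    ("bedzie", "aspect", {"imperf"}),
    ("prep", "case", {"nom", "gen", "dat", "acc", "inst", "loc"}),
]

# impt restricts the number:person pair rather than a single attribute.
IMPT_NUMBER_PERSON = {("sg", "sec"), ("pl", "pri"), ("pl", "sec")}

DEFEASIBLE = [
    ("ger", "number", {"sg"}),
    ("num", "number", {"pl"}),
]


def violations(tag, rules):
    """List the rules that the parsed tag (a dict) breaks."""
    found = []
    for pos, attr, allowed in rules:
        if tag["pos"] == pos and attr in tag and tag[attr] not in allowed:
            found.append(f"{pos}: {attr} = {tag[attr]}")
    return found


def check(tag):
    """Return (errors, warnings) for a parsed tag."""
    errors = violations(tag, NON_DEFEASIBLE)
    if tag["pos"] == "impt":
        pair = (tag.get("number"), tag.get("person"))
        if pair not in IMPT_NUMBER_PERSON:
            errors.append(f"impt: number:person = {pair[0]}:{pair[1]}")
    warnings = violations(tag, DEFEASIBLE)
    return errors, warnings
```

For example, a `depr` tag with `number` set to `sg` yields one error, whereas a `ger` tag with `number` set to `pl` yields only a warning, since that constraint is defeasible.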

Changelog

v1.0 (23 July 2012)

  • initial version – 7.7 billion words, untagged

a sample for Cesar (25 October 2012)

v2 (1 July 2013)

  • the whole corpus tagged by WCRFT