The Polish web corpus is a web corpus from the TenTen corpus family. It was crawled by the SpiderLing web spider in June 2012 and contains more than 22 million documents and almost 7.8 billion words in total.

The corpus is encoded in UTF-8, cleaned and deduplicated with the Onion deduplication tool, and lemmatised and tagged by WCRFT (Wrocław CRF Tagger) using the NKJP tagset (the tagset of Narodowy Korpus Języka Polskiego, the National Corpus of Polish).

Part-of-speech tagset

A list of the part-of-speech tags used can be found on the Polish NKJP part-of-speech tagset page.

Changelog

v1.0 (23 July 2012)

  • initial version – 7.7 billion words, untagged

a sample for Cesar (25 October 2012)

v2 (1 July 2013)

  • the whole corpus tagged by the WCRFT tagger