The Czech Web Corpus (czTenTen) is a language corpus made up of texts collected from the Internet. It belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

The data was crawled by the SpiderLing web spider in spring 2012 and comprises almost 5 billion words.

Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.

Part-of-speech tagset

The czTenTen corpus was POS annotated by the Majka tool using the following POS tagset.

Tools to work with the Czech Web corpus

A complete set of Sketch Engine tools is available to work with this Czech corpus to generate:

  • word sketch – Czech collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Czech nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context


version 9 (2017-05-20)

  • added lempos (a combination of lemma and one-letter abbreviation of the part of speech, e.g. dům-n)
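
The lempos format described above can be illustrated with a minimal sketch. The function names `make_lempos` and `split_lempos` are hypothetical and not part of Sketch Engine; the only thing taken from the description is the pattern lemma, hyphen, one-letter part-of-speech abbreviation (as in dům-n).

```python
from typing import Tuple


def make_lempos(lemma: str, pos: str) -> str:
    """Join a lemma and a one-letter POS abbreviation, e.g. ("dům", "n") -> "dům-n"."""
    if len(pos) != 1:
        raise ValueError("the POS abbreviation is expected to be a single letter")
    return f"{lemma}-{pos}"


def split_lempos(lempos: str) -> Tuple[str, str]:
    """Split on the LAST hyphen, so lemmas that contain hyphens stay intact."""
    lemma, _, pos = lempos.rpartition("-")
    return lemma, pos


print(make_lempos("dům", "n"))
print(split_lempos("česko-slovenský-a"))
```

Splitting on the last hyphen rather than the first is a design choice of this sketch: it keeps hyphenated lemmas whole.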

version 8 (2014-09-17)

  • M ? j removed

Thanks to Marek Grác for spotting many errors and contributing to a cleaner corpus.

version 7 (2014-08-04)

  • Paragraphs without accents removed.
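
A filter of this kind might look roughly like the sketch below. The exact criterion used for czTenTen is not documented here, so this assumes that "without accents" means the paragraph contains no Czech letter with a diacritic at all. The names `has_czech_accents` and `drop_unaccented` are hypothetical.

```python
import unicodedata

# Czech letters that carry a diacritic (lowercase and uppercase).
CZECH_ACCENTED = set("áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ")


def has_czech_accents(paragraph: str) -> bool:
    """Return True if the paragraph contains at least one accented Czech letter."""
    # Normalize to NFC so that base letter + combining mark becomes one character.
    normalized = unicodedata.normalize("NFC", paragraph)
    return any(ch in CZECH_ACCENTED for ch in normalized)


def drop_unaccented(paragraphs):
    """Keep only paragraphs that contain accented Czech letters (assumed criterion)."""
    return [p for p in paragraphs if has_czech_accents(p)]


sample = ["Příliš žluťoučký kůň úpěl ďábelské ódy.", "Prilis zlutoucky kun upel dabelske ody."]
print(drop_unaccented(sample))
```

Very short Czech paragraphs can legitimately contain no accented letters, so a production filter would probably also take paragraph length into account; that part is left out of the sketch.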

version 6 (June 2014)

  • Machine translated documents from domains and removed.

version 5 (May 2014)

  • Malformed vertical lines corrected (MacLeodovy MacL eodůvk2eAgFnPc1d1 –> MacLeodovy MacLeodův k2eAgFnPc1d1).

version 4 “clean 2” (March 2014)

  • Documents containing a specific incorrect character, a symptom of wrong encoding detection, were removed.

version 3 “clean” (2013)

  • Paragraphs containing more than 20% of words not recognized by the morphological analyzer Majka were removed.
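
The 20% rule can be sketched as follows. This is only an illustration of the threshold logic: `is_recognized` is a hypothetical callback standing in for a lookup in a morphological analyzer and is not Majka's actual interface, and the toy lexicon is invented for the example.

```python
from typing import Callable, Sequence


def unrecognized_ratio(words: Sequence[str], is_recognized: Callable[[str], bool]) -> float:
    """Share of words in a paragraph that the analyzer does not recognize."""
    if not words:
        return 0.0
    unknown = sum(1 for w in words if not is_recognized(w))
    return unknown / len(words)


def keep_paragraph(words: Sequence[str],
                   is_recognized: Callable[[str], bool],
                   threshold: float = 0.20) -> bool:
    """Keep the paragraph unless MORE than `threshold` of its words are unrecognized."""
    return unrecognized_ratio(words, is_recognized) <= threshold


# Toy lexicon used in place of a real analyzer (assumption for this example).
lexicon = {"pes", "a", "kočka", "spí", "doma", "v", "noci", "na", "zahradě", "dnes"}
known = lexicon.__contains__

mostly_known = ["pes", "a", "kočka", "spí", "doma", "v", "noci", "na", "zahradě", "xqzt"]
mostly_noise = ["pes", "xqzt", "kočka", "wvvb", "doma", "v", "qqq", "na", "zahradě", "dnes"]

print(keep_paragraph(mostly_known, known))
print(keep_paragraph(mostly_noise, known))
```

The first paragraph has 1 of 10 words unrecognized and is kept; the second has 3 of 10 and would be removed.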

version 2 (December 2012)

version 1 (September 2012)

  • tagged by Majka + Desamb

version 1 untagged (April 2012)

  • initial version – 4.8 billion words


