czTenTen: Corpus of the Czech Web

The Czech Web Corpus (czTenTen) is a language corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

The data was crawled by the SpiderLing web spider in spring 2012 and comprises almost 5 billion words.

Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.

Part-of-speech tagset

The czTenTen corpus was POS annotated by the Majka tool using the following POS tagset.

Tools to work with the Czech Web corpus

A complete set of Sketch Engine tools is available to work with this Czech corpus to generate:

  • word sketch – Czech collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Czech nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context

Changelog

version 9 (2017-05-20)

  • added lempos (a combination of lemma and one-letter abbreviation of the part of speech, e.g. dům-n)
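
The lempos format described above can be illustrated with a minimal Python sketch. The function names `make_lempos` and `split_lempos` are hypothetical and are not part of any Sketch Engine API. The sketch assumes the part-of-speech abbreviation is always the part after the last hyphen, so lemmas that themselves contain hyphens are still split correctly.

```python
def make_lempos(lemma: str, pos_letter: str) -> str:
    """Join a lemma and a one-letter POS abbreviation, e.g. 'dům' + 'n' -> 'dům-n'."""
    return f"{lemma}-{pos_letter}"


def split_lempos(lempos: str) -> tuple[str, str]:
    """Split a lempos string at its LAST hyphen into (lemma, pos_letter).

    Assumption: the input always contains at least one hyphen.
    """
    lemma, _, pos_letter = lempos.rpartition("-")
    return lemma, pos_letter


print(make_lempos("dům", "n"))
print(split_lempos("e-mail-n"))
```

Splitting at the last hyphen rather than the first keeps a hyphenated lemma such as "e-mail" intact.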

version 8 (2014-09-17)

  • M ? j removed

Thanks to Marek Grác for spotting many errors and contributing to a cleaner corpus.

version 7 (2014-08-04)

  • Paragraphs without accents removed.
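
The source does not state the exact criterion used to detect paragraphs without accents. As a rough sketch only, one could assume that a paragraph counts as accented if it contains at least one Czech letter with a diacritic. The names `CZECH_ACCENTED`, `has_accents` and `drop_unaccented` are hypothetical.

```python
# Czech letters that carry diacritics (lowercase; input is lowercased before the check).
CZECH_ACCENTED = set("áčďéěíňóřšťúůýž")


def has_accents(paragraph: str) -> bool:
    """Return True if the paragraph contains at least one accented Czech letter."""
    return any(ch in CZECH_ACCENTED for ch in paragraph.lower())


def drop_unaccented(paragraphs: list[str]) -> list[str]:
    """Keep only paragraphs with at least one accented letter (assumed criterion)."""
    return [p for p in paragraphs if has_accents(p)]


sample = ["Prilis zlutoucky kun", "Příliš žluťoučký kůň"]
print(drop_unaccented(sample))
```

Under this assumption, a very short paragraph that legitimately has no accented letters would also be dropped, so a real filter might apply a minimum length before testing.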

version 6 (June 2014)

  • Machine-translated documents from the domains infostar.cz and navajo.cz removed.

version 5 (May 2014)

  • Malformed vertical lines corrected (MacLeodovy MacL eodůvk2eAgFnPc1d1 –> MacLeodovy MacLeodův k2eAgFnPc1d1).

version 4 “clean 2” (March 2014)

  • Documents containing a particular incorrect character caused by faulty encoding detection were removed.

version 3 “clean” (2013)

  • Paragraphs containing more than 20 % of words not recognized by the morphological analyzer Majka were removed.
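
The 20 % rule can be sketched as a simple ratio check. This is an illustration under stated assumptions, not the actual cleaning script: the `is_recognized` callable stands in for a lookup in the Majka analyzer, whose real interface is not shown here, and the function names are hypothetical.

```python
from typing import Callable, Iterable


def unrecognized_ratio(words: Iterable[str], is_recognized: Callable[[str], bool]) -> float:
    """Fraction of words the (assumed) analyzer lookup does not recognize."""
    words = list(words)
    if not words:
        return 0.0
    unknown = sum(1 for w in words if not is_recognized(w))
    return unknown / len(words)


def keep_paragraph(words: Iterable[str],
                   is_recognized: Callable[[str], bool],
                   threshold: float = 0.20) -> bool:
    """Keep the paragraph unless MORE than `threshold` of its words are unrecognized."""
    return unrecognized_ratio(words, is_recognized) <= threshold


# Toy stand-in for the analyzer: a small set of known word forms.
known = {"pes", "a", "kočka", "jsou", "doma"}
print(keep_paragraph(["pes", "a", "kočka", "jsou", "xyzq"], known.__contains__))
print(keep_paragraph(["pes", "qwrt", "kočka", "jsou", "xyzq"], known.__contains__))
```

A paragraph with exactly 20 % unrecognized words is kept, since the changelog entry removes only paragraphs with more than 20 %.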

version 2 (December 2012)

version 1 (September 2012)

  • tagged by Majka + Desamb

version 1 untagged (April 2012)

  • initial version – 4.8 G words

Bibliography

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

czTenTen corpus

Suchomel, V. (2012). Recent Czech Web Corpora. In 6th Workshop on Recent Advances in Slavonic Natural Language Processing (pp. 77-83). Brno.

Search the Czech corpus

Sketch Engine offers a range of tools to work with the czTenTen corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 400+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in contexts, n-grams or extracting terms is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.