A Czech web corpus of the TenTen family, crawled by SpiderLing in 2011 and Heritrix in 2010. Encoded in UTF-8, cleaned, and deduplicated. Tagged by Majka + Desamb in 2012.

Changelog

v. 1 untagged (April 2012)

  • initial version – 4.8 G words

v. 1 (September 2012)

  • tagged by Majka + Desamb

v. 2 (December 2012)

v. 3 “clean” (2013)

  • Paragraphs in which more than 20 % of the words were not recognized by the morphological analyser Majka were removed.
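
A filter of this kind can be sketched as follows. This is only an illustration, not the script used to build the corpus: the `is_recognized` callback is a hypothetical stand-in for a Majka lookup, the toy lexicon is invented for the example, and the input is assumed to be already tokenized into words.

```python
from typing import Callable, List

def unrecognized_ratio(words: List[str], is_recognized: Callable[[str], bool]) -> float:
    """Return the share of words the analyser does not recognize."""
    if not words:
        return 0.0  # an empty paragraph has nothing unrecognized
    unknown = sum(1 for word in words if not is_recognized(word))
    return unknown / len(words)

def keep_paragraph(words: List[str],
                   is_recognized: Callable[[str], bool],
                   threshold: float = 0.20) -> bool:
    """Keep a paragraph unless MORE than `threshold` of its words are unknown."""
    return unrecognized_ratio(words, is_recognized) <= threshold

# Hypothetical toy lexicon standing in for the morphological analyser.
TOY_LEXICON = {"pes", "a", "kočka", "jsou", "doma"}

def toy_is_recognized(word: str) -> bool:
    return word.lower() in TOY_LEXICON

mostly_known = ["pes", "a", "kočka", "jsou", "xyzq"]   # 1 of 5 unknown
mostly_noise = ["pes", "qwrt", "kočka", "zzzz", "doma"]  # 2 of 5 unknown
kept = [p for p in (mostly_known, mostly_noise) if keep_paragraph(p, toy_is_recognized)]
```

With the toy lexicon, the first paragraph sits exactly at the 20 % limit and is kept, while the second exceeds it and is dropped.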

v. 4 “clean 2” (March 2014)

  • Documents containing a particular erroneous character, produced by incorrect encoding detection, were removed.

v. 5 (May 2014)

  • Malformed vertical lines corrected (MacLeodovy MacL eodůvk2eAgFnPc1d1 –> MacLeodovy MacLeodův k2eAgFnPc1d1).

v. 6 (June 2014)

  • Machine-translated documents from the domains infostar.cz and navajo.cz were removed.

v. 7 (2014-08-04)

  • Paragraphs containing no accented characters were removed.
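
One possible way to detect such paragraphs is sketched below. It assumes that "without accents" means the paragraph contains no character carrying a diacritic at all; the exact criterion used for the corpus is not stated here, and the function names are hypothetical.

```python
import unicodedata
from typing import List

def has_accents(text: str) -> bool:
    """Return True if any character in `text` carries a diacritic.

    NFD normalization splits a letter such as "ů" into a base letter plus a
    combining mark, and unicodedata.combining() is non-zero for such marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)

def drop_unaccented(paragraphs: List[str]) -> List[str]:
    """Keep only paragraphs that contain at least one accented character."""
    return [p for p in paragraphs if has_accents(p)]

sample = ["Příliš žluťoučký kůň", "Prilis zlutoucky kun"]
cleaned = drop_unaccented(sample)
```

Here only the first sample paragraph survives. A real filter would probably also need to consider paragraph length, since a very short Czech paragraph can legitimately contain no diacritics.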

v. 8 (2014-09-17)

  • M ? j removed

Thanks to Marek Grác for spotting many errors and contributing to a cleaner corpus.


Bibliography

Suchomel, Vít (2012). Recent Czech Web Corpora. In 6th Workshop on Recent Advances in Slavonic Natural Language Processing. Brno, pp. 77–83.