v. 1 untagged (April 2012)
- initial version – 4.8 G words
v. 1 (September 2012)
- tagged by Majka + Desamb
v. 2 (December 2012)
- retagged and corrected; the updated tagset is described in Miloš Jakubíček, Vojtěch Kovář and Pavel Šmerk. Czech Morphological Tagset Revisited. In 5th Workshop on Recent Advances in Slavonic Natural Language Processing. Brno, 2011, pp. 29–42.
- word sketches
v. 3 “clean” (2013)
- Paragraphs in which more than 20 % of words were not recognized by the morphological analyser Majka were removed.
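The 20 % rule above can be illustrated with a minimal sketch. It is not the code used to build the corpus: the names `is_known`, `should_remove` and `TOY_LEXICON` are hypothetical, and a small word set stands in for a real call to Majka, whose interface is not described here.

```python
# Toy stand-in for the morphological analyser: a word counts as
# "recognized" if it is in this set. (Assumption: the real check asks Majka.)
TOY_LEXICON = {"pes", "kočka", "běží", "rychle", "a", "dům"}


def is_known(word):
    """Return True if the toy lexicon recognizes the word (case-insensitive)."""
    return word.lower() in TOY_LEXICON


def should_remove(tokens, threshold_percent=20):
    """Return True if more than threshold_percent of tokens are unrecognized.

    Integer arithmetic avoids floating-point edge cases at exactly 20 %.
    An empty paragraph is never removed by this rule.
    """
    if not tokens:
        return False
    unknown = sum(1 for token in tokens if not is_known(token))
    return unknown * 100 > threshold_percent * len(tokens)


def filter_paragraphs(paragraphs):
    """Keep only paragraphs (lists of tokens) that pass the rule."""
    return [p for p in paragraphs if not should_remove(p)]
```

A paragraph with exactly 20 % unrecognized words is kept in this sketch, because the rule removes only paragraphs with more than 20 %.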
v. 4 “clean 2” (March 2014)
- Documents containing a particular erroneous character, introduced by incorrect encoding detection, were removed.
v. 5 (May 2014)
- Malformed vertical lines corrected, e.g. "MacLeodovy MacL eodůvk2eAgFnPc1d1" → "MacLeodovy MacLeodův k2eAgFnPc1d1".
v. 6 (June 2014)
- Machine-translated documents from the domains infostar.cz and navajo.cz removed.
v. 7 (2014-08-04)
- Paragraphs without accents removed.
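One way to detect a paragraph without accents is sketched below. The exact criterion used for the corpus is not stated here, so this is an assumption: a paragraph counts as unaccented if none of its characters carries a combining diacritic after Unicode decomposition. The function names are hypothetical.

```python
import unicodedata


def has_accents(text):
    """Return True if any character in text carries a diacritic.

    NFD normalization splits e.g. "č" into "c" plus a combining caron;
    unicodedata.combining() is non-zero for such combining marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)


def drop_unaccented(paragraphs):
    """Keep only paragraphs (strings) that contain at least one accent."""
    return [p for p in paragraphs if has_accents(p)]
```

Note that this sketch would also drop a short, correctly written paragraph that happens to contain no accented letter, so a production filter might add a minimum-length condition.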
v. 8 (2014-09-17)
- M ? j removed
Thanks to Marek Grác for spotting many errors and contributing to a cleaner corpus.
Suchomel, Vít (2012). Recent Czech Web Corpora. In 6th Workshop on Recent Advances in Slavonic Natural Language Processing. Brno, pp. 77–83.