csSkELL: Czech Corpus for SkELL

The Czech Corpus for SkELL is a text database used by the Czech SkELL interface (csSkELL) available at http://cskell.sketchengine.co.uk/run.cgi/skell. The corpus does not contain whole documents but only sentences sorted according to their text quality.

For corpus searches, this means:

  • consecutive sentences are unrelated to each other
  • sentences in the first concordance lines should be of higher quality than later ones: fewer non-alphabetic characters and punctuation marks, more frequent words, etc.

The text quality score was computed by the GDEX (Good Dictionary Examples) system.
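
The idea of ranking sentences by quality can be illustrated with a minimal sketch. This is not the GDEX implementation: the scoring formula, the equal weights, and the tiny `COMMON_WORDS` set are assumptions made purely for illustration, based only on the two criteria named above (few non-alphabetic characters, many frequent words).

```python
import re

# Hypothetical list of frequent Czech words; a real system would derive
# this from corpus frequency data rather than hard-coding it.
COMMON_WORDS = {"a", "je", "se", "na", "v", "to", "že", "pes", "doma"}


def quality_score(sentence: str) -> float:
    """Toy score in [0, 1]; higher means cleaner text with more common words."""
    chars = [c for c in sentence if not c.isspace()]
    words = re.findall(r"\w+", sentence.lower())
    if not chars or not words:
        return 0.0
    # Share of characters that are letters (penalises digits and symbols).
    alpha_ratio = sum(c.isalpha() for c in chars) / len(chars)
    # Share of words found in the frequent-word list.
    common_ratio = sum(w in COMMON_WORDS for w in words) / len(words)
    # Equal weighting is an arbitrary choice for this sketch.
    return 0.5 * alpha_ratio + 0.5 * common_ratio


def rank_sentences(sentences):
    """Return sentences ordered from highest to lowest toy score."""
    return sorted(sentences, key=quality_score, reverse=True)


examples = ["@@ ### 12345 !!!", "Pes je na xyzzy.", "Pes je doma."]
for s in rank_sentences(examples):
    print(f"{quality_score(s):.2f}  {s}")
```

Sorting a sentence store this way is what makes the first concordance lines the cleanest ones; the actual GDEX configuration used for the corpus is not described here.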

The corpus draws on four sources: websites classified by the Czech Webarchiv as selective harvests; 1,800 crawled websites provided by Webarchiv; articles and talk pages from Czech Wikipedia (downloaded in April 2017); and texts from the .cz domain of the Czech Timestamped web corpus.

The variety of domains in the corpus lets users explore everyday Czech usage across more than 1.4 billion words in over 90 million sentences.

What is SkELL?

SkELL (Sketch Engine for Language Learning) is a simple tool for students and teachers of language to easily check whether or how a particular phrase or a word is used by real speakers of a language.

No registration or payment required. Just type a word and click a button.

All examples, collocations and synonyms were identified automatically by ingenious algorithms and state-of-the-art software analysing large multi-billion-word samples of text. No manual work was involved.

csSkELL is the Czech version of the SkELL tool, which is based on Sketch Engine.

Statistics

Source                            No. of words       Percentage
Webarchiv: selective harvests     ~ 987,299,101      68.40 %
Webarchiv: other sources          ~ 232,047,827      16.07 %
Timestamped web corpus            ~ 133,488,941      9.24 %
Wikipedia including talk pages    ~ 90,575,062       6.27 %
Total                             1,443,410,941      100.00 %
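
As a quick check of the arithmetic, the percentages can be recomputed from the word counts. The sketch below assumes the published figures were truncated (not rounded) to two decimals, which is what the numbers suggest; the function and variable names are illustrative only.

```python
import math

# Word counts as listed in the statistics table (marked "~" there).
WORD_COUNTS = {
    "Webarchiv: selective harvests": 987_299_101,
    "Webarchiv: other sources": 232_047_827,
    "Timestamped web corpus": 133_488_941,
    "Wikipedia including talk pages": 90_575_062,
}
STATED_TOTAL = 1_443_410_941


def truncated_share(count: int, total: int, decimals: int = 2) -> float:
    """Percentage of total, cut off (not rounded) after `decimals` places."""
    factor = 10 ** decimals
    return math.floor(100 * count / total * factor) / factor


shares = {name: truncated_share(n, STATED_TOTAL) for name, n in WORD_COUNTS.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2f} %")
```

With ordinary rounding, three of the four shares would come out 0.01 higher (16.08, 9.25, 6.28), which is why the listed percentages add up to 99.98 rather than 100. The four per-source counts also sum to 1,443,410,931, ten words short of the stated total, consistent with their being marked as approximate.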

Availability

The corpus is accessible to all users with a subscription plan and to site licence members. It is not available to trial users.

Tools to work with the Czech SkELL corpus

A complete set of Sketch Engine tools is available for working with the Czech SkELL corpus. They generate:

  • word sketch – Czech collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Czech nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
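
To make the n-gram item concrete, here is a minimal sketch of how a frequency list of multi-word units can be built. It is not how Sketch Engine computes n-grams: the whitespace tokenization, lowercasing, and the function name `ngram_frequencies` are simplifying assumptions for illustration.

```python
from collections import Counter


def ngram_frequencies(sentences, n=2):
    """Count n-grams (as token tuples) within each sentence separately."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization: lowercase and split on whitespace.
        tokens = sentence.lower().split()
        # zip over n shifted copies of the token list yields all n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


sample = ["dobrý den", "dobrý den všem", "den všem"]
for ngram, freq in ngram_frequencies(sample, n=2).most_common():
    print(" ".join(ngram), freq)
```

Counting within sentence boundaries matters for this corpus in particular, because neighbouring sentences are unrelated and an n-gram spanning two of them would be meaningless.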

Changelog

VERSION                  CORPUS SIZE (WORDS)   DESCRIPTION
1.0                      1,717,516,129         initial version without any cleaning
2.0                      1,608,867,697         first published version: simple cleaning of Slovak texts and texts without diacritics; headlines removed
2.1                      1,552,052,945         further cleaning of Slovak texts and texts without diacritics; removed sentences containing:
                                               – texts created automatically by Wikipedia
                                               – non-ASCII characters
                                               – only non-alphabetic characters
                                               – HTML tags
                                               – URLs and email addresses
2.2 (current version)    1,443,410,941         further cleaning of texts without diacritics; removed most sentences with GDEX value "0"; removed sentences starting with certain n-grams (e.g. "Václav MORAVEC ," or "Moderátor – Václav Moravec"); removed sentences not starting/ending with tag
2.3 (in preparation)     -                     further cleaning of texts without diacritics; removal of sentences containing a hapax legomenon (a word with only one occurrence in the whole corpus)
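
Several of the cleaning steps listed in the changelog can be approximated with simple filters. The sketch below is an illustration only, not the pipeline used to build the corpus: the regular expressions, the tokenization, and the function names are assumptions, and the language-specific steps (Slovak detection, diacritics checks) are left out.

```python
import re
from collections import Counter

# Rough patterns; real-world URL and email detection is more involved.
TAG_RE = re.compile(r"<[^>]+>")
URL_EMAIL_RE = re.compile(r"https?://\S+|www\.\S+|\S+@\S+\.\S+")


def is_clean(sentence: str) -> bool:
    """Reject sentences with HTML tags, URLs/emails, or no letters at all."""
    if TAG_RE.search(sentence) or URL_EMAIL_RE.search(sentence):
        return False
    return any(c.isalpha() for c in sentence)


def drop_hapax_sentences(sentences):
    """Remove sentences containing a word that occurs once in the whole input."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(tok for toks in tokenized for tok in toks)
    return [
        s for s, toks in zip(sentences, tokenized)
        if all(freq[t] > 1 for t in toks)
    ]


raw = ["pes běží", "pes spí", "běží pes", "Viz <b>zde</b>.", "123 --- !!!"]
kept = drop_hapax_sentences([s for s in raw if is_clean(s)])
print(kept)
```

Note that hapax removal depends on the whole collection: word frequencies have to be counted over all remaining sentences before any sentence can be dropped, so it is a two-pass step, unlike the per-sentence pattern filters.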


Figure: Czech SkELL Corpus, distribution of text sources (Webarchiv: selective harvests 68.40 %; Webarchiv: other sources 16.07 %; Timestamped web corpus 9.24 %; Wikipedia including talk pages 6.27 %).
