csSkELL: Czech Corpus for SkELL

The Czech Corpus for SkELL is a text database used by the Czech SkELL interface (csSkELL) available at http://cskell.sketchengine.co.uk/run.cgi/skell. The corpus does not contain whole documents but only sentences sorted according to their text quality.

For corpus searches, this approach means:

  • consecutive sentences are unrelated: a sentence does not continue the one before it
  • sentences in the first concordance lines should be of better quality than later ones, i.e. they contain fewer non-alphabetic characters and punctuation marks, more frequent words, etc.

The text quality score was computed by the GDEX system.
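As a rough illustration of sorting sentences by quality, the sketch below ranks sentences with a toy score built from the two criteria named above: the share of alphabetic characters and the share of frequent words. This is not the GDEX formula. The function name `toy_quality_score`, the equal weights, the `frequent_words` set and the sample sentences are all assumptions made for the example.

```python
import re

def toy_quality_score(sentence, frequent_words):
    """Hypothetical stand-in for a quality score in [0, 1]; NOT the real GDEX formula."""
    chars = [c for c in sentence if not c.isspace()]
    words = re.findall(r"\w+", sentence.lower())
    if not chars or not words:
        return 0.0
    # Share of non-space characters that are letters (fewer digits and symbols is better).
    alpha_share = sum(c.isalpha() for c in chars) / len(chars)
    # Share of words found in the (assumed) list of frequent words.
    frequent_share = sum(w in frequent_words for w in words) / len(words)
    # Equal weights are an arbitrary choice for this sketch.
    return 0.5 * alpha_share + 0.5 * frequent_share

# Made-up inputs for illustration only.
frequent_words = {"pes", "běží", "po", "zahradě", "zde"}
sentences = [
    "!!! Sleva 50% >>> klikni zde <<< !!!",
    "Pes běží po zahradě.",
]

# Best-scoring sentences first, as in the first concordance lines.
ranked = sorted(sentences, key=lambda s: toy_quality_score(s, frequent_words), reverse=True)
for s in ranked:
    print(f"{toy_quality_score(s, frequent_words):.2f}  {s}")
```

With these inputs, the clean sentence is ranked ahead of the symbol-heavy one.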

The corpus draws on four sources. The first is websites that the Czech Webarchiv classifies as selective harvests. The second is 1,800 crawled websites provided by Webarchiv. The remaining two are articles and talk pages from Czech Wikipedia (downloaded in April 2017) and texts from the .cz domain of the Czech Timestamped web corpus.

This collection of texts from a variety of domains lets users explore everyday Czech usage across more than 1.4 billion words in more than 90 million sentences.

What is SkELL?

SkELL (Sketch Engine for Language Learning) is a simple tool that lets language students and teachers check whether and how a particular word or phrase is used by real speakers of a language.

No registration or payment required. Just type a word and click a button.

All examples, collocations and synonyms were identified automatically by ingenious algorithms and state-of-the-art software analysing large multi-billion-word samples of text. No manual work was involved.

csSkELL is a Czech version of the SkELL tool based on Sketch Engine.

Statistics

Source                            No. of words       Percentage
Webarchiv: selective harvests     ~ 987,299,101       68.40 %
Webarchiv: other sources          ~ 232,047,827       16.07 %
Timestamped web corpus            ~ 133,488,941        9.24 %
Wikipedia including talk pages    ~ 90,575,062         6.27 %
Total                             1,443,410,941      100.00 %
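The percentages in the table can be recomputed from the word counts. The sketch below assumes that each share is taken against the stated total and truncated, not rounded, to two decimal places; that assumption reproduces the figures in the table, which is also why they add up to slightly less than 100 %. The per-source counts are marked as approximate, so their sum need not equal the stated total exactly. The names `counts`, `stated_total` and `share_basis_points` are chosen for this example only.

```python
# Word counts copied from the table above.
counts = {
    "Webarchiv: selective harvests": 987_299_101,
    "Webarchiv: other sources": 232_047_827,
    "Timestamped web corpus": 133_488_941,
    "Wikipedia including talk pages": 90_575_062,
}
stated_total = 1_443_410_941

def share_basis_points(count, total):
    """Share of `total` in hundredths of a percent, truncated (integer arithmetic)."""
    return count * 10_000 // total

shares = {name: share_basis_points(n, stated_total) for name, n in counts.items()}

for name, bp in shares.items():
    # Format basis points as a percentage with two decimal places.
    print(f"{name}: {bp // 100}.{bp % 100:02d} %")

# Difference between the stated total and the sum of the approximate counts.
print("stated total minus sum of sources:", stated_total - sum(counts.values()))
```

Integer arithmetic is used so that the truncation does not depend on floating-point rounding.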

Availability

The corpus is accessible to all users with a subscription plan and to members of a site licence. It is not accessible to trial users.

Tools to work with the Czech SkELL corpus

The complete set of Sketch Engine tools is available for working with the Czech SkELL corpus. They generate:

  • word sketch – Czech collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Czech nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
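To make two of these outputs concrete, the sketch below builds a word list and an n-gram frequency list from a tiny tokenized sample. It only illustrates what such lists contain; it does not use Sketch Engine or reproduce its processing. The helper `ngram_counts` and the sample tokens are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper for this sketch)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Made-up, whitespace-tokenized sample; a real corpus would be tokenized properly.
tokens = "dobrý den a dobrý den všem".split()

# Word list: single words ordered by frequency.
word_list = Counter(tokens).most_common()

# N-grams: here bigrams (n = 2), ordered by frequency.
bigrams = ngram_counts(tokens, 2)

print(word_list)
print(bigrams.most_common())
```

In this sample the bigram ("dobrý", "den") occurs twice and every other bigram occurs once.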

Changelog

English Web 2015 (enTenTen15)

  • initial size 28 billion words

v2 (spring 2017)

  • 15 billion words
  • genre classification
  • in-depth analysis and removal of spam, including documents that were too short

English Web 2013 (enTenTen13)

  • 19 billion words

English Web 2012 (enTenTen12)

version 1 (14 June 2012)

  • sample of corpus – 3.7 billion words
  • crawled by SpiderLing in May 2012
  • encoded in UTF-8

version 2 (2012)

  • full corpus – 11 billion words

English Web 2008 (enTenTen08)

version 1 (15 November 2010)

  • initial version – 3.3 billion tokens
  • crawled by Heritrix in 2008
  • encoded in Latin1


Figure: Czech SkELL Corpus, distribution of text sources. Webarchiv: selective harvests (68.40 %), Webarchiv: other sources (16.07 %), Timestamped web corpus (9.24 %), Wikipedia including talk pages (6.27 %).
