csSkELL: Czech Corpus for SkELL
The Czech Corpus for SkELL is a text database used by the Czech SkELL interface (csSkELL), available at http://cskell.sketchengine.co.uk/run.cgi/skell. The corpus does not contain whole documents, only individual sentences sorted by their text quality.
In terms of corpus search, this approach means:
- adjacent sentences are unrelated: a sentence does not continue the one before it
- sentences in the first concordance lines should be of better quality than later ones, e.g. they contain fewer non-alphabetic characters and punctuation marks, more frequent words, etc.
The text quality score was computed by the GDEX system.
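To make the idea of quality-sorted sentences concrete, here is a minimal sketch in Python. It is not the GDEX formula, which is not described here. The function names, the equal weights, and the "seen at least twice" frequency threshold are all assumptions chosen for illustration. The sketch only rewards the two properties named above: few non-alphabetic characters and many frequent words.

```python
from collections import Counter

PUNCT = ".,;:!?()\"'"


def tokenize(sentence):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    words = (w.strip(PUNCT).lower() for w in sentence.split())
    return [w for w in words if w]


def toy_quality_score(sentence, word_freq):
    """Hypothetical score in [0, 1]; higher is better. NOT the real GDEX score."""
    chars = [c for c in sentence if not c.isspace()]
    words = tokenize(sentence)
    if not chars or not words:
        return 0.0
    # Share of non-space characters that are letters
    # (digits, punctuation and symbols lower the score).
    alpha_share = sum(c.isalpha() for c in chars) / len(chars)
    # Share of words seen at least twice in the collection
    # (an assumed stand-in for "frequent words").
    common_share = sum(word_freq.get(w, 0) >= 2 for w in words) / len(words)
    # Equal weights are an arbitrary choice for this sketch.
    return 0.5 * alpha_share + 0.5 * common_share


def sort_by_quality(sentences):
    """Return sentences ordered from best to worst toy score."""
    word_freq = Counter(w for s in sentences for w in tokenize(s))
    return sorted(
        sentences,
        key=lambda s: toy_quality_score(s, word_freq),
        reverse=True,
    )


if __name__ == "__main__":
    sample = [
        "@@ 12345 ### http://x.y/?q=1 !!!",
        "Pes běží po louce.",
        "Kočka běží po zahradě.",
    ]
    for line in sort_by_quality(sample):
        print(line)
```

In a corpus sorted this way, the noisy line ends up at the bottom of the list, which is why the first concordance lines tend to be the cleanest.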
The corpus is built from several sources. The first is websites classified by the Czech Webarchiv as part of its selective harvests. The second is 1,800 crawled websites provided by Webarchiv. Further sources are articles and discussions from the Czech Wikipedia (downloaded in April 2017) and texts from the .cz domain of the Czech Timestamped web corpus.
The variety of domains covered by the corpus lets users explore Czech in its everyday usage across 1.4 billion words in more than 90 million sentences.