koTenTen: Corpus of the Korean Web

The Korean Web Corpus (koTenTen) is a Korean language corpus made up of texts collected from the Internet. It belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

The data was crawled by the SpiderLing web spider in August and September 2012.

Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.

Part-of-speech tagset

The koTenTen corpus was annotated by the HanNanum POS tagger using the following simplified tagset.

Tools to work with the Korean Web corpus

The complete set of Sketch Engine tools is available for this Korean Web corpus. The tools can generate:

  • word sketch – Korean collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Korean nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
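
To make the last two items concrete, here is a minimal Python sketch of an n-gram frequency list and a keyword-in-context concordance. It is not how Sketch Engine computes these results; the function names `ngram_frequencies` and `concordance` are hypothetical, and the sample sentence is a toy input split on whitespace, whereas real Korean text would first need a morphological analyzer such as HanNanum.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


# Toy sample, tokenized on whitespace only (an assumption for illustration).
tokens = "나는 학교에 간다 너는 학교에 간다 나는 집에 간다".split()

bigrams = ngram_frequencies(tokens, 2)
top_bigram = bigrams.most_common(1)[0]  # most frequent two-word unit
kwic = concordance(tokens, "집에")       # every occurrence with its context

print(top_bigram)
print(kwic)
```

On a corpus of this size, the same counts would be read from a precomputed index rather than recomputed from raw text for every query.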

Changelog

v1.0 (10 September 2012)

  • initial version – 461 million words

Bibliography

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Search the Korean koTenTen corpus

Sketch Engine offers a range of tools to work with this Korean corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn how in minutes.