TenTen Corpus Family

The TenTen Corpus Family (TenTen corpora) is a collection of text corpora created from the Web. All TenTen corpora are prepared according to the same criteria, which helps keep the quality of the resulting texts consistent and makes the corpora comparable with each other.

Each corpus in the family is a web corpus processed in the same way, with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

Description of preparing TenTen corpora

  1. Corpora are crawled from the Internet with the Spiderling tool, a web spider designed for linguistic purposes.
  2. The downloaded texts are then cleaned with jusText, a heuristic-based boilerplate removal tool that strips irrelevant (non-text or poor-text) content such as navigation links, advertisements, headers and footers.
  3. The cleaned texts are tokenized.
  4. Afterwards, onion performs deduplication at the paragraph level.
  5. Finally, the corpus texts are lemmatized and part-of-speech tagged for those languages for which tagger and lemmatizer tools are available.
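
The cleaning, tokenization and deduplication steps above can be sketched in a few lines of Python. This is only a toy illustration and not the code of jusText or onion: the function names, the word-count filter standing in for boilerplate removal, the regex tokenizer, the n-gram length and the overlap threshold are all assumptions chosen for the example. The real tools are documented on the corpus.tools website.

```python
import re

# Hypothetical helpers for illustration only; not the jusText or onion code.

def clean(paragraphs, min_words=5):
    """Crude stand-in for boilerplate removal: drop very short paragraphs."""
    return [p for p in paragraphs if len(p.split()) >= min_words]

def tokenize(text):
    """Naive tokenizer: lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def ngrams(tokens, n):
    """Set of all token n-grams in one paragraph."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def deduplicate(paragraphs, n=3, threshold=0.5):
    """Drop a paragraph when more than `threshold` of its n-grams were seen before."""
    seen = set()
    kept = []
    for tokens in paragraphs:
        grams = ngrams(tokens, n)
        if grams and len(grams & seen) / len(grams) > threshold:
            continue  # mostly duplicated content, skip it
        seen |= grams
        kept.append(tokens)
    return kept

def build(paragraphs):
    """Run the toy pipeline: clean, tokenize, deduplicate."""
    return deduplicate([tokenize(p) for p in clean(paragraphs)])

raw = [
    "Home | Contact",
    "The TenTen corpora are built from texts found on the web.",
    "The TenTen corpora are built from texts found on the web.",
    "Each corpus is tokenized and deduplicated before tagging.",
]
corpus = build(raw)
for tokens in corpus:
    print(" ".join(tokens))
```

In this sketch the navigation-like first paragraph is dropped by the cleaning step and the repeated paragraph is dropped by the deduplication step, so two tokenized paragraphs remain.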

Detailed information about these tools is available on the corpus.tools website, and the building of TenTen corpora is described in the bibliography (below).

Corpus metadata

The following corpus metadata (structural attributes in corpus linguistics) are shared by all TenTen corpora.

Document structures

  • 1st level domain – e.g. “com”
  • 2nd level domain – e.g. “wikipedia.org”
  • Web domain – e.g. “en.wikipedia.org”
  • url – e.g. “https://en.wikipedia.org/wiki/Wikipedia” (URL of the source document)
  • wordcount – e.g. “152” (exact number of words in the document)
  • length – e.g. “0-1k” (length range of the document, in thousands of words)
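
As a rough illustration of how the domain-related attributes relate to each other, the sketch below derives them from a document URL with Python's standard library. The function name and dictionary keys are hypothetical and are not the attribute names used in the corpora. Taking the last two host labels as the 2nd level domain is a simplifying assumption that would be wrong for suffixes such as “co.uk”, and counting whitespace-separated words only approximates the wordcount attribute.

```python
from urllib.parse import urlsplit

def document_attributes(url, text):
    """Derive illustrative document metadata from a URL and the document text.

    Key names are hypothetical and chosen for readability only.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return {
        "first_level_domain": labels[-1],               # e.g. "org"
        "second_level_domain": ".".join(labels[-2:]),   # naive: last two labels
        "web_domain": host,                             # full host name
        "url": url,
        "wordcount": len(text.split()),                 # whitespace-separated words
    }

attrs = document_attributes(
    "https://en.wikipedia.org/wiki/Wikipedia",
    "Wikipedia is a free online encyclopedia",
)
for name, value in attrs.items():
    print(name, "=", value)
```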

Paragraph structure

  • heading – “1” marks headline text, “0” marks other text

Attributes specific to particular corpora can be found on the corpus information page.

Search the TenTen corpora

Sketch Engine offers a range of tools to work with the TenTen corpora.

Tools to work with TenTen Corpora

The complete set of Sketch Engine tools is available for the TenTen corpora and can generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
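
To make two of these outputs concrete, the sketch below computes an n-gram frequency list and a simple keyword-in-context concordance over a tiny token list. It is a toy illustration in plain Python and not how Sketch Engine implements these tools; the function names, the context width and the sample sentence are assumptions made for the example.

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "the web corpus is a large corpus built from the web".split()

bigrams = ngram_frequencies(tokens, 2)
print(bigrams.most_common(3))

kwic = concordance(tokens, "corpus", width=2)
for left, hit, right in kwic:
    print(f"{left:>10} [{hit}] {right}")
```

On real corpora of billions of words such lists are computed from prebuilt indexes rather than by scanning tokens in memory, so the sketch only shows what the outputs look like.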

Changelog

The tools for building new TenTen corpora are under continuous development. More information about these tools is available at http://corpus.tools/

Bibliography

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

TenTen corpora available in Sketch Engine

The TenTen family currently comprises text corpora in 30+ languages, built over the last ten years.

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.