deTenTen: Corpus of the German Web

The German Web Corpus (deTenTen) is a German corpus made up of texts collected from the internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

Detailed information about TenTen corpora is on the separate page Common TenTen corpora attributes.

Part-of-speech tagset

The corpus was tagged by RFTagger using this POS tagset.

Overview of German TenTen corpora

These web corpora were crawled and processed repeatedly over the years:

  • German Web corpus 2013 (deTenTen13) – 16.5 billion words
  • German Web corpus 2010 (deTenTen10) – 2.3 billion words

Tools to work with the German Web corpus

A complete set of Sketch Engine tools is available to work with this German Web corpus to generate:

  • word sketch – German collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of German nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
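
To make the word list and n-gram outputs above more concrete, here is a minimal sketch of how such frequency lists can be computed in principle. It is plain Python on a made-up sample sentence, not Sketch Engine's implementation or API; the function name `ngram_frequencies` and the sample text are assumptions chosen for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens (hypothetical helper)."""
    # zip over n shifted copies of the token list yields each n-gram as a tuple
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Toy sample sentence, not taken from deTenTen.
text = "der Hund sieht die Katze und die Katze sieht den Hund"
tokens = text.split()  # naive whitespace tokenization, an assumption

word_list = Counter(tokens)             # word list: frequency per word form
bigrams = ngram_frequencies(tokens, 2)  # n-grams with n = 2

print(word_list.most_common(3))
print(bigrams.most_common(1))  # [(('die', 'Katze'), 2)]
```

On a corpus of billions of tokens the same counting idea applies, but it relies on proper tokenization, lemmatization and indexed storage rather than an in-memory counter.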

Changelog

German Web corpus (deTenTen13)

  • obtained in 2013 – 16.5 billion tokens

German Web corpus (deTenTen10)

  • version 2.0 (28 April 2011)
    • fixed problems with part-of-speech tagging which caused a major data loss in the previous version
    • 2.8 billion tokens
  • version 1.0 (30 November 2010)
    • initial version – 1.2 billion tokens

Bibliography

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen Corpus Family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Search the German Web corpus

Sketch Engine offers a range of tools to work with the German Web corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.