The German TenTen corpus belongs to the TenTen class of corpora, which usually contain a billion words or more.

The corpus is double-tagged with RFTagger (attribute tag, tagset reference) and TreeTagger (attribute tt_tag, tagset reference).
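
Because each token carries two part-of-speech attributes, a query can address either tagger by naming its attribute. The Python sketch below builds one-token CQL expressions for both. The helper name cql_token is hypothetical, and the tag patterns are assumptions not taken from this page (STTS "NN" for TreeTagger common nouns, an "N." prefix for RFTagger nouns); check them against the tagset references before relying on them.

```python
def cql_token(attribute: str, pattern: str) -> str:
    """Return a one-token CQL expression of the form [attribute="pattern"].

    The pattern is passed through unchanged because CQL treats it as a
    regular expression; only an unescaped double quote would break the
    expression, so it is rejected here.
    """
    if '"' in pattern:
        raise ValueError("pattern must not contain a double quote")
    return f'[{attribute}="{pattern}"]'


# Assumed tag values, for illustration only:
# RFTagger noun tags are assumed to start with "N." (the dot is escaped).
rf_query = cql_token("tag", r"N\..*")
# TreeTagger is assumed to use the STTS tag "NN" for common nouns.
tt_query = cql_token("tt_tag", "NN")

print(rf_query)
print(tt_query)
```

Keeping the attribute name as a parameter makes it easy to run the same pattern against either tagger's annotation and compare the results.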



  • Web texts in German obtained in 2013 – 16.5 billion tokens

v 2.0 (28 April 2011)

  • fixed problems with part-of-speech tagging which caused a major data loss in the previous version
  • 2.8 billion tokens

v 1.0 (30 November 2010)

  • initial version – 1.2 billion tokens