Danish TenTen web corpus.

The corpus was cleaned with jusText, deduplicated with onion, tokenised and tagged with CST’s TaggerXML (see the tagset documentation), and lemmatised with CST’s lemmatiser.

Structural attributes

Common TenTen corpora attributes

Changelog

v. 4 (April 2015)

  • Tagging of the word “Big” corrected to the tag “EGEN”

v. 3 (March 2015)

  • POS tags containing slashes corrected; unwanted HTML tags filtered out

v. 2 (4 September 2014)

  • Norwegian text filtered out

v. 1 (1 July 2014)

  • lempos attribute added
  • improved cleaning of messy data and correction of tagger mistakes

v. 0 (12 March 2014)

  • daTenTen14 – 2.4 billion tokens
  • crawled by SpiderLing in January 2014