Finnish TenTen web corpus crawled by SpiderLing in February 2014.
The corpus was cleaned with jusText, tokenised with unitok, deduplicated with onion and tagged with TreeTagger (see the list of tags and the annotation manual).
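To illustrate the deduplication step, here is a minimal sketch of n-gram based near-duplicate removal at paragraph level. It is only written in the spirit of a tool such as onion and does not reproduce its actual algorithm, interface or defaults. The function names, the whitespace tokenisation (a stand-in for unitok) and the values of `n` and `threshold` are all assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless more than `threshold` of its n-grams were seen before.

    Hypothetical helper: `n` and `threshold` are illustrative values only.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        # Whitespace splitting stands in for a real tokeniser.
        grams = ngrams(paragraph.split(), n)
        if grams:
            duplicated = len(grams & seen) / len(grams)
            if duplicated > threshold:
                continue  # near-duplicate: drop it and do not record its n-grams
        kept.append(paragraph)
        seen |= grams
    return kept


if __name__ == "__main__":
    sample = [
        "tämä on lyhyt esimerkki kappaleesta",
        "tämä on lyhyt esimerkki kappaleesta",
        "aivan toinen kappale eri sanoilla tässä",
    ]
    for paragraph in deduplicate(sample):
        print(paragraph)
```

Paragraphs shorter than `n` tokens produce no n-grams and are always kept in this sketch; a production tool would need an explicit policy for such short segments.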
Common TenTen corpora attributes
1.0 (21 May 2014)
- Corpus created: 1.7 billion tokens.
2.0 (8 August 2014)
- Lemma and lempos attributes added. Thanks to Josh Waxman for the Finnish TreeTagger model.
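As an illustration of the lempos attribute, the sketch below joins a lemma with a one-letter part-of-speech code, which is the usual shape of lempos values (for example `talo-n`). The tag names in `POS_MAP`, the fallback code `x` and the function name are hypothetical; they are not taken from the Finnish TreeTagger tagset, so consult the list of tags for the real values.

```python
# Hypothetical mapping from tagger output to one-letter POS codes.
# The real Finnish tagset may use different tag names.
POS_MAP = {
    "N": "n",    # noun (assumed tag name)
    "V": "v",    # verb (assumed tag name)
    "A": "j",    # adjective (assumed tag name)
    "Adv": "a",  # adverb (assumed tag name)
}


def make_lempos(lemma, tag, pos_map=POS_MAP, fallback="x"):
    """Build a lempos string as lemma + '-' + one-letter POS code."""
    return f"{lemma}-{pos_map.get(tag, fallback)}"


if __name__ == "__main__":
    for lemma, tag in [("talo", "N"), ("juosta", "V"), ("ja", "CC")]:
        print(make_lempos(lemma, tag))
```

Keeping the part of speech inside the attribute value lets a query distinguish homographs that share a lemma but differ in word class.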