The roTenTen corpus is a Romanian web corpus from the TenTen Corpus Family. It was crawled from the Internet with the SpiderLing tool in June 2016.

The corpus is morphologically tagged with TreeTagger, using the UTF-8 Romanian parameter file and the Romanian tagset.
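
As a rough illustration of working with tagged data of this kind, the sketch below reads tab-separated lines of the form word, tag, lemma, which is the column layout TreeTagger commonly produces when token and lemma output are both enabled. It is only a sketch under stated assumptions: the sample text, the function name parse_tagged_lines, and the tags TAG_A to TAG_D are hypothetical placeholders and are not taken from the Romanian tagset or from roTenTen itself.

```python
from collections import Counter

# Hypothetical sample in "word<TAB>tag<TAB>lemma" form.
# TAG_A..TAG_D are placeholders, NOT real values from the Romanian tagset;
# the lemmas are illustrative only.
SAMPLE = (
    "<s>\n"
    "Aceasta\tTAG_A\tacesta\n"
    "este\tTAG_B\tfi\n"
    "o\tTAG_C\tun\n"
    "carte\tTAG_D\tcarte\n"
    "</s>\n"
)


def parse_tagged_lines(text):
    """Yield (word, tag, lemma) triples from tab-separated lines.

    Lines that do not have exactly three columns (for example
    structural markup such as <s>) are skipped.
    """
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        yield tuple(parts)


tokens = list(parse_tagged_lines(SAMPLE))

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag, _ in tokens)
```

Skipping lines without three columns keeps structural markup out of the token list; a stricter reader might instead raise an error on unexpected lines.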

Structural attributes and preparation of the corpus

More information about how TenTen corpora are prepared can be found on the Common TenTen corpora attributes documentation page.

Changelog

v1.0 (August 2016)

  • initial version – 3.14 billion tokens