The Latvian TenTen corpus was crawled by SpiderLing in April 2014. It was encoded in UTF-8, cleaned and deduplicated. The corpus was tagged with LVTagger, developed by Pēteris Paikens of the University of Latvia, using the MULTEXT-East tagset.

August 2016

  • Morphological annotation added.
  • Additional filters applied, including removal of English text. 658 million tokens.

May 2014

  • Initial version. 668 million tokens.

Because the tokenisation was done by an external tool, the corpus contains no glue tags (<g/>), which would otherwise mark token boundaries that have no space between them in the original text.
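The practical effect of missing glue can be illustrated with a minimal sketch. It assumes a vertical layout with one token per line, optional tab-separated attributes after the word form, and structural tags on their own lines. The function name `detokenize` and the sample tokens are hypothetical and are not taken from the corpus or from any Sketch Engine tool.

```python
def detokenize(vertical_lines):
    """Rebuild running text from vertical-format lines (one token per line).

    Assumption: a line containing only <g/> means the next token attaches
    to the previous one without a space. Any other line of the form <...>
    is treated as a structural tag and skipped.
    """
    pieces = []
    glue = False  # True right after a <g/> line
    for line in vertical_lines:
        line = line.strip()
        if not line:
            continue
        if line == "<g/>":
            glue = True
            continue
        if line.startswith("<") and line.endswith(">"):
            continue  # structural tag such as <s> or </doc>
        word = line.split("\t")[0]  # the first column is the word form
        if pieces and not glue:
            pieces.append(" ")
        pieces.append(word)
        glue = False
    return "".join(pieces)


# With glue, punctuation re-attaches to the preceding word.
with_glue = ["<s>", "Labdien", "<g/>", ",", "pasaule", "<g/>", "!", "</s>"]
# Without glue (as in this corpus), every token boundary becomes a space.
without_glue = ["<s>", "Labdien", ",", "pasaule", "!", "</s>"]

print(detokenize(with_glue))
print(detokenize(without_glue))
```

Under these assumptions, text rebuilt from this corpus has a space before punctuation marks, so the original spacing cannot be recovered from the token stream alone.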