The Latvian TenTen corpus was crawled with SpiderLing in April 2014. It was encoded in UTF-8, cleaned, and deduplicated. The corpus was tagged with LVTagger, developed by Pēteris Paikens of the University of Latvia, using the MULTEXT-East tagset.
- Morphological annotation added.
- Additional filters applied, including removal of English text. 658 million tokens.
- Initial version. 668 million tokens.
Because tokenisation was done by an external tool, the corpus contains no glue tags (<g/>), which would otherwise mark adjacent tokens that are not separated by a space.
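To illustrate what the missing glue means in practice, here is a minimal sketch of rebuilding running text from one-token-per-line (vertical) input. It assumes the token is the first tab-separated field on each line and that a glue tag appears on a line of its own. The function name `detokenize` and the sample tokens are hypothetical and are not taken from the corpus.

```python
def detokenize(vertical_lines):
    """Join one-token-per-line input into text, honouring <g/> glue tags."""
    pieces = []
    glue = False  # True when a <g/> tag precedes the next token
    for line in vertical_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.strip() == "<g/>":
            glue = True
            continue
        if line.startswith("<") and line.endswith(">"):
            # Other structural tags (e.g. <s>, </s>) carry no token text.
            continue
        token = line.split("\t", 1)[0]  # first column is the word form
        if pieces and not glue:
            pieces.append(" ")
        pieces.append(token)
        glue = False
    return "".join(pieces)


# With glue tags, punctuation attaches to the preceding word.
with_glue = ["Sveiki", "<g/>", ",", "pasaule", "<g/>", "!"]
# Without glue tags, as in this corpus, every token is separated by a space.
without_glue = ["Sveiki", ",", "pasaule", "!"]

print(detokenize(with_glue))     # Sveiki, pasaule!
print(detokenize(without_glue))  # Sveiki , pasaule !
```

In a corpus without glue, the original spacing around punctuation cannot be recovered from the token stream alone, so reconstructed text will show a space before punctuation marks.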