The Estonian web corpus was crawled by SpiderLing in 2013. It was encoded in UTF-8, cleaned and deduplicated. You can learn more about the corpus and its tagset in the documentation (available here).
The current version of the corpus has 330 million tokens.
- the initial version obtained from the web in 2013