The Hungarian web corpus was crawled by SpiderLing in June 2012. The corpus was encoded in UTF-8, cleaned, and deduplicated. The tagset can be found on the website of the Hungarian Academy of Sciences.

The corpus currently contains almost 3.2 billion tokens.

v. 1.0

  • initial version, obtained from the web in 2012