Hebrew General corpus
This corpus was crawled from the Internet and consists mostly of newspaper material. It contains more than 150 million words. Development of the corpus was contributed by Prof. Ari Rappoport and Daphna Shezaf of the Computer Science and Engineering Department at the Hebrew University of Jerusalem.
A web corpus, crawled and deduplicated, covering multiple domains: blog posts, newspapers, commercial pages, … The corpus contains approximately 50 million words.
Tagger and tagset
The tagger output for Hebrew has 28 different attributes, some of which are relevant only to certain parts of speech.
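As a rough illustration of what "relevant only to certain parts of speech" can mean in practice, the sketch below keeps a fixed set of attributes per token and leaves the inapplicable ones empty. The attribute names (`pos`, `gender`, `number`, `tense`), the part-of-speech labels, and the function `make_token_record` are hypothetical and chosen only for the example; they are not taken from the actual tagset, which defines 28 attributes.

```python
from typing import Dict, Optional

# Hypothetical attribute names, for illustration only.
# The real tagger output has 28 attributes.
ATTRIBUTES = ["pos", "gender", "number", "tense"]

# Hypothetical mapping from part of speech to the attributes that apply to it.
RELEVANT = {
    "noun": {"pos", "gender", "number"},
    "verb": {"pos", "gender", "number", "tense"},
    "adverb": {"pos"},
}


def make_token_record(word: str, values: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Build a fixed-shape record; attributes not relevant to the POS are None."""
    allowed = RELEVANT[values["pos"]]
    record: Dict[str, Optional[str]] = {"word": word}
    for attr in ATTRIBUTES:
        # Keep a value only when the attribute applies to this part of speech.
        record[attr] = values.get(attr) if attr in allowed else None
    return record
```

With this layout every token has the same set of fields, so a query can filter on any attribute while tokens for which the attribute does not apply simply carry an empty value.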
For more information about these corpora, see the full description.