Hebrew General corpus

This corpus was crawled from the Internet and consists mostly of newspaper material. It contains more than 150 million words. The corpus was developed and donated by Prof. Ari Rappoport and Daphna Shezaf of the Computer Science and Engineering Department at the Hebrew University of Jerusalem.


A web corpus, crawled and deduplicated, that covers multiple domains: blog posts, newspapers, commercial pages, … It contains approximately 50 million words.

Part-of-speech tagset

See the Hebrew POS tagset summary.