Hebrew General corpus

This corpus was crawled from the Internet and consists mostly of newspaper material. It contains more than 150 million words. Development of the corpus was contributed by Prof. Ari Rappoport and Daphna Shezaf of the Computer Science and Engineering Department at the Hebrew University of Jerusalem.

HebWaC

A web corpus, crawled and deduplicated, that covers multiple domains: blog posts, newspapers, commercial pages, … The corpus contains approximately 50 million words.

Tagger and tagset

The tagger output for Hebrew has 28 different attributes, some of which are relevant only to certain parts of speech.

Tagset overview

foreign
noun
punctuation
verb
preposition
properName
adjective
numeral
adverb
conjunction
pronoun
negation
participle
copula
quantifier
numberExpression
modal
interrogative
existential
wPrefix
title
interjection
url
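
When post-processing tagged output, it can help to check tag values against the list above. The following is a minimal sketch in Python, not part of the tagger itself. The names HEBREW_POS_TAGS and unknown_pos_tags are hypothetical, and it assumes the tag values appear in the data spelled exactly as listed here, including case.

```python
# Tag values copied verbatim from the tagset overview above.
HEBREW_POS_TAGS = frozenset({
    "foreign", "noun", "punctuation", "verb", "preposition", "properName",
    "adjective", "numeral", "adverb", "conjunction", "pronoun", "negation",
    "participle", "copula", "quantifier", "numberExpression", "modal",
    "interrogative", "existential", "wPrefix", "title", "interjection", "url",
})


def unknown_pos_tags(tags):
    """Return, sorted, the distinct values in `tags` that are not in the overview.

    Hypothetical helper: the comparison is case-sensitive, which is an
    assumption about how the tagger spells its output.
    """
    return sorted(set(tags) - HEBREW_POS_TAGS)


if __name__ == "__main__":
    # "Noun" differs from "noun" only by case, so it is reported as unknown.
    print(unknown_pos_tags(["noun", "verb", "Noun", "wPrefix"]))
```

A check like this catches case or spelling mismatches early, before tag values are used in queries or frequency counts.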

For more information about these corpora, see the full description.