heTenTen: Corpus of the Hebrew Web

The Hebrew Web Corpus (heTenTen) is a Hebrew corpus made up of texts collected from the Internet. It belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

The data was crawled by the SpiderLing web spider in August 2014 and comprises more than 890 million words from 20,000 web domains. heTenTen was tokenised with a modified version of Yoav Goldberg’s Hebrew tokeniser.

Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.

Versions of Hebrew TenTen corpora

Sketch Engine provides two versions of the heTenTen corpus.

  • Hebrew Web 2014 (heTenTen14, no POS tagging) – untagged version, accessible to all users with a Sketch Engine account
  • Hebrew Web 2014 (heTenTen14, Meni/Alon tagged + lempos) – part-of-speech tagged version, accessible for academic purposes only. This version offers word sketches and the thesaurus, features that require POS tagging.

Hebrew Web corpus POS tagged by Meni Adler’s tool

heTenTen was annotated with morphological categories using Meni Adler’s tool, with thanks to Noam Ordan.

Access policy

Access to the tagged corpus is limited to academic use. To gain access, send an email to support@sketchengine.co.uk with proof of your academic affiliation.

Token attributes in heTenTen corpus with POS tagging

Hebrew has a rich morphology. The tagger output for Hebrew has 27 different attributes, some of which are relevant only to certain parts of speech. Apart from the usual attributes such as word, tag and lemma, there are transliterations into the Latin alphabet, morphological categories (affixes) and grammatical categories (gender, number, person, case). The tagset reference can be found below.

Whenever an attribute is irrelevant or missing, its value is the string ‘NULL’. For example, to search for all nouns in the corpus that are prefixed with the definite article, exclude that placeholder value with the following CQL query: [tag="noun" & prefdefinite!="NULL"].
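The NULL convention can be illustrated with a small query builder. This is a minimal sketch in Python, not part of Sketch Engine: the helper names eq, present, absent and token are hypothetical, and it assumes only that CQL accepts attribute="value" and attribute!="value" conditions joined by & inside square brackets.

```python
NULL = "NULL"  # placeholder the corpus uses for irrelevant or missing attributes


def eq(attr, value):
    """Condition: the attribute equals the given value."""
    return f'{attr}="{value}"'


def present(attr):
    """Condition: the attribute carries a real value (is not NULL)."""
    return f'{attr}!="{NULL}"'


def absent(attr):
    """Condition: the attribute is irrelevant or missing (is NULL)."""
    return f'{attr}="{NULL}"'


def token(*conditions):
    """Join conditions into a single CQL token expression."""
    return "[" + " & ".join(conditions) + "]"


# Nouns prefixed with the definite article
print(token(eq("tag", "noun"), present("prefdefinite")))
# Nouns without the definite article prefix
print(token(eq("tag", "noun"), absent("prefdefinite")))
```

The resulting strings can be pasted into the CQL field of the concordance search.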

Tokens, lemmas, prefixes and suffixes were transliterated according to the following key (Hebrew letter = Latin character):
א = a, ב = b, ג = g, ד = d, ה = h, ו = w, ז = z, ח = x, ט = v, י = i, כ = k
ל = l, מ = m, נ = n, ס = S, ע = y, פ = p, צ = c, ק = q, ר = r, ש = e, ת = T
Whereas the token and lemma attributes are available in both the Hebrew alphabet and romanised transliteration, the prefix string and the suffix string appear only in transliteration.
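The key lends itself to a simple character-for-character mapping. The following Python sketch is an illustration only, not the tool used to build the corpus; the function names are hypothetical. The key lists 22 letters and no final letter forms (such as ם or ן), so this sketch passes any character outside the key through unchanged, which is an assumption rather than documented behaviour.

```python
# The 22 letters of the key, in alphabetical order, and their Latin counterparts.
HEBREW = "אבגדהוזחטיכלמנסעפצקרשת"
LATIN = "abgdhwzxviklmnSypcqreT"

TO_LATIN = str.maketrans(HEBREW, LATIN)
TO_HEBREW = str.maketrans(LATIN, HEBREW)


def transliterate(text):
    """Replace each Hebrew letter of the key with its Latin character."""
    return text.translate(TO_LATIN)


def to_hebrew(text):
    """Inverse mapping; case matters because the key uses S and T in upper case."""
    return text.translate(TO_HEBREW)


print(transliterate("בית"))  # biT
print(to_hebrew("biT"))
```

A transliterated form produced this way could then be used in a query on the trans or transl attribute.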

Full name | Attribute | Values
Word | word |
Transliteration of the word | trans |
Lemma | lemma |
Transliteration of the lemma | transl |
Part of speech | tag | adjective, adverb, conjunction, copula, existential, foreign, interjection, interrogative, modal, negation, noun, numberExpression, numeral, participle, preposition, pronoun, properName, punctuation, quantifier, title, url, verb, wPrefix
Part of speech type | postype | amount and arithmetic-operation, bracket-end, bracket-start, colon, comma, coordinating, demonstrative, determiner, dot, exclamation-mark, gematria, hyphen, impersonal, literal-number, numeral-cardinal, numeral-fractional, numeral-ordinal or other, partitive, personal, proadverb, prodet, pronoun, question-mark, quote, reflexive, relativizing, semicolon, slash, subordinating, yesno
Prefix string | prestring | ‫ב בכ ו וב ובכ וכ וכש וכשל ול ומ ומכ ומש‬ ‫וש ושב ושל ושמ כ ככ כש כשב כשל כשמ‬ ‫ל לכ לכש מ מכ מש משב משכ משל משמ‬ ‫ש שב שכ שכש שכשמ של שמ שמש‬
Base string | basestring |
Suffix string | sufstring | ‫גם ה הם הן ו י ך כם כן ם ן נו‬
Gender | gender | feminine, masculine, masculine-and-feminine
Number | number | dual, dual-and-plural, plural, singular, singular-and-plural
Status | status | absolute, construct
Polarity | polarity | negative, positive
Person | person | 1, 2, 3, any
Tense | tense | beinoni, future, imperative, infinitive, past
Binyan | binyan | Hifil, Hitpael, Hufal, Nifal, Paal, Piel, Pual
Prefix conjunction | prefconj | conjunction
Prefix definite article | prefdefinite | definiteArticle
Prefix interrogative | prefinterrog |
Prefix preposition | prefprep | preposition
Prefix subordination conjunction / relativizer | relativizer | relativizer / subordinatingConjunction
Prefix temporal subordinating conjunction | preftemp | temporalSubConj
Prefix adverb | prefadv | adverb
Suffix function | suffunction | accusative-or-nominative, possessive, pronomial
Suffix number | sufnum | plural, singular
Suffix gender | sufgender | feminine, masculine, masculine-and-feminine
Suffix person | sufper | 1, 2, 3
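
When processing exported annotations, it can help to treat the tagset as data. The Python sketch below is an illustration under stated assumptions: it assumes each token is available as a dictionary of attribute names and values, it copies the value lists of only a few attributes from the table above, and the function names and the sample token are hypothetical, invented for the example.

```python
NULL = "NULL"  # value of irrelevant or missing attributes

# Closed value sets copied from the tagset table (a subset of the attributes).
TAGSET = {
    "gender": {"feminine", "masculine", "masculine-and-feminine"},
    "number": {"dual", "dual-and-plural", "plural", "singular", "singular-and-plural"},
    "status": {"absolute", "construct"},
    "tense": {"beinoni", "future", "imperative", "infinitive", "past"},
    "binyan": {"Hifil", "Hitpael", "Hufal", "Nifal", "Paal", "Piel", "Pual"},
}


def relevant(token):
    """Keep only the attributes that carry a real value (drop NULL)."""
    return {attr: value for attr, value in token.items() if value != NULL}


def unexpected(token):
    """Return non-NULL attributes whose value is not listed in TAGSET."""
    return {
        attr: value
        for attr, value in relevant(token).items()
        if attr in TAGSET and value not in TAGSET[attr]
    }


# Hypothetical annotation of a single verb token, invented for illustration.
example = {
    "tag": "verb",
    "gender": "masculine",
    "number": "singular",
    "tense": "past",
    "binyan": "Paal",
    "status": NULL,
    "prefdefinite": NULL,
}

print(relevant(example))
print(unexpected(example))
```

Attributes that are not listed in TAGSET, such as tag here, are passed through without any check.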

Tools to work with the Hebrew Web corpus

A complete set of Sketch Engine tools is available to work with this Hebrew corpus to generate:

  • word sketch – Hebrew collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Hebrew nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context

Changelog

version 1 (January 2015)

  • PoS tagged & annotated with morphological categories

initial version (September 2014)

  • Crawled by SpiderLing in August 2014
  • 1.061 billion tokens
