This tagged corpus contains almost 60 million words crawled from the Hindi Internet. The lexicon contains more than 1.4 million words. The corpus was prepared using the Corpus Factory method; for full details, see Kilgarriff et al., LREC 2010.

Special positional attributes

  • cpos – coarse POS tag that is not derived from the tag attribute; see section 4.1 of the tagset description (below)
  • hlemma/hword (heuristic) – forms in which all the vowels are stripped and only the consonants remain. Most spelling variations are due to the use of different vowels, so hlemma and hword come in handy for finding similarly spelt words, e.g. ka (क) + e -> ki की
  • Tags with the suffix “:?” mark words that cannot be classified into the target tag linguistically but had to be classified into it because of the context.
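
The vowel stripping behind hlemma/hword can be illustrated with a minimal Python sketch. The exact rules used to build these attributes are not documented here, so the function name strip_vowels and the choice of Unicode ranges (independent vowels and dependent vowel signs only) are assumptions made for illustration, not the corpus's actual procedure.

```python
# Hypothetical sketch of the hlemma/hword idea: drop Devanagari vowels and
# keep everything else. The ranges below are an assumption, not the rules
# actually used to build the corpus attributes.

# Independent vowel letters, U+0904 .. U+0914 (e.g. अ, आ, इ, ... औ)
INDEPENDENT_VOWELS = range(0x0904, 0x0915)
# Dependent vowel signs (matras), U+093E .. U+094C (e.g. ा, ि, ी, ... ौ)
DEPENDENT_VOWEL_SIGNS = range(0x093E, 0x094D)


def strip_vowels(word: str) -> str:
    """Return the word with Devanagari vowels and vowel signs removed."""
    return "".join(
        ch
        for ch in word
        if ord(ch) not in INDEPENDENT_VOWELS
        and ord(ch) not in DEPENDENT_VOWEL_SIGNS
    )


if __name__ == "__main__":
    # का, की and के differ only in the vowel sign, so they share one skeleton.
    for form in ("\u0915\u093e", "\u0915\u0940", "\u0915\u0947"):
        print(form, "->", strip_vowels(form))
```

Under these assumptions, forms that differ only in their vowels collapse to the same consonant skeleton, which is what makes searching on hlemma or hword useful for finding spelling variants.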

Changelog

v3.0 (17th Jan 2012)

We re-collected the Hindi Web Corpus in 2011. The corpus size is 65 million tokens.

The corpus is tagged using a new POS tagger (91.31% accuracy), a lemmatizer and a morphological analyser, downloaded from http://sivareddy.in/downloads

The tagset details are described in the POS guidelines for Indian languages (retrieved from the Wayback Machine at http://ltrc.iiit.ac.in/tr031/posguidelines.pdf)

The Sketch Grammar has been revised with new rules that make use of post-position markers, which are crucial in Hindi dependency parsing. More rules are to be added. We invite collaboration from interested parties.

v2.0 (6th Jan 2012)

The corpus is tagged using the POS tagger downloaded from http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php.

The tagset details are described in the POS guidelines for Indian languages (retrieved from the Wayback Machine at http://ltrc.iiit.ac.in/tr031/posguidelines.pdf)

We wrote a simple sketch grammar for Hindi and generated the first word sketches for Hindi. If you would like to contribute, please contact us.