HindiWaC: Hindi Web as Corpus

The HindiWaC corpus is a web corpus of the Hindi language. It contains more than 100 million words crawled from the Hindi Internet during 2012.

Texts in the corpus are lemmatized and morphologically tagged. The corpus has a word sketch grammar that enables users to explore the grammatical and collocational behavior of Hindi words. The whole process of corpus preparation is described in the Corpus Factory method document (Kilgarriff et al. at LREC 2010).

See the Hindi part-of-speech tagset for a description of the POS tags used in the corpus.

Special positional attributes in the 3rd version of the corpus

  • cpos – coarse POS tag that is not derived from the attribute tag; see section 4.1 of the tagset description (below)

Attributes only in the 3rd version of the corpus

  • hlemma/hword (heuristic) – attributes in which all the vowels are stripped so that only the consonants remain. Most spelling variation is due to the use of different vowels, so hlemma and hword come in handy for finding similarly spelt words, e.g. ka (क) + e -> ki की
  • Tags with the suffix “:?” mark words that could not be assigned to the target tag on linguistic grounds but had to be classified that way because of the context
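
The vowel-stripping idea behind hlemma/hword can be illustrated with a short Python sketch. The exact rules used to build these attributes are not given here, so everything below is an assumption made for illustration: the function name consonant_skeleton is hypothetical, and the choice to strip independent vowels, dependent vowel signs (matras) and the virama, while leaving nasalization marks untouched, is only one plausible reading of "all the vowels are stripped".

```python
# Illustrative sketch only: NOT the actual procedure used to build
# hlemma/hword in HindiWaC. The character sets below are assumptions.

# Independent Devanagari vowel letters, U+0904..U+0914 (up to औ).
INDEPENDENT_VOWELS = {chr(c) for c in range(0x0904, 0x0915)}

# Dependent vowel signs (matras), U+093E..U+094C (ा up to ौ).
VOWEL_SIGNS = {chr(c) for c in range(0x093E, 0x094D)}

# Virama (halant), which suppresses the inherent vowel. Stripping it is
# an assumption; it keeps the result a plain consonant sequence.
VIRAMA = "\u094D"

STRIPPED = INDEPENDENT_VOWELS | VOWEL_SIGNS | {VIRAMA}


def consonant_skeleton(word: str) -> str:
    """Return the word with the vowel characters above removed.

    Hypothetical helper: anusvara, candrabindu, nukta and all
    non-Devanagari characters are left as they are.
    """
    return "".join(ch for ch in word if ch not in STRIPPED)


if __name__ == "__main__":
    # का and की differ only in the vowel sign, so both reduce to क.
    print(consonant_skeleton("का") == consonant_skeleton("की"))
    # Two spellings of the same word that differ in vowel length.
    print(consonant_skeleton("दुकान") == consonant_skeleton("दूकान"))
```

Under these assumptions, two spellings that differ only in their vowels map to the same key, which is the property that makes such an attribute useful for finding similarly spelt words.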

Changelog

v4.0 (10th Feb 2017)

  • size 107 million words
  • improved sketch grammar
  • removed special positional attributes: hlemma and hword

v3.0 (17th Jan 2012)

v1.0 (Dec 2009)

  • initial size 27 million words
  • created by Siva Reddy
  • no part-of-speech tagging

Bibliography

Eragani, A. K., Kuchibhotla, V., Sharma, D. M., Reddy, S., & Kilgarriff, A. (2014). Hindi Word Sketches. In Proceedings of the 11th International Conference on Natural Language Processing (ICON).

Search the Hindi WaC corpus

Sketch Engine offers a range of tools to work with the Hindi WaC corpus.
